Management of signals sent by Slurm¶
Slurm can be asked to send a signal to a job a few seconds or minutes before its time limit (TimeLimit). The job can intercept this signal and use it, for example, to trigger the generation of a recovery file or to submit a new job.
Signals under Linux¶
Essentially, signals are used by the Linux kernel to warn processes of events (illegal instruction, invalid memory addressing, etc.). Some signals are interceptable by the recipient process in order to execute an associated action. Other signals kill the receiving process.
Signals can also be used between processes to communicate.
Example of a classic signal: pressing CTRL + C on the keyboard sends a SIGINT signal to the process running in the foreground of the terminal.
Two signals are defined at the system level for free use by users: SIGUSR1 and SIGUSR2.
The kill command sends a signal to a process.
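As a minimal sketch of the kill command in action, the following bash script starts a background process that intercepts SIGUSR1, then sends it that signal. The child process, its exit status 42 and the variable names are illustrative assumptions.

```shell
#!/bin/bash
# Illustrative child: installs a handler for SIGUSR1, then loops until signalled.
bash -c 'trap "exit 42" USR1; while true; do sleep 0.1; done' &
child_pid=$!

sleep 0.5                    # give the child time to install its handler
kill -USR1 "$child_pid"      # equivalent forms: kill -s USR1 PID, kill -SIGUSR1 PID

child_status=0
wait "$child_pid" || child_status=$?    # collect the child's exit status
echo "child exited with status $child_status"
```

The exit status chosen in the handler is what the parent gets back from wait, which is one simple way for a process to report that it stopped because of a signal.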
Signal management in Slurm¶
By default, when a job reaches its time limit, Slurm sends a SIGTERM signal to all of its processes to end the computation. The processes then have about 30s to finish (the KillWait configuration parameter). Any process still running after this delay receives a SIGKILL signal. SIGTERM can be intercepted, whereas SIGKILL cannot: it means an immediate and certain death.
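The difference between the two signals can be observed outside Slurm with plain bash. This is only a sketch: the stubborn child process and the delays are assumptions chosen for the demonstration.

```shell
#!/bin/bash
# Illustrative stubborn process: ignores SIGTERM and loops forever.
bash -c 'trap "" TERM; while true; do sleep 0.1; done' &
pid=$!
sleep 0.5                        # let the child set up its trap

kill -TERM "$pid"                # first warning, like Slurm at the time limit
sleep 0.5
if kill -0 "$pid" 2>/dev/null; then survived_term=yes; else survived_term=no; fi

kill -KILL "$pid"                # cannot be intercepted or ignored
kill_status=0
wait "$pid" 2>/dev/null || kill_status=$?
echo "survived SIGTERM: $survived_term, status after SIGKILL: $kill_status"
```

A process ended by a signal is reported by bash with the status 128 plus the signal number, so a SIGKILL (signal 9) shows up as 137.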
In addition, Slurm can be asked to send a chosen signal a given time before the time limit is reached. The process can then intercept it and perform tasks such as generating restart files, cleaning up data or resubmitting the job.
Example of use¶
The example script is organized in four blocks:
- a block of Slurm directives (#SBATCH lines)
- a block that declares the handler function and associates it with the signal
- a compute block, with an & at the end of the srun command
- an end-of-script block, with a wait followed by the copy of the files
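The script itself is not reproduced here. The sketch below is a reconstruction from the description on this page, not the original: the handler name sig_handler_USR1, the exit codes 2 and 0 and the 120s delay come from the text, while the file name job_signal.sh, the program my_program, the files restart.dat and results.dat, the SAVE_DIR directory and the resource values are hypothetical.

```shell
#!/bin/bash
# Write the sketch to a file (hypothetical name) so it can be checked, then submitted.
cat > job_signal.sh <<'EOF'
#!/bin/bash
# --- Block 1: Slurm directives ---
#SBATCH --job-name=signal_demo
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --signal=B:SIGUSR1@120

SAVE_DIR="$HOME/signal_demo_results"    # hypothetical destination directory
mkdir -p "$SAVE_DIR"

# --- Block 2: handler declaration and association with the signal ---
sig_handler_USR1() {
    echo "SIGUSR1 received: about 120s left before the time limit"
    cp restart.dat "$SAVE_DIR/"         # hypothetical recovery file
    exit 2
}
trap 'sig_handler_USR1' SIGUSR1

# --- Block 3: computation, in the background so that bash can react to signals ---
srun ./my_program &

# --- Block 4: wait for the computation, then copy the files ---
wait
cp results.dat "$SAVE_DIR/"             # hypothetical result file
exit 0
EOF

bash -n job_signal.sh    # syntax check only; on the cluster: sbatch job_signal.sh
```

The B: prefix is used because the handler lives in the submission script, so that is the process that must receive the signal.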
Advanced information¶
Parameter --signal¶
Format: --signal=[B:]signal[@duration]
- If no duration is specified, the signal is sent 60s before the time limit.
- If "B:" is not specified, the signal is sent to all processes except the submission script. If it is specified, the signal is sent to the submission script only.
- Signals can be specified by number or name (e.g. 10 or SIGUSR1). To see the list of signals available on Myria, type the command man 7 signal on a front-end node.
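A few variants of the directive, with a quick way to check the number behind a signal name, can be sketched as follows. The delays are arbitrary examples, and the lookup uses the kill builtin of bash.

```shell
#!/bin/bash
# Possible directives (use one per script), shown here as comments:
#   #SBATCH --signal=SIGUSR1          all processes except the submission script, 60s before the limit
#   #SBATCH --signal=SIGUSR1@300      same recipients, 300s before the limit
#   #SBATCH --signal=B:SIGUSR1@120    submission script only, 120s before the limit

# Look up the number behind a signal name on the current machine, and back.
usr1_number=$(kill -l USR1)
usr1_name=$(kill -l "$usr1_number")
echo "SIGUSR1 is signal $usr1_number here (name for $usr1_number: $usr1_name)"
```

Using the name rather than the number in the directive avoids depending on the numbering of a particular architecture.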
You may prefer the signal to be delivered to your C or Fortran code rather than to the submission script, so that the program itself performs a specific action (e.g. generating a recovery file). In that case, do not use the B: prefix.
trap command¶
The trap command associates a command or function with a signal: when the submission script receives this signal, its normal execution is interrupted and the associated command or function is run.
In the example script, the command trap 'sig_handler_USR1' SIGUSR1 makes the submission script run the sig_handler_USR1 function when it receives the SIGUSR1 signal.
A different signal can be chosen, and it can be given by number instead of name (SIGUSR1 is 10 on x86 and ARM Linux, but 30 or 16 on some other architectures).
The function can be replaced by a command.
Several signals can be intercepted in the same script by adding more trap lines.
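The two forms of handler and the use of several trap lines can be tried in plain bash. In this sketch the shell sends the signals to itself; the names on_usr1 and received are illustrative.

```shell
#!/bin/bash
# Requires bash 4 or later for BASHPID (the PID of the current shell).
received=""

on_usr1() { received="$received USR1"; }     # handler written as a function
trap 'on_usr1' USR1
trap 'received="$received USR2"' USR2        # handler written as an inline command

kill -USR1 "$BASHPID"    # the shell signals itself; the handler runs once kill returns
kill -USR2 "$BASHPID"
echo "handled:$received"
```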
wait command¶
When the submission script ends, the job ends and SIGKILL signals are sent to any of its processes that are still running.
The wait command, as its name implies, waits until the child processes (srun ... &) have finished before the script continues with the following lines.
If the srun commands are not run in the background, the bash script cannot react to the signals in time: bash runs a trap handler only after the current foreground command has finished.
If the job contains several steps (srun commands) to be run one after another, alternate srun ... & and wait so that each step finishes before the next one starts.
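The effect of running the workload in the background and waiting for it can be observed without Slurm. In this sketch, sleep 4 stands in for srun ./my_program and a small subshell stands in for Slurm sending the signal after one second; both are assumptions made for the demonstration.

```shell
#!/bin/bash
# Requires bash 4 or later for BASHPID.
main_pid=$BASHPID
handled_at=""
trap 'handled_at=$SECONDS' USR1

SECONDS=0
( sleep 1; kill -USR1 "$main_pid" ) &    # stand-in for Slurm sending the signal
sleep 4 &                                # stand-in for: srun ./my_program &
work_pid=$!

wait_status=0
wait "$work_pid" || wait_status=$?       # interrupted by SIGUSR1: status above 128
echo "handler ran after ${handled_at}s, wait returned $wait_status"

kill "$work_pid" 2>/dev/null || true     # clean up the stand-in workload
wait || true
```

The handler runs well before the 4s workload is over. With the workload in the foreground, bash would only run it once the workload had finished.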
SIGTERM signal¶
As mentioned above, Slurm sends a SIGTERM signal when the time limit is reached. This signal can be intercepted as well: add a trap line for SIGTERM and associate it with a function.
If you also use the --signal option, end the function associated with that earlier signal with a wait rather than an exit, so that the script keeps waiting for the end of the job: if the submission script ends, its children (the srun commands) are killed. In the example, this corresponds to replacing the exit 2 command with the wait command.
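A handler that goes back to waiting instead of exiting can be sketched as follows. As before, sleep 2 stands in for the srun command and a subshell stands in for Slurm's early signal; the names sig_handler_TERM, log and work_status are illustrative, and only the early signal is actually sent in this demonstration.

```shell
#!/bin/bash
# Requires bash 4 or later for BASHPID.
main_pid=$BASHPID
log=""
work_status=""

sleep 2 &                                 # stand-in for: srun ./my_program &
work_pid=$!

sig_handler_USR1() {
    log="$log usr1"                       # early warning: e.g. request a restart file
    work_status=0
    wait "$work_pid" || work_status=$?    # keep waiting instead of: exit 2
}
sig_handler_TERM() {
    log="$log term"                       # time limit reached: last-minute cleanup
}
trap 'sig_handler_USR1' USR1
trap 'sig_handler_TERM' TERM

( sleep 0.5; kill -USR1 "$main_pid" ) &   # stand-in for Slurm's early signal
wait "$work_pid" || true                  # interrupted by SIGUSR1; the handler resumes waiting
echo "signals handled:$log, workload status: $work_status"
```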
scancel command¶
The scancel command behaves like the kill command: it does not "kill" as such, it sends a signal to a job.
By default it sends a succession of 3 signals: SIGCONT, then SIGTERM and, about 30s later, SIGKILL.
Another signal can be sent with the --signal parameter of scancel. For the signal to be sent to the submission script only, add the -b or --batch option.
Example: scancel --batch --signal=SIGUSR1 job_number
Exit code¶
If the sig_handler_USR1 function ends with an exit code (e.g. exit 2) different from the one at the end of the script (e.g. exit 0), the exit code records how your job ended:
- 0 → normal end of script,
- 2 → end of script triggered by the signal sent 120s before the time limit.
To find the exit codes of a previous job, use the command sacct -j job_id. The "ExitCode" column uses the format Error_Code:Received_Signal. The submission script corresponds to the "batch" line.
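Splitting an ExitCode value into its two parts can be sketched in bash as follows. The value 2:0 is a hypothetical example, not real job output, and the interpretation of the codes 0 and 2 follows the convention described above.

```shell
#!/bin/bash
# On the cluster, the value would come from a command such as:
#   sacct -j job_id --format=JobID,JobName,State,ExitCode
exit_field="2:0"                  # hypothetical ExitCode value of the "batch" line

error_code=${exit_field%%:*}      # part before the colon: exit code of the script
signal=${exit_field##*:}          # part after the colon: signal received, if any

case "$error_code" in
    0) meaning="normal end of script" ;;
    2) meaning="ended by the handler after the early signal" ;;
    *) meaning="other exit code" ;;
esac
echo "exit code $error_code, signal $signal: $meaning"
```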
References¶
- Slurm sbatch command options: https://slurm.schedmd.com/sbatch.html
- Slurm scancel command options: https://slurm.schedmd.com/scancel.html