Management of signals sent by Slurm
Slurm can be configured to send a signal to a job a few seconds or minutes before its time limit (TimeLimit) is reached. This signal can be intercepted and used, for example, to trigger the generation of a recovery file or to submit a new job.
Signals under Linux
Signals are mainly used by the Linux kernel to notify processes of events (illegal instruction, invalid memory addressing, etc.). Some signals can be intercepted by the receiving process, which then executes an associated action. Other signals kill the receiving process.
Signals can also be used between processes to communicate.
Example of a classic signal: pressing CTRL + C in a terminal sends a SIGINT signal to the foreground process.
A series of signals are defined at the system level for use by users. These are the SIGUSR1 and SIGUSR2 signals.
The command to send a signal to a process is the kill command.
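As a minimal sketch of sending a signal from the shell, assuming bash on Linux: the example below starts a stand-in process (a plain sleep, chosen only for illustration), sends it SIGTERM and reads back its exit status. For a process ended by a signal, bash reports 128 plus the signal number.

```shell
#!/bin/bash
# Start a long-running process in the background (stand-in for a real program).
sleep 30 &
pid=$!

# Send SIGTERM to that process; kill only delivers the signal.
kill -TERM "$pid"

# Collect the exit status of the process.
status=0
wait "$pid" || status=$?

# SIGTERM is signal 15, so bash reports 128 + 15 = 143.
echo "exit status: $status"
```

The process here does not intercept SIGTERM, so the default action (termination) applies.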
Signal management in Slurm
By default, when the time limit of a computation is reached, Slurm sends a SIGTERM signal to all processes to kill them and end the computation. The processes then have about 30s to finish. After this time, any remaining processes receive a SIGKILL signal. The SIGTERM signal can be intercepted, unlike the SIGKILL signal, which cannot be intercepted and kills the process immediately.
In addition, Slurm can be asked to send a chosen signal a given time before the time limit is reached. The process can then intercept it and perform tasks such as generating recovery files, cleaning up data or resubmitting jobs.
Example of use
The following script is organized in several blocks:
- block of Slurm directives (#SBATCH lines)
- block declaring the function and associating it with the signal
- code block with an srun ... & command
- end-of-script block with a wait and then the copy of the files
Notes on the sbatch --signal option, whose general form is --signal=[{R|B}:]<sig_num>[@sig_time]:
- If no duration is specified, the signal is sent 60s before the end of the time limit.
- If "B" is not specified, the signal is sent to all processes except the submission script. If it is specified, the signal is sent to the submission script only.
- Signals can be specified by number or name (e.g. 10 or SIGUSR1). To see the list of signals available on Myria, type the command man 7 signal from a front end.
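The blocks and options described above can be sketched as follows. This is an illustrative reconstruction, not the original script: the job name, the time limit, the ./my_program executable and the 120s delay are assumptions, and the sleep fallback is only a stand-in so that the sketch can also be run outside Slurm.

```shell
#!/bin/bash
# Block 1: Slurm directives (hypothetical values).
#SBATCH --job-name=signal_demo
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
# Send SIGUSR1 to the submission script only (B:), 120 s before the time limit.
#SBATCH --signal=B:USR1@120

# Block 2: handler function and its association with the signal.
sig_handler_USR1() {
    echo "SIGUSR1 received: the time limit is close"
    # Real work goes here, e.g. write a recovery file or resubmit the job.
    exit 2
}
trap 'sig_handler_USR1' SIGUSR1

# Block 3: computation started in the background so that bash can run the trap.
if command -v srun >/dev/null 2>&1; then
    srun ./my_program &     # hypothetical executable
else
    sleep 1 &               # stand-in when the sketch is run outside Slurm
fi

# Block 4: wait for the background step, then copy the result files.
wait
echo "computation finished, copying result files"
```

If the handler never runs, the script ends normally after the copy step; if SIGUSR1 arrives first, the handler ends the script with exit code 2.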
You may prefer the signal to be sent to your C or Fortran code rather than to the submission script, so that the code itself performs a specific action (e.g. generation of a recovery file).
The trap command associates a command or function with a signal: if the submission script receives this signal, its execution is interrupted to run the associated command or function.
In the example above, the trap 'sig_handler_USR1' SIGUSR1 command defines the execution of the sig_handler_USR1 function when the SIGUSR1 signal is received by the submission script.
Another signal name can be used, or the signal can be given as a numerical value (e.g. SIGUSR1 corresponds to 10, 30 or 16 depending on the architecture; it is 10 on x86).
The function can be replaced by a command.
It is possible to declare several interceptions of different signals in the same script by adding more trap lines.
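A minimal sketch of several interceptions in one script, assuming bash: the handler names on_usr1 and on_term are hypothetical, and the script sends the signals to itself only to exercise the handlers.

```shell
#!/bin/bash
received=""

# Hypothetical handlers: each one records that it ran.
on_usr1() { received="$received USR1"; }
on_term() { received="$received TERM"; }

# One trap line per signal to intercept.
trap 'on_usr1' SIGUSR1
trap 'on_term' SIGTERM

# Send both signals to the current shell ($$) to exercise the handlers.
kill -USR1 $$
kill -TERM $$

echo "handlers that ran:$received"
```

Because SIGTERM is trapped here, the shell is not terminated by it and simply runs on_term.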
The end of the submission script triggers the end of the job, and SIGKILL signals are sent to any processes still running.
The wait command, as its name implies, waits until the child processes (srun ... &) have finished before continuing with the following lines.
If the srun commands are not executed in the background, the bash script will not be able to intercept the signals in time: bash only runs a trap once the current foreground command has finished.
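The effect of a foreground command on signal handling can be observed with a small local experiment, sketched here under the assumption of bash on Linux. The sleep commands are stand-ins: one for a foreground srun, the other for Slurm sending the signal after 1s.

```shell
#!/bin/bash
start=$SECONDS
handled_after=-1

# Record how long after the start the handler actually ran.
on_usr1() { handled_after=$((SECONDS - start)); }
trap 'on_usr1' SIGUSR1

# Stand-in for Slurm: send SIGUSR1 to this shell after 1 s.
( sleep 1; kill -USR1 $$ ) &

# Foreground command (stand-in for "srun ..." without "&"):
# bash does not run the trap until this command has finished.
sleep 3

wait
echo "signal sent after 1s, handler ran after ${handled_after}s"
```

The handler only runs once the 3s foreground command is over, which is why a job step started without & delays the reaction to the signal.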
If the job contains several steps (srun commands) to be run successively, alternate srun ... & and wait so that each step finishes before the next one starts.
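The alternation of background steps and wait can be sketched as follows. The run_step function is a hypothetical stand-in for a real job step (for example an srun command), and the temporary log file only serves to show the order of completion.

```shell
#!/bin/bash
log=$(mktemp)

# Hypothetical stand-in for one job step; in a real job this would be an srun command.
run_step() {
    sleep 1
    echo "step $1 done" >> "$log"
}

run_step 1 &    # first step in the background ...
wait            # ... and wait for it before starting the next one
run_step 2 &
wait

cat "$log"
```

Each wait keeps the script responsive to trapped signals while guaranteeing that the steps run one after the other.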
As mentioned above, Slurm sends a SIGTERM signal when the time limit is reached. This signal can be intercepted as well: just add a trap line for this signal and associate it with a function.
If you have also used the --signal option, go back to waiting at the end of the associated function so that the job can run to its end, because if the submission script ends, it kills its children (srun commands). This corresponds to replacing the exit 2 command in the example with the wait command.
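A minimal local sketch of a handler that goes back to waiting instead of exiting, assuming bash: the sleep commands are stand-ins for the background srun step and for Slurm sending the early signal, and the variable names are illustrative.

```shell
#!/bin/bash
handler_ran=0

sig_handler_USR1() {
    handler_ran=1
    echo "SIGUSR1 received: writing recovery file"
    # Go back to waiting instead of "exit 2", so the children keep running.
    wait
}
trap 'sig_handler_USR1' SIGUSR1

sleep 3 &                        # stand-in for "srun ./my_program &"
work_pid=$!
( sleep 1; kill -USR1 $$ ) &     # stand-in for Slurm sending the early signal

# This wait is interrupted by SIGUSR1; the handler then waits again.
wait || true

# Check whether the background work is still running once the script resumes.
if kill -0 "$work_pid" 2>/dev/null; then
    still_running=1
else
    still_running=0
fi
echo "handler ran: $handler_ran, work still running: $still_running"
```

With the wait inside the handler, the script only continues once the background work is over, so its children are not killed prematurely.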
The scancel command behaves like the kill command: it doesn't "kill", it sends a signal to a computation.
By default, it sends a succession of 3 signals: SIGCONT, SIGTERM and, about 30s later, SIGKILL.
Another signal can be sent with the scancel command, using the --signal parameter. For the signal to be sent only to the submission script, the --batch option must be added.
scancel --batch --signal=SIGUSR1 job_number
If the sig_handler_USR1 function uses an exit code (e.g. exit 2) different from the one at the end of the script (e.g. exit 0), you can tell how your job ended:
- 0 → normal end of script,
- 2 → end of script triggered by the signal sent 120s before the time limit
To find the exit codes of a previous job, use the command sacct -j job_id. The "ExitCode" column uses the format Exit_code:Received_Signal. The submission script corresponds to the "batch" line.
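As an illustration of reading this column, the sketch below splits an ExitCode value into its two parts, assuming bash. The value 2:0 is a hypothetical example, not real sacct output, and the messages are illustrative.

```shell
#!/bin/bash
# Hypothetical value copied from the ExitCode column of the "batch" line,
# e.g. as printed by: sacct -j job_id --format=JobID,ExitCode
exit_field="2:0"

# Split "exit_code:signal" on the colon.
IFS=: read -r exit_code signal <<< "$exit_field"

case "$exit_code" in
    0) reason="normal end of script" ;;
    2) reason="ended by the handler after the early signal" ;;
    *) reason="other exit code" ;;
esac
echo "exit code $exit_code, signal $signal: $reason"
```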
sbatch command options: https://slurm.schedmd.com/sbatch.html
scancel command options: https://slurm.schedmd.com/scancel.html