
Management of signals sent by Slurm

It is possible to set Slurm to send a signal to the job a few seconds or minutes before the time limit (TimeLimit). This signal is interceptable and can be used, for example, to send the order to generate a recovery file or to submit a new job.

Signals under Linux

Essentially, signals are used by the Linux kernel to warn processes of events (illegal instruction, invalid memory addressing, etc.). Some signals are interceptable by the recipient process in order to execute an associated action. Other signals kill the receiving process.

Signals can also be used between processes to communicate.

Example of a classic signal: pressing CTRL + C in a terminal sends a SIGINT signal to the foreground process.

Two signals, SIGUSR1 and SIGUSR2, are defined at the system level and reserved for use by users.
The command to send a signal to a process is the kill command.
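
As a minimal sketch of how kill delivers a signal (plain bash, nothing Slurm-specific; the function names demo_kill_usr1 and child are invented for this illustration), a parent process can start a child that intercepts SIGUSR1 and then signal it:

```shell
#!/bin/bash
# Illustration only: demo_kill_usr1 and child are hypothetical names.
demo_kill_usr1() {
    child() {
        # On SIGUSR1: report, stop the sleeper, exit cleanly.
        trap 'echo "child: got SIGUSR1"; kill "$sleeper"; exit 0' USR1
        sleep 30 &
        sleeper=$!
        wait "$sleeper"
    }
    child &                  # runs in a background subshell
    local pid=$!
    sleep 1                  # leave the child time to install its trap
    kill -USR1 "$pid"        # without -USR1, kill would send SIGTERM
    wait "$pid"
}
demo_kill_usr1
```

Running it prints "child: got SIGUSR1": despite its name, kill only sends the signal, and the receiving process decides what to do with it.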

Signal management in Slurm

By default, when the time limit of a computation is reached, Slurm sends a SIGTERM signal to all processes to kill them and end the computation. The processes then have about 30s to finish. After this time, if there are still processes, they receive a SIGKILL signal. The SIGTERM signal is an interceptable signal, contrary to the SIGKILL signal which corresponds to an immediate and sure death.
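
The difference between the two signals can be reproduced outside Slurm. The sketch below is only an illustration (the names demo_term_then_kill and stubborn are invented, and the delays are much shorter than Slurm's): a process intercepts SIGTERM and keeps running, then is ended by SIGKILL, which it cannot intercept.

```shell
#!/bin/bash
# Illustration only: demo_term_then_kill and stubborn are hypothetical names.
demo_term_then_kill() {
    stubborn() {
        # SIGTERM is intercepted: the process reports it and keeps running.
        trap 'echo "SIGTERM intercepted, still running"' TERM
        while true; do sleep 1; done
    }
    stubborn &
    local pid=$! rc=0
    sleep 1
    kill -TERM "$pid"        # can be trapped by the process
    sleep 2
    kill -KILL "$pid"        # cannot be trapped: the process dies
    wait "$pid" || rc=$?
    echo "exit status: $rc"  # 128 + signal number 9
}
demo_term_then_kill
```

On standard output this prints "SIGTERM intercepted, still running" followed by "exit status: 137". Bash may also print a "Killed" notice on standard error.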

In addition, it is possible to ask Slurm to send a chosen signal a certain time before the time limit is reached. The process can then intercept it and perform tasks such as generating recovery files, cleaning up data or resubmitting jobs.

Example of use

The following script is in several blocks:

  • block of SLURM commands (lines #SBATCH)
  • block of function declaration and association with the signal
  • code block with a & on the srun command
  • end of script block with a wait and then the copy of the files
#!/bin/bash

#SBATCH --mem-per-cpu=3000
#SBATCH -n 1
#SBATCH -t 0:5:00
#SBATCH -p debug
# asks SLURM to send the SIGUSR1 signal 120 seconds before end of the time limit
#SBATCH --signal=B:SIGUSR1@120


## Handle function ##
# Function executed 120s before the end of the time limit
function sig_handler_USR1()
{
        echo "   function sig_handler_USR1 called"
        echo "   Signal trapped -  $(date)"
        # Do what you want here:
        #    save data ...
        #    cleanup ...
        #    requeue job ...
        #    send signal to MPI job ...
        exit 2
}

## Handle function association ##
# associate the function "sig_handler_USR1" with the USR1 signal
trap 'sig_handler_USR1' SIGUSR1


## Job script ##
cd "$LOCAL_WORK_DIR"
srun sleep 40000 &


# Let's wait for signals or end of all background commands
wait

# This is the place to move your data files to your home-dir, as in a normal job
mkdir -p "$SLURM_SUBMIT_DIR/$SLURM_JOB_ID"
mv *.log *.dat "$SLURM_SUBMIT_DIR/$SLURM_JOB_ID"

exit 0

Advanced information

Parameter --signal

Format: --signal=[B:]signal[@duration]

  • If no duration is specified, a signal will be sent 60s before the end of the time limit.
  • If "B" is not specified, then the signal is sent to all processes except to the startup script. If it is specified, the signal is sent to the start script only.
  • Signals can be specified by number or name (e.g. 10 or SIGUSR1). To see the list of signals available on Myria, from a front end, type the command man 7 signal.
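
The short sketch below illustrates the format described above. It assumes bash, whose builtin kill -l translates between signal names and numbers. The directive variants are shown as comments only, since a job script should contain a single --signal line. The durations 300 and 120 are arbitrary examples.

```shell
#!/bin/bash
# Look up signal names and numbers with the bash builtin "kill -l".
kill -l USR1    # prints the number of SIGUSR1 on this machine (10 on x86 Linux)
kill -l 10      # prints the name of signal number 10 on this machine

# Possible forms of the directive (use one per job script):
#   #SBATCH --signal=SIGUSR1          -> job steps, 60s before the time limit
#   #SBATCH --signal=SIGUSR1@300      -> job steps, 300s before the time limit
#   #SBATCH --signal=B:SIGUSR1@120    -> batch script only, 120s before the time limit
#   #SBATCH --signal=B:10@120         -> same, by number (where SIGUSR1 is 10)
```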

You may prefer that the signal be sent to your C or Fortran code rather than to the submission script, so that the code itself performs a specific action (e.g. generation of a recovery file). In that case, do not specify "B".

trap command

The trap command is used to associate a command or function with a signal: if the submission script receives this signal, its execution is interrupted in order to execute the associated command or function.
In the example above, the trap 'sig_handler_USR1' SIGUSR1 command defines the execution of the sig_handler_USR1 function when the SIGUSR1 signal is received by the submission script.

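The signal can be changed, and it can be given by name or by numerical value. Note that the number depends on the architecture: SIGUSR1 is 10 on most Linux systems (x86, ARM), but man 7 signal also lists 30 and 16 for other architectures, so the name is the safer choice.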
The function can be replaced by a command.

It is possible to declare several interceptions of different signals in the same script by adding more trap lines.
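
A minimal sketch of two interceptions declared side by side follows. The function name demo_multi_trap and the messages are invented for this illustration, and the process signals itself only so that the example runs without Slurm.

```shell
#!/bin/bash
# Illustration only: demo_multi_trap is a hypothetical name.
# The body runs in a subshell ( ... ) so the traps do not leak into the caller.
demo_multi_trap() (
    trap 'echo "USR1: write a recovery file"' USR1
    trap 'echo "TERM: final cleanup"; exit 0' TERM
    kill -USR1 "$BASHPID"    # handled by the first trap, execution continues
    kill -TERM "$BASHPID"    # handled by the second trap, which exits
    echo "not reached"
)
demo_multi_trap
```

This prints "USR1: write a recovery file" and then "TERM: final cleanup": each signal runs only the handler associated with it.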

wait command

The end of a submission script triggers the end of the job: any processes of the job that are still running receive a SIGKILL signal.

The wait command, as its name implies, waits until the child processes (srun ... &) are finished before continuing to execute the following lines.

If the srun commands are not executed in the background, bash does not run the trap until the foreground srun command has finished, so the script cannot react to the signal in time.

If the job contains several steps (srun commands) to be run successively, you should alternate srun ... & and wait in order to wait for the end of the previous command.
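
The alternation of background commands and wait can be sketched as follows. This is an illustration under stated assumptions: sleep 2 stands in for each srun command so that the pattern runs without Slurm, and the names run_steps and demo_run_steps are invented. The driver sends SIGUSR1 while the second step is running.

```shell
#!/bin/bash
# Sketch: 'sleep 2' stands in for 'srun ...' so the pattern runs anywhere.
run_steps() {
    # On SIGUSR1: report the current step, stop it, leave with code 2.
    trap 'echo "signal caught during step $step"; kill $! 2>/dev/null; exit 2' USR1
    for step in 1 2 3; do
        sleep 2 &        # in a job script: srun ... &
        wait $!          # returns as soon as a trapped signal arrives
        echo "step $step done"
    done
}

demo_run_steps() {
    local rc=0
    run_steps &
    local pid=$!
    sleep 3              # by now step 1 is over and step 2 is running
    kill -USR1 "$pid"
    wait "$pid" || rc=$?
    echo "exit code: $rc"
}
demo_run_steps
```

With these timings the output is "step 1 done", "signal caught during step 2" and "exit code: 2". Because each step runs in the background and the script sits in wait, the handler runs immediately instead of after the step has finished.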

SIGTERM signal

As mentioned above, Slurm sends a SIGTERM signal when the time limit is reached. This signal can be intercepted as well: just add a trap line for SIGTERM and associate it with a function.

If you have also used the --signal option, the function associated with that signal should end by going back to waiting for the end of the job, because if the submission script ends, its children (the srun commands) are killed. This corresponds to replacing the exit 2 command in the example with the wait command.

scancel command

The scancel command behaves like the kill command: it does not "kill" directly, it sends a signal to a computation.

By default, scancel sends a succession of three signals: SIGCONT, then SIGTERM and, about 30s later, SIGKILL.

It is possible to send another signal via the scancel command, using the --signal parameter. In order for the signal to be sent only to the submission script, the -b or --batch option must be added.

example: scancel --batch --signal=SIGUSR1 job_number

Exit code

If the exit code used in the sig_handler_USR1 function (e.g. exit 2) differs from the one used at the end of the script (e.g. exit 0), the exit code tells you how your job ended:

  • 0 → normal end of script,
  • 2 → end of script triggered by the signal sent 120s before the time limit

To find the exit codes of a previous job, use the command sacct -j job_id. The "ExitCode" column uses the format Error_Code:Received_Signal. The submission script corresponds to the "batch" line.
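
As a small sketch of how the ExitCode value can be read, the helper below splits it into its two parts. The name describe_exit_code is invented for this illustration. The sacct call in the final comment needs a Slurm cluster and is shown for reference only, with job_id as a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: split an ExitCode value such as "2:0" into its two parts.
describe_exit_code() {
    local value=$1
    local code=${value%%:*}      # part before the colon: exit code
    local signal=${value##*:}    # part after the colon: received signal
    echo "exit code=$code signal=$signal"
}
describe_exit_code "2:0"

# On a cluster, the value comes from the "batch" line of, for example:
#   sacct -j job_id --format=JobID,JobName,State,ExitCode
```

For the value 2:0 this prints "exit code=2 signal=0", which is what the example script produces when sig_handler_USR1 ends it.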

Last update: November 25, 2022 14:05:21