
Management of heterogeneous jobs in Slurm

Since version 17.11, Slurm can handle jobs that require heterogeneous resources: for example, a different amount of memory on some cores, or simultaneous allocations on different partitions in order to couple codes.

Typical use cases: coupling two codes, running a visualization code, launching in client-server (master/slave) mode, etc.

Submit jobs

To launch a heterogeneous job made of 2 jobs that run at the same time on different resources, describe both jobs in the same submission script.

#!/bin/bash

### Common part
#SBATCH --time 0:10:00
#SBATCH --error=job_heterogene.%J.err
#SBATCH --output=job_heterogene.%J.out

### Resources for the first job
#SBATCH -n 1 --mem-per-cpu 2000  --exclusive -p debug --time=0:10:00

#SBATCH packjob

### Resources for the second job
#SBATCH -n 2 --mem-per-cpu 3000 -p knl --time=0:20:00

# Copy the input files
cp fichiers_entree "$LOCAL_WORK_DIR"
cd "$LOCAL_WORK_DIR"

# Launch each code on its associated pack-group
srun --pack-group=0 code1 >> sortie_code1.txt
srun --pack-group=1 code2 >> sortie_code2.txt

The script is submitted like a normal job: sbatch script.sl
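
In the script above, each srun call blocks until its step completes, so code2 only starts once code1 has finished. If the two codes have to run side by side (as in a coupling), one possible approach is to start both steps in the background and wait for them. The sketch below assumes this pattern suits your codes; `run_concurrently` is a hypothetical helper name, not a Slurm command, and the srun lines in the trailing comment simply reuse the commands from the script above.

```shell
#!/bin/bash

# Hypothetical helper: run two command lines at the same time
# and wait for both. Returns 0 only if both commands succeeded.
run_concurrently() {
    local cmd1="$1" cmd2="$2"
    local rc1=0 rc2=0

    # Start both commands in the background and remember their PIDs.
    bash -c "$cmd1" & local pid1=$!
    bash -c "$cmd2" & local pid2=$!

    # Collect the exit status of each command.
    wait "$pid1" || rc1=$?
    wait "$pid2" || rc2=$?

    [ "$rc1" -eq 0 ] && [ "$rc2" -eq 0 ]
}

# Assumed usage inside the batch script (not executed here):
# run_concurrently "srun --pack-group=0 code1 >> sortie_code1.txt" \
#                  "srun --pack-group=1 code2 >> sortie_code2.txt"
```

Because each step targets its own pack-group, the two background steps do not compete for the same resources. Checking both exit codes makes the batch script fail visibly if either code fails.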

Alternatively, from the command line:

# Job on 3 cores (2 in the `debug` partition and 1 in the `knl` partition)
$ srun --label -n2 -pdebug : -n1 -pknl hostname
srun: job 8004622 queued and waiting for resources
srun: job 8004622 has been allocated resources
1: my010
0: my010
2: my362

Monitoring and cancelling a job

Heterogeneous jobs are displayed differently in squeue: at the Slurm level, a heterogeneous job is a single job containing sub-jobs. The sub-jobs therefore appear under the same job number, followed by "+num", where num identifies the packjob component.

$ squeue -u monlogin
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         8003734+0     debug heteroge monlogin PD       0:00      1 (None)
         8003734+1       knl heteroge monlogin PD       0:00      1 (None)

To cancel the job together with all its sub-jobs, run scancel on the job number (in the example above: scancel 8003734).

To cancel only the +0 sub-job, run scancel 8003734+0.

Caveat

While the job is pending, its sub-jobs cannot be cancelled individually: only the whole job, with all its sub-jobs, can be cancelled.
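
The cancellation rules above can be folded into a small helper that picks the ID to hand to scancel. This is only a sketch: `scancel_target` is a hypothetical function name, the state codes (`PD`, `R`) are the ones squeue prints in its ST column, and falling back to the whole job when it is pending is an assumed design choice that may not be what you want in every case.

```shell
#!/bin/bash

# Hypothetical helper: choose the ID to pass to scancel.
#   $1 = job state as shown by squeue (e.g. PD, R)
#   $2 = job ID, possibly with a "+num" component suffix (e.g. 8003734+1)
scancel_target() {
    local state="$1" jobid="$2"
    case "$state" in
        PD|PENDING)
            # Pending: a single component cannot be cancelled,
            # so strip the "+num" suffix and target the whole job.
            printf '%s\n' "${jobid%%+*}"
            ;;
        *)
            # Any other state: target the ID exactly as given.
            printf '%s\n' "$jobid"
            ;;
    esac
}

# Assumed usage on the cluster (not executed here):
# scancel "$(scancel_target PD 8003734+1)"
```

The helper only prints the ID, so it can be checked without touching the scheduler before it is wired to scancel.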

Slurm documentation

For more information, see the Slurm documentation, taking care to select the version that matches the Slurm version installed on Myria. As of October 2022, Myria runs Slurm 20.02.7.

Documentation: https://slurm.schedmd.com/archive/slurm-20.02.7/heterogeneous_jobs.html


Last update: November 25, 2022 14:05:21