Management of heterogeneous jobs in Slurm¶
Since version 17.11, Slurm can handle computations requiring heterogeneous resources: for example, a different amount of memory on certain cores, or simultaneous jobs on different partitions to perform a coupling.
Typical use cases: coupling between two codes, visualization codes, or launching in client-server (master/slave) mode.
Submit jobs¶
Launching a heterogeneous job made of two jobs running at the same time on different resources means describing both jobs in the same submission script.
The submission itself is identical to a normal job: sbatch script.sl
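Such a submission script could look like the following sketch, assuming the Slurm 20.02 `#SBATCH hetjob` separator directive (older releases used `packjob`); the partition names `debug` and `knl` are taken from the srun example below, and the resources match it (2 cores plus 1 core):

```shell
#!/bin/bash
# Sketch of a heterogeneous submission script (script.sl).
# First component: 2 cores in the debug partition
#SBATCH --partition=debug
#SBATCH --ntasks=2
# The "hetjob" directive separates the two job components (Slurm >= 20.02)
#SBATCH hetjob
# Second component: 1 core in the knl partition
#SBATCH --partition=knl
#SBATCH --ntasks=1

# A single srun launches both components; ":" separates them,
# as in the command-line form below
srun --label hostname : hostname
```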
Another solution, on the command line:
# Job on 3 cores (2 in the `debug` partition and 1 in the `knl` partition)
$ srun --label -n2 -pdebug : -n1 -pknl hostname
srun: job 8004622 queued and waiting for resources
srun: job 8004622 has been allocated resources
1: my010
0: my010
2: my362
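Inside an existing heterogeneous allocation (obtained with salloc or sbatch), a single component can also be targeted with srun's `--het-group` option. This is a sketch, assuming the Slurm 20.02 option name (older releases called it `--pack-group`):

```shell
# Run only on the second component (index 1) of the current allocation
srun --het-group=1 hostname
```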
Monitoring and stopping a job¶
The display of jobs in squeue changes: at the Slurm level, the submission corresponds to one job containing sub-jobs. They therefore appear under the same number, with a "+num" suffix identifying each component (packjob).
$ squeue -u monlogin
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8003734+0 debug heteroge monlogin PD 0:00 1 (None)
8003734+1 knl heteroge monlogin PD 0:00 1 (None)
To kill the job together with all of its sub-jobs, use scancel on the job number (in the above example: scancel 8003734).
To kill only the +0 sub-job, specify it explicitly: scancel 8003734+0.
Subtlety
If the job is in the pending state, individual sub-jobs cannot be cancelled; only the job together with all of its sub-jobs can be.
Slurm documentation¶
For more information, consult the Slurm documentation, taking care to select the version matching the one installed on Myria. As of October 2022, Myria runs Slurm 20.02.7.
Documentation: https://slurm.schedmd.com/archive/slurm-20.02.7/heterogeneous_jobs.html