An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.
The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to wait in a queue to get into a popular restaurant, you may understand why your jobs sometimes do not start instantly, as they would on your laptop.
Nimbus, the cloud HPC service, uses a scheduler to manage how jobs are run and how resources are allocated.
If multiple jobs ran on a single node at the same time, users would compete for the same resources and jobs would take longer to run overall. A scheduler manages individual jobs, allocating them the resources they need as those resources become available. This results in higher overall throughput and more consistent performance.
The scheduler used by Nimbus is the same as that used by its predecessor, Balena: Slurm (the _Simple Linux Utility for Resource Management_).
Interacting with the scheduler is done through the terminal using a set of commands. Below are a number of key Slurm commands; a short example of using them follows the table:
| Slurm command | Function |
|---|---|
| `sinfo` | View information about Slurm nodes and partitions |
| `squeue` | List the status of jobs in the queue |
| `squeue --user [userid]` | List jobs belonging to a user |
| `squeue --job [jobid]` | List a job by job ID |
| `sbatch [jobscript]` | Submit a job script to the scheduler |
| `scancel [jobid]` | Cancel a job in the queue |
| `scontrol hold [jobid]` | Hold a job in the queue |
| `scontrol release [jobid]` | Release a held job |
| `scontrol show job [jobid]` | View information about a job |
| `scontrol show node [nodename]` | View information about a node |
| `scontrol show license` | List the licenses available on Slurm |
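
As a concrete illustration, here is a minimal sketch of a Slurm job script and how it might be submitted and monitored. The script name, partition name, and resource requests are placeholder assumptions; check the Nimbus documentation for the partitions and limits that apply to your account.

```bash
#!/bin/bash
#SBATCH --job-name=example      # a label that appears in squeue output
#SBATCH --partition=batch       # placeholder partition name; use a real Nimbus partition
#SBATCH --nodes=1               # request one node
#SBATCH --ntasks=1              # run a single task
#SBATCH --time=00:10:00         # wall-time limit of ten minutes

# Commands to run once the scheduler allocates the requested resources
echo "Running on $(hostname)"
```

Saving the script as `example.slurm` (a hypothetical filename), it could be submitted and tracked with:

```bash
sbatch example.slurm      # submit the job script; Slurm replies with a job ID
squeue --user $USER       # list your jobs and their states in the queue
scancel <jobid>           # cancel the job if it is no longer needed
```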