An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.
The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to wait in a queue to get into a popular restaurant, you may understand why your jobs sometimes do not start instantly, as they would on your laptop.
Nimbus, the cloud HPC service, uses a scheduler to manage how jobs are run and how resources are allocated.
If multiple jobs ran on a single node at the same time, users would compete for the same resources and jobs would take longer to run overall. A scheduler manages individual jobs, allocating them the resources they need as those resources become available. This results in higher overall throughput and more consistent performance.
The scheduler used by Nimbus is the same as that used by its predecessor, Balena: Slurm (the _Simple Linux Utility for Resource Management_).
Interacting with the scheduler is done through the terminal using a set of commands. Below are a number of key Slurm commands; a short example of using them follows the table:
| Slurm command | Function |
|---|---|
| `sinfo` | View information about Slurm nodes and partitions |
| `squeue` | List the status of jobs in the queue |
| `squeue --user [userid]` | List jobs belonging to a user |
| `squeue --job [jobid]` | List a job by job ID |
| `sbatch [jobscript]` | Submit a job script to the scheduler |
| `scancel [jobid]` | Cancel a job in the queue |
| `scontrol hold [jobid]` | Hold a job in the queue |
| `scontrol release [jobid]` | Release a held job |
| `scontrol show job [jobid]` | View information about a job |
| `scontrol show node [nodename]` | View information about a node |
| `scontrol show license` | List the licenses available on Slurm |
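
As a concrete illustration, here is a minimal sketch of a Slurm job script and how it might be submitted and monitored. The script name, partition name, and resource requests are placeholder assumptions; check the Nimbus documentation for the partitions and limits that apply to your account.

```bash
#!/bin/bash
#SBATCH --job-name=example      # a label that appears in squeue output
#SBATCH --partition=batch       # placeholder partition name; use a real Nimbus partition
#SBATCH --nodes=1               # request one node
#SBATCH --ntasks=1              # run a single task
#SBATCH --time=00:10:00         # wall-time limit of ten minutes

# Commands to run once the scheduler allocates the requested resources
echo "Running on $(hostname)"
```

Saving the script as `example.slurm` (a hypothetical filename), it could be submitted and tracked with:

```bash
sbatch example.slurm      # submit the job script; Slurm replies with a job ID
squeue --user $USER       # list your jobs and their states in the queue
scancel <jobid>           # cancel the job if it is no longer needed
```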