Slurm

Overview:

  • Teaching: 10 min
  • Exercises: 0 min

Questions

  • What is a scheduler?
  • How can I use Slurm to manage and run my jobs?
  • What Slurm commands can I use to explore the system?

Objectives

  • Know that the scheduler manages jobs on the service
  • Know how to interact with Slurm to:
    • see what jobs are running
    • cancel jobs
    • find out information about my jobs

Job Scheduling

An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.

The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to queue to get into a popular restaurant, you will understand why your jobs sometimes do not start instantly, as they would on your laptop.

[Figure: Scheduler]

Nimbus, the cloud HPC service, uses a scheduler to manage how jobs are run and how resources are allocated.

If multiple jobs ran on a single node at the same time, users would be competing for the same resources and jobs would take longer to run overall. Instead, the scheduler manages individual jobs, allocating each one the resources it needs as they become available. This results in higher overall throughput and more consistent performance.
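To make the idea of "allocating resources as they become available" concrete, here is a toy first-come-first-served scheduler for a single 4-core node, written in bash. It is only a sketch: the job names, core counts and runtimes are invented, and real schedulers such as Slurm use far more sophisticated policies than this.

```shell
#!/usr/bin/env bash
# Toy first-come-first-served scheduler for ONE node with 4 cores.
# Each queue entry is "name:cores:runtime" (all values are made up).
# Note: a job asking for more than total_cores would never start.
total_cores=4
queue=("A:2:3" "B:2:2" "C:4:1" "D:1:2")

free=$total_cores
running=()      # entries are "name:cores:end_time"
started=()      # records "name@start_time" for each job
t=0

while ((${#queue[@]} > 0 || ${#running[@]} > 0)); do
    # 1. Release the cores of every job that has finished by time t.
    still=()
    for r in "${running[@]}"; do
        IFS=: read -r name cores end <<< "$r"
        if ((end <= t)); then
            free=$((free + cores))
        else
            still+=("$r")
        fi
    done
    running=("${still[@]}")

    # 2. Start jobs from the head of the queue for as long as they fit.
    while ((${#queue[@]} > 0)); do
        IFS=: read -r name cores runtime <<< "${queue[0]}"
        ((cores <= free)) || break      # head of queue must wait
        free=$((free - cores))
        running+=("${name}:${cores}:$((t + runtime))")
        started+=("${name}@${t}")
        echo "t=${t}: start ${name} on ${cores} core(s)"
        queue=("${queue[@]:1}")         # remove the job from the queue
    done

    t=$((t + 1))
done
```

Jobs A and B start immediately because together they fill the node. Job C needs all four cores, so it waits until both have finished (t=3), and job D queues behind it until t=4. This is the same waiting you see when your own job sits in the queue.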

The scheduler used by Nimbus is the same as that used by its predecessor, Balena: Slurm (Simple Linux Utility for Resource Management).

Slurm: Simple Linux Utility for Resource Management

The scheduler on Nimbus is Slurm (Simple Linux Utility for Resource Management).

You interact with the scheduler from the terminal, using a set of commands.

Below are a number of key Slurm commands:

Slurm command                   Function
sinfo                           View information about Slurm nodes and partitions
squeue                          List the status of jobs in the queue
squeue --user [userid]          List jobs belonging to a user
squeue --job [jobid]            List a job by its job ID
sbatch [jobscript]              Submit a jobscript to the scheduler
scancel [jobid]                 Cancel a job in the queue
scontrol hold [jobid]           Hold a job in the queue
scontrol release [jobid]        Release a held job
scontrol show job [jobid]       View information about a job
scontrol show node [nodename]   View information about a node
scontrol show license           View the licenses available through Slurm
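The commands above are typically used together. The following is a minimal sketch of one session, not a recommended template for Nimbus: the file name `hello.slurm`, the job name and the resource requests are illustrative assumptions, and your site may also require options such as `--account` or `--partition`. The scheduler commands only run if Slurm is installed on the machine.

```shell
#!/usr/bin/env bash
# 1. Write a minimal job script. The "#SBATCH" lines are read by sbatch
#    as options; the values here are illustrative only.
cat > hello.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

echo "Hello from $(hostname)"
EOF

# 2. Only talk to the scheduler if the Slurm commands are available.
if command -v sbatch >/dev/null 2>&1; then
    sinfo                                  # nodes and partitions

    # --parsable makes sbatch print just the job ID
    # (optionally followed by ";cluster", which we strip).
    if jobid=$(sbatch --parsable hello.slurm); then
        jobid=${jobid%%;*}
        squeue --user "$USER"              # all of my jobs
        squeue --jobs "$jobid"             # just this job (--jobs is the full option name)
        scontrol show job "$jobid"         # detailed information
        scancel "$jobid"                   # cancel the job again
    fi
else
    echo "Slurm commands not found; created hello.slurm only."
fi
```

Capturing the job ID at submission time saves you from having to read it off the `squeue` output before calling `scontrol` or `scancel`.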

Key Points:

  • We use a scheduler to manage jobs on the cloud HPC service
  • Nimbus uses the Slurm scheduler
  • Key commands are:
    • sbatch to submit jobs
    • sinfo to view information about the service
    • squeue to view the queue
    • scancel to cancel a job

You can find further information about Slurm and its commands here: http://slurm.schedmd.com/