Job Scheduler

Overview:

  • Teaching: 30 min
  • Exercises: 30 min

Questions

  • What is a scheduler and why are they used?
  • How do I launch a program to run on a compute node?
  • How do I check limits on the cluster and my accounts?

Objectives

  • Run a simple job on the cluster's compute nodes.
  • Inspect the status of your job.
  • Inspect the output and error files of your job.

Job Scheduling

An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.

The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to wait in a queue to get into a popular restaurant, you will understand why your jobs sometimes do not start instantly, as they would on your laptop.

[Figure: the job scheduler compared to a restaurant waiter]

Slurm: Simple Linux Utility for Resource Management

If multiple jobs ran on a single node at the same time, users would be competing for the same resources and every job would take longer to run. A scheduler instead manages individual jobs, allocating them the resources they need as those resources become available. This results in higher overall throughput and more consistent performance.

The scheduler on our cluster container (and on Nimbus) is Slurm: the Simple Linux Utility for Resource Management.

Interacting with the scheduler is done through the terminal using an array of commands.

Run a bash job

Using the text editor in the terminal, create a bash script called test_job.sh with the following:

#!/bin/bash

echo 'This script is running on:'
hostname
sleep 60

And run it.

Where did the script run?
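
If you are not sure how to run the script directly, one minimal way (assuming you saved it as test_job.sh in your current working directory) is:

jupyter-user:$ bash test_job.sh

The hostname it prints tells you which machine the script actually ran on.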

sbatch

If you completed the previous challenge successfully, you probably realise that there is a distinction between running the job through the scheduler and just “running it”. To submit the job to the scheduler, we use the sbatch command:

jupyter-user:$ sbatch test_job.sh

Try this now. What happens?

Solution

When submitting a job you typically have to give the scheduler, slurm, information in order to help it decide where and when to run the job.

If there are lots of different compute nodes, for example, we might want to make sure our job goes on a particular CPU type. We will also have to tell slurm how long our job will run for, so that it can make decisions about where to put jobs.

So let's have a look at a more complete job script:

#!/bin/bash

#SBATCH --account=prj0_phase1
#SBATCH --job-name=myjob
#SBATCH --partition=shortJob
#SBATCH --qos=shortJob
#SBATCH --time=0-00:25:00

echo 'This script is running on:'
hostname
sleep 60

Every line that starts with #SBATCH is a directive for slurm: this is where we tell slurm the resources we want for our job.

#SBATCH --account= tells slurm what account you wish to run against. Slurm will select a default account, if there is one, but it is better practice to make sure you include this. (In order to run a job on Nimbus you need to have a resource allocation, with funds, and it is this account code that you put here.)

#SBATCH --job-name= gives the job a name that you can identify in the queue - you can call the job whatever you like, as long as it helps you to identify it.

#SBATCH --partition= tells slurm what partition to put the job on. Typically an HPC system will have several partitions with different resources or limits on them. For example, there may be a partition with a particular CPU type, or one with limits on the size or duration of job you can run.

#SBATCH --qos= tells slurm what Quality of Service you wish to run with. The QOS is simply a way for admins to apply rules to the resources you can access and the priority of the job. For Bath's HPC systems the QOS will match the partition name - and remember, if you don't include it you will get an error.

#SBATCH --time= tells slurm how long you wish to run the job for, with several acceptable formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds" e.g. 1-01:20:00 will request a runtime of 25 hours and 20 mins.
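
These options can also be given to sbatch on the command line, where they override the matching #SBATCH directives in the script. For example:

jupyter-user:$ sbatch --time=0-00:10:00 --partition=shortJob --qos=shortJob test_job.sh

Keeping the directives in the script is usually better practice, as it makes jobs easier to reproduce, but the command-line form is handy for quick one-off changes.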

Running a batch job

Try editing your runscript to match the example above and submit it.

What happens?

Solution

sinfo

We get an error above because the time we requested exceeds the partition's time limit. We can use the slurm command sinfo to get information about what partitions are available:

jupyter-user:$sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
bigJob           up 1-00:00:00      4   idle c[1-4]
shortJob         up      15:00      4   idle c[1-4]
notallowedjob    up      15:00      4   idle c[1-4]

We can see there are three different partitions: bigJob, shortJob and notallowedjob. All are available (we can see up in the AVAIL column), and next we can see the time limit on jobs allowed to run on each partition.

Now we can see why our job was rejected - we requested a 25 minute runtime on the shortJob partition, when the time limit is only 15 minutes. Try reducing the time requested in the runscript and resubmitting.
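
For example, a request that fits within the 15 minute limit on shortJob would be:

#SBATCH --time=0-00:10:00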

sacctmgr

Yet another error:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

So we need to check what accounts and QOS we have access to. We can do that with the sacctmgr command:

sacctmgr show associations user=jupyter-user --parsable2 format=account,user

This shows the associations for our user - which accounts (and QOS) we are allowed to submit against - outputting only the fields we are interested in (format=account,user) in a readable, machine-parsable format (--parsable2).
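
The output is pipe-separated, one association per line, and will look something like this (the account shown here is the one used in the example script above - yours may differ):

Account|User
prj0_phase1|jupyter-user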

Adding the account

Using the sacctmgr command, take note of the account you have access to, update your submission script, and resubmit it using the sbatch command.

Where was the script executed? Have any new files been created?

Solution
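
A quick way to check, assuming the job ran successfully (your job id and node name will differ):

jupyter-user:$ ls
slurm-30.out  test_job.sh
jupyter-user:$ cat slurm-30.out
This script is running on:
c1

This time the hostname is one of the compute nodes (c1-c4) rather than the node you are logged in to, and the output has been captured in a new file, slurm-<jobid>.out.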

Redirecting output

We have seen that a new file was created by our successful submission: slurm-<jobid>.out. We can tailor where our output goes with a couple of slurm directives:

#SBATCH --output= tells slurm where to put the standard output
#SBATCH --error= tells slurm where to send any error messages

We can also incorporate the job name with %x and the job id with %j to help identify our output files and ensure they aren't overwritten.
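
For example, with --job-name=myjob and a job id of 44, the patterns used in the challenge below would expand as:

%x.%j.o   ->   myjob.44.o
%x.%j.e   ->   myjob.44.e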

Redirecting output

Include the following in your run script and resubmit.

#SBATCH --output=%x.%j.o
#SBATCH --error=%x.%j.e

What happens to your output now?

Solution

Monitoring your jobs

Obviously, at some point your jobs may have to wait to run if all the resources are currently in use. Jobs waiting to be run will be placed in a queue. There are several commands we can use to check on our jobs.

The command squeue will show us the current queue:

jupyter-user:$squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 7  shortJob    myjob jupyter- PD       0:00      1 (Resources)
                 8  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                 9  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                10  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                11  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                12  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                13  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                14  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                15  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                16  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                17  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                18  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                19  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                20  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                21  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                22  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                23  shortJob    myjob jupyter- PD       0:00      1 (Priority)
                 3  shortJob    myjob jupyter-  R       0:32      1 c1
                 4  shortJob    myjob jupyter-  R       0:02      1 c2
                 5  shortJob    myjob jupyter-  R       0:02      1 c3
                 6  shortJob    myjob jupyter-  R       0:02      1 c4

Each job is listed with a numeric jobid (JOBID), the partition (PARTITION), the job name (NAME), the user that owns the job (USER), the job state (ST), the runtime (TIME), the number of nodes requested (NODES), and the nodes it is running on or a reason why it is queued (NODELIST(REASON)).

squeue will accept arguments (remember you can see which arguments are available with the command man squeue), but probably the most useful one is to pass your username through with squeue -u jupyter-user, which will show only your jobs.

There is another command that will produce more detailed information about your jobs:

jupyter-user:$scontrol show job <jobid>
JobId=44 JobName=myjob
   UserId=jupyter-cor22(1000) GroupId=jupyter-cor22(1000) MCS_label=N/A
   Priority=4294901716 Nice=0 Account=prj0_phase1 QOS=shortjob
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:03 TimeLimit=00:15:00 TimeMin=N/A
   SubmitTime=2023-02-06T22:07:26 EligibleTime=2023-02-06T22:07:26
   AccrueTime=2023-02-06T22:07:26
   StartTime=2023-02-06T22:07:26 EndTime=2023-02-06T22:22:26 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-02-06T22:07:26 Scheduler=Main
   Partition=shortJob AllocNode:Sid=login:606
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c1
   BatchHost=c1
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=500M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/jupyter-cor22/test.sh
   WorkDir=/data/jupyter-cor22
   StdErr=/data/jupyter-cor22/myjob.44.e
   StdIn=/dev/null
   StdOut=/data/jupyter-cor22/myjob.44.o
   Power=

Try submitting a job and monitoring it with squeue and scontrol now. Note the different job states, shown in shorthand by squeue (R, PD, etc.) and in long form by scontrol (RUNNING, PENDING, etc.).
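
A typical sequence might look like this (the job id slurm reports back will differ):

jupyter-user:$ sbatch test_job.sh
Submitted batch job 44
jupyter-user:$ squeue --user jupyter-user
jupyter-user:$ scontrol show job 44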

Cancelling a job

The last command you should know is the one to cancel a job, as mistakes in job submissions can and often do happen.

To cancel a job you use the command:

jupyter-user:$scancel <jobid>

Where <jobid> is the numeric job id you can get from squeue or scontrol.

Try submitting and cancelling a job now.
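
For example (again, the job id will differ):

jupyter-user:$ sbatch test_job.sh
Submitted batch job 45
jupyter-user:$ scancel 45
jupyter-user:$ squeue --user jupyter-user

The cancelled job may briefly appear in the CG (completing) state before it disappears from the queue.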

Summary

Here is a list of the key slurm commands you will need to interact with the scheduler.

Slurm command                   Function
sinfo                           View information about SLURM nodes and partitions
squeue                          List status of jobs in the queue
squeue --user [userid]          List jobs belonging to a user
squeue --job [jobid]            List a job by jobid
sbatch [jobscript]              Submit a jobscript to the scheduler
scancel [jobid]                 Cancel a job in the queue
scontrol hold [jobid]           Hold a job in the queue
scontrol release [jobid]        Release a held job
scontrol show job [jobid]       View information about a job
scontrol show node [nodename]   Get information about a node
scontrol show license           Get licenses available on SLURM

You can find out more details in the slurm documentation here: http://slurm.schedmd.com/

Key Points:

  • We use a scheduler to manage jobs on the cloud HPC service
  • Nimbus uses the slurm scheduler
  • Key commands are:
    • sbatch to submit jobs
    • sinfo to view information about the service
    • squeue to view the queue
    • scancel to delete a job
    • scontrol to show and control jobs
    • sacctmgr to query account details

You can find further information about slurm and the commands here: http://slurm.schedmd.com/