An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This is the job of a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.
The following illustration compares the tasks of a job scheduler to those of a waiter in a restaurant. If you have ever had to wait in a queue to get into a popular restaurant, you may now understand why your jobs sometimes do not start instantly, as they would on your laptop.
If multiple jobs ran on a single node at the same time, users would be competing for the same resources and jobs would take longer to run overall. A scheduler manages individual jobs, allocating them the resources they need as those resources become available. This results in higher overall throughput and more consistent performance.
The scheduler on our cluster container (and Nimbus) is slurm: the _Simple Linux Utility for Resource Management_.
Interacting with the scheduler is done through the terminal using an array of commands.
When submitting a job you typically have to give the scheduler, slurm, information to help it decide where and when to run the job. If there are lots of different compute nodes, for example, we might want to make sure our job goes on a particular CPU. And we will have to tell slurm how long our job will run for, so it can make decisions about where to put jobs.
So let's have a look at a more complete job script:
#!/bin/bash
#SBATCH --account=prj0_phase1
#SBATCH --job-name=myjob
#SBATCH --partition=shortJob
#SBATCH --qos=shortJob
#SBATCH --time=0-00:25:00
echo 'This script is running on:'
hostname
sleep 60
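Assuming the script above is saved as test.sh (the filename that appears later in the scontrol output), we would submit it to the scheduler with sbatch:

```shell
# Submit the job script; slurm replies with the id assigned to the job,
# e.g. "Submitted batch job 44"
sbatch test.sh
```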
Every line that starts with #SBATCH is a directive for slurm: this is where we tell slurm the resources we want for our job.
#SBATCH --account=
tells slurm what account you wish to run against. Slurm will select a default account, if there is one, but it is better practice to make sure you include this. (In order to run a job on Nimbus you need to have a resource allocation, with funds, and it is that account code that you put here.)
#SBATCH --job-name=
gives the job a name, which you can see in the queue - call the job whatever helps you identify it.
#SBATCH --partition=
tells slurm what partition to put the job on. Typically an HPC system will have different partitions with different resources or limits on them. For example, there may be a partition with a particular CPU type, or one with limits on the size or duration of jobs you can run.
#SBATCH --qos=
tells slurm what Quality of Service you wish to run with. The QOS is simply a way for admins to apply rules to the resources you can access and the priority of the job. For Bath's HPC systems the QOS will match the partition name - and remember, if you don't include it you will get an error.
#SBATCH --time=
tells slurm how long you wish to run the job for, with several acceptable formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". E.g. 1-01:20:00 will request a runtime of 25 hours and 20 minutes.
We get an error above because the time limit on the partition has been exceeded. We can use the slurm command sinfo to get information about what partitions are available:
jupyter-user:$sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
bigJob up 1-00:00:00 4 idle c[1-4]
shortJob up 15:00 4 idle c[1-4]
notallowedjob up 15:00 4 idle c[1-4]
We can see there are three different partitions: bigJob, shortJob and notallowedjob. All are available (we can see up in the AVAIL column), and next we can see the time limit on jobs allowed to run on that partition.
Now we can see why our job was rejected - we requested a 25 minute runtime on the shortJob partition, where the time limit is only 15 minutes.
Try reducing the time requested in the runscript and resubmitting.
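For example (a sketch - any value within the 15-minute limit would do), the time directive in the script could become:

```shell
#SBATCH --time=0-00:10:00   # 10 minutes, within the 15-minute shortJob limit
```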
Yet another error:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
So we need to check what accounts and QOS we have access to. We can do that with the sacctmgr command:
sacctmgr show associations user=jupyter-user --parsable2 format=account,user
This shows the associations for our user account, outputting only the fields we are interested in (format=account,user) in a readable format (--parsable2).
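On our cluster container the output might look something like this (a hypothetical example; the account name matches the one used in the job script above):

```
Account|User
prj0_phase1|jupyter-user
```

If the account in your job script does not appear in this list, slurm will reject the submission with the "Invalid account" error we saw.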
We have seen that a new file, slurm-<jobid>.out, was created by our successful submission. We can tailor where our output goes with a couple of slurm directives:
#SBATCH --output=
will tell slurm where to write the output
#SBATCH --error=
will tell slurm where to send any error messages
We can also incorporate the job name with %x and the job id with %j to help identify our output files and ensure they aren't overwritten:
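For example (the filename pattern here is our choice), directives like these would send job 44's output to myjob.44.o and its errors to myjob.44.e - the StdOut and StdErr paths we will see in the scontrol output below:

```shell
#SBATCH --job-name=myjob
#SBATCH --output=%x.%j.o   # %x = job name, %j = job id -> myjob.44.o
#SBATCH --error=%x.%j.e    # errors go to a separate file -> myjob.44.e
```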
Obviously at some point your jobs may have to wait to run, if all the resources are currently being used. Jobs waiting to run will be placed in a queue. There are several commands we can use to check on our jobs.
The command squeue will show us the current queue:
jupyter-user:$squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7 shortJob myjob jupyter- PD 0:00 1 (Resources)
8 shortJob myjob jupyter- PD 0:00 1 (Priority)
9 shortJob myjob jupyter- PD 0:00 1 (Priority)
10 shortJob myjob jupyter- PD 0:00 1 (Priority)
11 shortJob myjob jupyter- PD 0:00 1 (Priority)
12 shortJob myjob jupyter- PD 0:00 1 (Priority)
13 shortJob myjob jupyter- PD 0:00 1 (Priority)
14 shortJob myjob jupyter- PD 0:00 1 (Priority)
15 shortJob myjob jupyter- PD 0:00 1 (Priority)
16 shortJob myjob jupyter- PD 0:00 1 (Priority)
17 shortJob myjob jupyter- PD 0:00 1 (Priority)
18 shortJob myjob jupyter- PD 0:00 1 (Priority)
19 shortJob myjob jupyter- PD 0:00 1 (Priority)
20 shortJob myjob jupyter- PD 0:00 1 (Priority)
21 shortJob myjob jupyter- PD 0:00 1 (Priority)
22 shortJob myjob jupyter- PD 0:00 1 (Priority)
23 shortJob myjob jupyter- PD 0:00 1 (Priority)
3 shortJob myjob jupyter- R 0:32 1 c1
4 shortJob myjob jupyter- R 0:02 1 c2
5 shortJob myjob jupyter- R 0:02 1 c3
6 shortJob myjob jupyter- R 0:02 1 c4
with a numeric job id (JOBID), the partition (PARTITION), the job name (NAME), the user that owns the job (USER), the job state (ST), the runtime (TIME), the number of nodes requested (NODES), and the nodes it is running on, or the reason it is queued (NODELIST(REASON)).
squeue will accept arguments (remember you can see what arguments with the command man squeue), but probably the most useful one is to pass your username through with squeue -u jupyter-user, which will show only your jobs.
There is another command that will produce more detailed information about your jobs:
jupyter-user:$scontrol show job <jobid>
JobId=44 JobName=myjob
UserId=jupyter-cor22(1000) GroupId=jupyter-cor22(1000) MCS_label=N/A
Priority=4294901716 Nice=0 Account=prj0_phase1 QOS=shortjob
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:03 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2023-02-06T22:07:26 EligibleTime=2023-02-06T22:07:26
AccrueTime=2023-02-06T22:07:26
StartTime=2023-02-06T22:07:26 EndTime=2023-02-06T22:22:26 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-02-06T22:07:26 Scheduler=Main
Partition=shortJob AllocNode:Sid=login:606
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c1
BatchHost=c1
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=500M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/data/jupyter-cor22/test.sh
WorkDir=/data/jupyter-cor22
StdErr=/data/jupyter-cor22/myjob.44.e
StdIn=/dev/null
StdOut=/data/jupyter-cor22/myjob.44.o
Power=
Try submitting a job and monitoring it with squeue and scontrol now. Note the different job states in shorthand from squeue (R, PD etc.) and long form from scontrol (RUNNING, PENDING etc.).
Here is a list of the key slurm commands you will need to interact with the scheduler.
| Slurm command | Function |
|---|---|
| sinfo | View information about SLURM nodes and partitions |
| squeue | List status of jobs in the queue |
| squeue --user [userid] | Jobs by user |
| squeue --job [jobid] | Jobs by jobid |
| sbatch [jobscript] | Submit a jobscript to the scheduler |
| scancel [jobid] | Cancel a job in the queue |
| scontrol hold [jobid] | Hold a job in the queue |
| scontrol release [jobid] | Release a held job |
| scontrol show job [jobid] | View information about a job |
| scontrol show node [nodename] | Get information about a node |
| scontrol show license | Get licenses available on SLURM |
You can find out more details in the slurm documentation here: http://slurm.schedmd.com/