Running a job

Below are example submission scripts for Nimbus to illustrate the typical workflow.

The following examples can be adapted to your own needs. Typically they will be submitted from the /campaign/ storage area.

Slurm directives

Every line that starts with #SBATCH is a directive for slurm; this is where we tell slurm the resources we want for our job.

#SBATCH --account= tells slurm what account you wish to run against. The account code used in your run script, #SBATCH --account=ACCOUNT_CODE, should match the resource allocation you wish to run your job against. If you don't know your resource allocation code, check the Research Computing Account Management portal at rcam.bath.ac.uk (you need to be on the University's VPN with 'All traffic' selected), ask your account administrator, or use the command sacctmgr show associations user=userid --parsable2. This command will also tell you the limits on the account, and what QOS (and partitions) you have access to.
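For example, to check your own associations from a Nimbus login node (a minimal sketch - replace userid with your University username; the format= list is optional and just restricts the columns shown):

# Show the accounts, partitions, QOS and job limits associated with your user
sacctmgr show associations user=userid --parsable2 format=Account,User,Partition,QOS,MaxJobs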

#SBATCH --job-name= gives the job a name, which you can see in the queue - you can call the job whatever helps you to identify it.

#SBATCH --partition= tells slurm what partition to put the job on.

#SBATCH --qos= tells slurm what Quality of Service (QOS) you wish to run with. The QOS is simply a way for admins to apply rules to the resources you can access and the priority of the job. For Nimbus HPC systems the QOS will match the partition name - and remember, if you don't include it you will get an error.

#SBATCH --time= tells slurm how long you wish to run the job for, with several acceptable formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds" e.g. 1-01:20:00 will request a runtime of 25 hours and 20 mins.
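Putting the directives above together, the top of a job script might look like the following minimal sketch (the values are placeholders to replace with your own):

#!/bin/bash
# Placeholder values below: substitute your own account code, job name,
# partition, QOS and runtime (here 1 day, 1 hour and 20 minutes).
#SBATCH --account=ACCOUNT_CODE
#SBATCH --job-name=my_first_job
#SBATCH --partition=spot-fsv2-1
#SBATCH --qos=spot-fsv2-1
#SBATCH --time=1-01:20:00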

Submitting a simple job

As a working example, we will submit a job that creates a text file named output.txt in your campaign folder with the following contents:

this is my nimbus test

Hint: You can check what accounts you have access to with sacctmgr, and you will also have access to a folder /campaign/ACCOUNT_CODE (the account code is not case sensitive, but the campaign folder name will be in capitals).
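For example (replacing ACCOUNT_CODE with your own code, in capitals):

# List the campaign areas visible to you, then move into your own
ls /campaign
cd /campaign/ACCOUNT_CODE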

First create a python file called test_script.py in your campaign directory containing the following:

# Write the test message to output.txt in the directory the job runs from
with open('output.txt', 'w') as f:
    f.write('this is my nimbus test')

Then create a slurm script called run_python_script.slm defining the parameters of the run, which contains the following:

#!/bin/bash
#SBATCH --account=RA-code
#SBATCH --job-name=JOB_NAME
# %x expands to the job name and %j to the job ID
#SBATCH --output=%x.%j.o
#SBATCH --error=%x.%j.e
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=spot-fsv2-1
#SBATCH --qos=spot-fsv2-1
#SBATCH --time=00:01:00

# Load Modules
module purge
module load Python/3.9.5-GCCcore-10.3.0

# Do Run
python3 test_script.py

Finally to submit the script, in the command line run:

sbatch run_python_script.slm
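sbatch prints the ID of the submitted job. You can then watch the queue and, once the job finishes, inspect the files it produced (the .o and .e names follow the %x.%j pattern set in the directives above):

squeue -u $USER       # show your pending and running jobs
cat output.txt        # file written by test_script.py
cat JOB_NAME.*.o      # standard output from the job
cat JOB_NAME.*.e      # standard error from the job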

DLPOLY

We will run an example DLPOLY simulation, using the DL_POLY_Classic/1.10-foss-2020b module on a spot-fsv2-16 instance and including the --oversubscribe flag.

An example run script - create a file named run_job.slm with the following contents (remembering to replace the account code with that of your resource allocation):

#!/bin/bash
#SBATCH --account=BA-CH2FAM-001
#SBATCH --job-name=dlp
#SBATCH --output=dlp.%j.o
#SBATCH --error=dlp.%j.e
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --partition=spot-fsv2-16
#SBATCH --qos=spot-fsv2-16
#SBATCH --time=06:00:00

#Identify node correctly
source /apps/build/easy_build/scripts/id_instance.sh
source /apps/build/easy_build/scripts/setup_modules.sh

#Purge modules and load required modules
module purge
module load DL_POLY_Classic/1.10-foss-2020b

#Run job
mpirun --oversubscribe -np 16 --host $(hostname):16 DLPOLY.X

And to submit the job, issue the following command in the terminal:

sbatch run_job.slm
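While the job runs you can keep an eye on it with the usual slurm tools (JOBID being the number printed by sbatch):

squeue -u $USER                                       # is the job pending or running?
sacct -j JOBID --format=JobID,JobName,Elapsed,State   # summary of the job once it has started
scancel JOBID                                         # cancel the job if needed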

Note: The source commands in the slurm script help the node to recognise its 'identity'.

Interactive job

It is also possible to obtain a compute node and work interactively.

Issuing the following command in the terminal will request a spot-hbv3-120 compute instance for 6 hours (again, replacing the account code with that of your resource allocation):

srun --partition spot-hbv3-120 --nodes 1 --account prj4_phase1 --qos spot-hbv3-120 --job-name "interactive" --cpus-per-task 120 --time 6:00:00 --pty bash
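Once the allocation is granted you are dropped into a shell on the compute node; work there as you would in a batch script and release the node when finished, for example (a sketch using the Python module shown earlier):

module purge
module load Python/3.9.5-GCCcore-10.3.0   # or whichever module you need
python3 test_script.py                    # run your work interactively on the node
exit                                      # leave the shell and free the allocation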