Running a job

Below are example submission scripts for Nimbus that illustrate the typical workflow.

These examples copy the run data from the submission directory in the /campaign/ area to the fast local disk mounted at /mnt/resource/ before the run, and copy any necessary files back to /campaign/ afterwards. The script then cleans up by removing the working directory in the /mnt/resource area (this will be done automatically, but it is a good habit to get into in case future changes to the storage setup mean it no longer happens automatically).

There are, of course, countless ways to structure the logic in your submission scripts to handle the data transfer between /campaign/ and /mnt/resource.

The following examples can be adapted to your own needs. Typically they will be submitted from the /campaign/ storage area.

Finally - if the local disk found in /mnt/resource/ is not big enough for your outputs, you can use the $BURSTBUFFER environment variable, which points to a folder set up specifically for each run.
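
For example, a run script could use it as the working directory instead of creating one under /mnt/resource/ (a minimal sketch - the input file name is just a placeholder):

# use the per-run burst buffer folder as the working directory
workdir=$BURSTBUFFER
cp my_input_file $workdir/
cd $workdir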

Slurm directives

Every line that starts with #SBATCH is a directive for slurm; this is where we tell slurm the resources we want for our job.

#SBATCH --account= tells slurm what account you wish to run against. The account code used in your run script, #SBATCH --account=ACCOUNT_CODE, should match the resource allocation you wish to run your job against. If you don't know your resource allocation code, check the research computing account management portal at rcam.bath.ac.uk (you need to be on the University's VPN with the "All traffic" option), ask your account administrator, or use the command sacctmgr show associations user=userid --parsable2. This command will also tell you the limits on the account, and what QOS (and partitions) you have access to.
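
For example, to list your own associations from a login node (using $USER in place of your username):

# show the accounts, partitions, QOS and limits associated with your user
sacctmgr show associations user=$USER --parsable2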

#SBATCH --job-name= gives the job a name, which you can identify in the queue - you can call the job whatever you want that helps you to identify it.

#SBATCH --partition= tells slurm what partition to put the job on.

#SBATCH --qos= tells slurm what Quality of Service you wish to run with. The QOS is simply a way for admins to apply rules to the resources you can access and the priority of the job. For Nimbus HPC systems the QOS will match the partition name - and remember, if you don't include it you will get an error.

#SBATCH --time= tells slurm how long you wish to run the job for, with several acceptable formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds" e.g. 1-01:20:00 will request a runtime of 25 hours and 20 mins.
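
For example, each of the following is a valid time request (only one --time directive should appear in a real script):

#SBATCH --time=90            # 90 minutes
#SBATCH --time=04:00:00      # 4 hours
#SBATCH --time=2-12          # 2 days and 12 hours
#SBATCH --time=1-01:20:00    # 1 day, 1 hour and 20 minutes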

Submitting a simple job

Create a text file named test.txt in your campaign folder with the following contents.

My first nimbus run

Now we can create a runscript to copy this file over to the fast local storage at /mnt/resource/, write its contents to an output file called my_output.txt and copy the output back at the end of the run.

Hint: You can check what accounts you have access to with sacctmgr, and you will also have access to a folder /campaign/account_code (the account code itself is not case sensitive, but the campaign folder name will be in capitals).

#!/bin/bash
#SBATCH --account=account_code_here
#SBATCH --job-name=JOB_NAME
#SBATCH --output=%x.%j.o
#SBATCH --error=%x.%j.e
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=spot-fsv2-1
#SBATCH --qos=spot-fsv2-1
#SBATCH --time=04:00:00

# set campaigndir as our current working directory for copy back
campaigndir=$(pwd)
# create a workdir on the fast local disk
workdir=/mnt/resource/workdir
mkdir -p $workdir
# copy our input file over to our workdir
cp test.txt $workdir
# change dir to our workdir
cd $workdir
# do our run
cat test.txt > my_output.txt
# cp our output back to our campaigndir
cp my_output.txt $campaigndir/
# clean up the working directory on the local disk
rm -rf $workdir
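
Save the script above as, for example, test_job.slm (the file name is just an illustration) and submit it from your campaign folder; you can then watch the job in the queue and, once it has finished, find my_output.txt copied back next to the script:

sbatch test_job.slm   # submit the job
squeue -u $USER       # check its state in the queue
cat my_output.txt     # view the output copied back at the end of the run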

Copying back the stderr and stdout files

If we copy everything in our submission directory over to the compute node we will also copy the stderr and stdout files given by:

#SBATCH --output=%x.%j.o
#SBATCH --error=%x.%j.e

This copies them as they were at the start of the job. The files in our campaign directory will continue to be written to as the job runs - so either we must direct stderr and stdout to files on the local storage, or be sure not to copy the stderr and stdout files back from /mnt/resource/, as doing so would overwrite anything written to them since the start of the job with the earlier snapshots.

The next example runscript shows an approach for using rsync to avoid copying these files.

OpenFOAM

We will run an example job using the OpenFOAM module on the hbv3-120 instance:

OpenFOAM/v2012-foss-2020a

An example run script - create a file named run_job.slm with the following contents (remembering to replace the account code with that of your resource allocation):

#!/bin/bash
#SBATCH --account=prj3_phase1
#SBATCH --job-name=JOB_NAME
#SBATCH --output=JOB_NAME.%j.o
#SBATCH --error=JOB_NAME.%j.e
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --partition=spot-hbv3-120
#SBATCH --qos=spot-hbv3-120
#SBATCH --time=04:00:00


# Load the OpenFOAM module
module purge
module load OpenFOAM/v2012-foss-2020a

# as an example we will copy the damBreak tutorial to our current
# directory
cp -r $WM_PROJECT_DIR/tutorials/multiphase/interFoam/laminar/damBreak ./
cd damBreak

# set campaigndir as our current working directory for copy back

campaigndir=$(pwd)
localdisk=/mnt/resource
mkdir -p $localdisk/workdir
workdir=$localdisk/workdir


# Copy any inputs required to the work directory:
# excluding any files with a <JOB_NAME> prefix (whatever your job name actually is), so your
# output and error files don't get overwritten on the copy back
rsync -aP --exclude=JOB_NAME.* $campaigndir/* $workdir

cd $workdir
echo "Work directory" $workdir

# source the foamDotFile and do the run
source $WM_PROJECT_DIR/etc/bashrc
./Allrun

# Copy back any results you need to campaign.
cp -Rf $workdir/*  $campaigndir/
# Clean up the working directory on the local disk
rm -rf $workdir

And to run the job issue the following command in the terminal:

sbatch run_job.slm
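
Once submitted, you can monitor the job with squeue; the stdout and stderr files named in the directives (JOB_NAME.<jobid>.o and JOB_NAME.<jobid>.e) appear in the submission directory:

squeue -u $USER   # list your queued and running jobs
ls JOB_NAME.*     # stdout (.o) and stderr (.e) files (replace JOB_NAME with your --job-name)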

Interactive job

It is also possible to obtain a compute node and work interactively.

Issuing the following command in the terminal will request a spot-hbv3-120 compute instance for 6 hours:

srun --partition spot-hbv3-120 --nodes 1  --account prj4_phase1  --qos spot-hbv3-120 --job-name "interactive" --cpus-per-task 120 --time 6:00:00 --pty bash
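
Once the allocation is granted you are dropped into a shell on the compute node. A typical session might then look like this (a sketch, reusing the OpenFOAM module and local disk from the examples above):

module purge
module load OpenFOAM/v2012-foss-2020a
cd /mnt/resource        # work on the fast local disk
# ... run your commands interactively ...
exit                    # release the node when you are finished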