Compute Instances

Overview:

Teaching: 10 min
Exercises: 0 min

Questions

What compute instances are available?
How do I choose the right instance?
What do these cost?
How do I see the partitions?
What are Spot and Pay-As-You-Go?

Objectives

Understand what compute instances are currently available
Know where to get information on the compute instances, including costs
Understand the difference between spot and pay-as-you-go
Know which instance to use for your code

Slurm partitions

The new HPC service, Nimbus, gives users access to a range of compute instances. These instances are accessed through different Slurm partitions.

To list the partitions, run the following command:

sinfo

This lists the partitions, the nodes in each partition, their availability and their current state.
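The default listing can get long, so it can help to narrow it down. The sketch below uses standard `sinfo` options (`--summarize`, `--partition` and `--format`) and assumes it is run on a login node where Slurm is installed. The partition name `spot-hbv3-120` is one of the examples used in this lesson; substitute whichever partition you are interested in.

```shell
# Partition to inspect; spot-hbv3-120 is an example name from this lesson.
partition="spot-hbv3-120"

if command -v sinfo >/dev/null 2>&1; then
    # One line per partition, with node counts summarised.
    sinfo --summarize
    # Only the chosen partition: name, availability, CPUs per node, node count, state.
    sinfo --partition="$partition" --format="%P %a %c %D %T"
else
    # Outside the cluster there is nothing to query.
    echo "sinfo not found: run this on a login node where Slurm is installed"
fi
```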

The partitions follow a naming convention:

[pricing tier]-[instance type]-[no of cpus-per-node]

For example, the partition spot-hbv3-120 is for spot priced hbv3 instances with 120 CPUs per node, while paygo-hc-44 is for pay-as-you-go priced hc instances with 44 CPUs per node.
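As a minimal sketch of how the naming convention breaks down, the function below splits a partition name into its three parts using plain shell parameter expansion. `parse_partition` is a hypothetical helper name chosen for illustration; it is not a Slurm or Nimbus command.

```shell
# Split a name of the form [pricing tier]-[instance type]-[no of cpus-per-node].
# parse_partition is a hypothetical helper, not part of Slurm.
parse_partition() {
    name="$1"
    tier="${name%%-*}"          # text before the first hyphen
    cpus="${name##*-}"          # text after the last hyphen
    instance="${name#*-}"       # drop the pricing tier...
    instance="${instance%-*}"   # ...then drop the CPU count
    echo "tier=$tier instance=$instance cpus=$cpus"
}

parse_partition "spot-hbv3-120"
parse_partition "paygo-hc-44"
```

For the two example partitions this prints `tier=spot instance=hbv3 cpus=120` and `tier=paygo instance=hc cpus=44`.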

Compute Instances

The instance types currently available are listed below:

Instance type	CPU model	vCPUs	GPU model	vGPUs
fsv2	Intel Skylake	2,4,8,16,32,48,64,72	-	-
hb	AMD Epyc Naples	60	-	-
hbv2	AMD Epyc Rome	120	-	-
hbv3	AMD Epyc Milan	120	-	-
hc	Intel Skylake	44	-	-
ncv3	Intel Broadwell	6	Tesla V100	1
ncv3	Intel Broadwell	12	Tesla V100	2
ncv3	Intel Broadwell	24	Tesla V100	4
ncv3r	Intel Broadwell	24	Tesla V100	4
ndv2	Intel Skylake	40	Tesla V100	8

As mentioned above, these instance types appear in the Slurm partition names, along with the number of CPUs per node and the pricing tier.

Instance types

You can read more about the instance types, including the specs for each one, here: https://docs.microsoft.com/en-us/azure/virtual-machines/sizes

Choosing the right instance

Choosing the correct compute instance and partition depends primarily on the code or calculation you are running, but also on cost, runtime and whether an interruption is acceptable (more on that in Spot vs Pay-As-You-Go).

The compute instances listed above can be separated into different types, as described by Azure:

High performance compute: hb, hbv2, hbv3 & hc series
GPU enabled: ncv3, ncv3r, & ndv2
Compute Optimised: fsv2

The descriptions of the instances provided by Azure are as follows:

High Performance Compute: Our fastest and most powerful CPU virtual machines with optional high-throughput network interfaces (RDMA).
GPU enabled: Specialized virtual machines targeted for heavy graphic rendering and video editing, as well as model training and inferencing (ND) with deep learning. Available with single or multiple GPUs.
Compute Optimised: High CPU-to-memory ratio. Good for medium traffic web servers, network appliances, batch processes, and application servers.

Briefly:

The hb* instances are suggested for applications driven by memory bandwidth such as OpenFOAM and ANSYS.
The hc* & f* instances are suggested for applications driven by compute such as HPL and ORCA.
The n* instances are for GPU accelerated workloads and visualization sessions.

If in doubt - ask or test!

Spot vs Pay-As-You-Go

The pricing tier mentioned above, which appears in the Slurm partition names, is one of two tiers:

Spot
Pay-As-You-Go

Spot pricing gives access to unused Azure compute capacity at large discounts, up to 90%, compared to Pay-As-You-Go prices. The drawback is that a job can be interrupted and evicted at any time, depending on the available Azure capacity.

Deciding whether to use the Spot or Pay-As-You-Go tier will depend on how easy it is to pick up your calculation should it be evicted, the time scales for your calculation and your available budget.

If your job is evicted you will receive a message in the stdout of the job, and its status in the finance portal will be EVICTED. At present you will not be charged for an evicted job, though this may change in future.
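How easily a calculation can be picked up after an eviction usually comes down to whether it saves its progress. The job script below is a minimal sketch of that idea, not a Nimbus-specific recipe: `run_with_checkpoint` and the file `checkpoint.txt` are hypothetical names, the counting loop stands in for real work, and it assumes you resubmit the same script yourself after an eviction.

```shell
#!/bin/bash
#SBATCH --partition=spot-hbv3-120   # example spot partition name from this lesson
#SBATCH --nodes=1

# run_with_checkpoint FILE TOTAL_STEPS
# Hypothetical helper: records the last completed step in FILE so that a
# resubmitted job continues from there instead of starting again.
run_with_checkpoint() {
    checkpoint_file="$1"
    total_steps="$2"
    step=0
    if [ -f "$checkpoint_file" ]; then
        # A previous run left a checkpoint: resume from it.
        step=$(cat "$checkpoint_file")
    fi
    echo "starting from step $step"
    while [ "$step" -lt "$total_steps" ]; do
        # Placeholder for one unit of real work.
        step=$((step + 1))
        echo "$step" > "$checkpoint_file"
    done
    echo "finished at step $step"
}

run_with_checkpoint "checkpoint.txt" 5
```

Many applications have their own restart or checkpoint options, which are generally preferable to a hand-written loop like this one.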

Restrictions on Instances

The only restriction Research Computing places on the size of jobs you can run on Nimbus is that jobs using the fsv2 instances are limited to one node. These instances are not RDMA connected, so a single job cannot run across multiple nodes, hence the restriction.

All other restrictions are put in place by your Resource Allocation Administrators using the finance portal. Should you wish to change the restrictions, either do so in the portal or contact the person who gave you access to the resource allocation and ask them to do so.

Indicative Cost of Instances

It is important to emphasise that different compute instances will incur different costs.

A cost calculator has been launched that lists the current prices for each compute instance, as well as allowing you to cost up resources for inclusion in funding proposals.

https://cost-calc.hpc.bath.ac.uk/
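For a rough feel of how the pricing tier affects a job, the sketch below estimates a cost as nodes x hours x price per node-hour, less a fractional discount. `job_cost` is a hypothetical helper and the price of 3.60 per node-hour is made up purely for illustration; take real prices from the cost calculator. The 0.90 discount is the "up to 90%" upper bound for Spot, so actual Spot savings may be smaller.

```shell
# job_cost NODES HOURS PRICE_PER_NODE_HOUR [DISCOUNT]
# Hypothetical helper: nodes x hours x price, reduced by a fractional discount.
job_cost() {
    awk -v n="$1" -v h="$2" -v p="$3" -v d="${4:-0}" \
        'BEGIN { printf "%.2f\n", n * h * p * (1 - d) }'
}

# Illustrative price only; look up the current price in the cost calculator.
job_cost 2 10 3.60        # 2 nodes for 10 hours at the full price
job_cost 2 10 3.60 0.90   # the same job at the maximum 90% discount
```

With these made-up numbers the two estimates are 72.00 and 7.20.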

Key Points:

Different compute instances are accessed through different slurm partitions
The Slurm partitions follow the naming convention: [pricing tier]-[instance type]-[no of cpus-per-node]
There are two pricing tiers: Spot and Pay-As-You-Go. Spot benefits from significant discounts (up to 90%), but may be evicted
Different instances are suitable for different calculation types and will cost different amounts
