Running jobs on CCB cluster

Overview

Access to the shared CCB cluster is managed by Slurm workload manager. This has three advantages over the login nodes:

More memory. Cluster systems offer up to 2TB memory, permitting larger data sets. This can help with situations where your program is running out of memory, or you receive emails reporting too much memory used.
More parallel processing. You can run multiple instances of a program on different data sets, or different programs, or mix of both, all simultaneously.
Faster processing. Your program may be able to leverage the additional CPU cores to reduce processing time but this is application-dependent.

Slurm manages cluster access by putting jobs in a queue. A job defines a set of resource requirements, and a piece of work to run on those resources. These resources are: CPUs (or GPUs), memory, and time. If you have not specified these, then queue defaults are used which may be much greater than your work needs.

This queue is not simply first-in first-out - instead Slurm considers resource requirements and fair-share policy. When Slurm selects a job to run, it runs on any one of many servers called nodes. You get just the resources you asked for (CPUs and memory), and Slurm runs your job as if it were you, in the location you submitted the job from. With many nodes, then many jobs from many users can run simultaneously.

The Slurm queues

In order to help ensure that jobs of different sizes all get a fair chance to run on the cluster, JADE has 5 primary queues.

The 3 CPU queues are:

Name	Max CPUs	Max memory	Default time	Max time
`test`	8	50 GB	10 minutes	10 minutes
`short`	120	1850 GB	1 hour	1 day
`long`	128	1900 GB	1 hour	1 week

We also have 2 primary GPU queues. If you do not have a GPU preference, you can use gpu queue.

Name

Model

Architecture

Max

Cards

GPU

Memory

TFLOPS

Max

CPUs

Max

Memory

Default

time

Max

time

gpu-turing

NVIDIA

Titan RTX

Turing

CUDA cc 7.5

4

24 GB

/card

32.6

FP16

40

250 GB

1 hour

1 day

gpu-ada

NVIDIA

L40S

Ada Lovelace

CUDA cc 8.9

4

48 GB

/card

91.6

FP16

192

1900 GB

1 hour

1 day

If your job requires more than the maximum time, then contact us directly to discuss your requirements so that we can balance your request with the needs of other users.

Query commands

sinfo

List each queue. https://slurm.schedmd.com/sinfo.html

$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
long          up 7-00:00:00      5    mix imm-wn[1-5]
short*        up 1-00:00:00      2    mix imm-wn[6-7]
test          up      10:00      2    mix imm-wn[6-7]
gpu           up 1-00:00:00      2    mix imm-gn[1-4]
gpu-turing    up 1-00:00:00      1    mix imm-gn1
gpu-ada       up 1-00:00:00      1    mix imm-gn[2-4]

sbusy

Our custom tool to give more detail on cluster resource availability:

$ sbusy

Red = fully allocated to jobs. Yellow = near-full. Green = good availability.

squeue

List your running or pending jobs. https://slurm.schedmd.com/squeue.html

$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
xxxxxxx      long     mut5  xxxxxxx  R      11:22      1 imm-wn2
xxxxxxx      long     move  xxxxxxx  R      30:50      1 imm-wn3
xxxxxxx short,lon     mut3  xxxxxxx PD       0:00      1 (Resources)
xxxxxxx     short    mut12  xxxxxxx PD       0:00      1 (Priority)

Job state R = running, PD = pending. The last column shows either the node the job is running on, or the reason it's waiting - Resources means waiting for free resource, and Priority means waiting behind a higher-priority job.

sacct

Inspect all jobs particularly those that have ended, and see their status. https://slurm.schedmd.com/sacct.html

$ sacct -u `whoami` -S "now-1days" -E "now" -o "JobID,JobName,Start,Elapsed,State,Reason"
JobID           JobName               Start    Elapsed      State                 Reason
------------ ---------- ------------------- ---------- ---------- ----------------------
2785216      run_job.sh 2025-01-29T20:02:24   00:10:19  COMPLETED                   None
2785216.bat+      batch 2025-01-29T20:02:24   00:10:19  COMPLETED
2787951      web_summa+ 2025-01-30T12:30:02   00:00:03     FAILED                   None
2787951.bat+      batch 2025-01-30T12:30:02   00:00:03     FAILED
2788251      web_summa+ 2025-01-30T15:15:02   02:14:48    RUNNING                   None 
2788251.bat+      batch 2025-01-30T15:15:02   02:14:48    RUNNING                        
2788252      web_summa+             Unknown   00:00:00    PENDING                   None 
2788253      web_summa+             Unknown   00:00:00    PENDING                   None

View your failed jobs with -s argument:

$ sacct -u `whoami` -S "now-1days" -E "now" -s "F,NF,OOM,S,TO" -o "JobID,JobName,Start,Elapsed,State,Reason"

Scheduling

Job priority

We aim to apply a fair-use policy to the Slurm queue, in which everyone gets a share of the available cluster time.

If you currently have no jobs, your first job will have maximum priority 2500.
For each extra job you add to queue, its priority is decremented:
```
job priority = 2500 - quantity of older jobs
```

The general outcome is those submitting fewer jobs will have them start sooner compared to users submitting many jobs. But there are two important caveats:

Jobs that request fewer resources - CPUs, memory, time - will generally start sooner because easier to schedule, even if a lower priority.
We cannot guarantee when your job will start but can provide an estimate.

Job start time

Command squeue can print more columns via its --format argument, including scheduled start time for pending jobs:

$ squeue --format="%.8j|%i|%P|%Y|%S|%l|%r" | column -s'|' -t
NAME      JOBID    PARTITION   SCHEDNODES  START_TIME           TIME_LIMIT  REASON
mut5      xxxxxxx  long        (null)      2025-09-09T12:42:57  1-00:00:00  None
move      xxxxxxx  long        (null)      2025-09-09T12:23:29  1-00:00:00  None
mut3      xxxxxxx  short,long  imm-wn2     2025-09-09T13:35:59  1-00:00:00  Resources
mut12     xxxxxxx  short       (null)      N/A                  1-00:00:00  Priority

If start time is N/A, then job cannot be scheduled yet. Consider start time as an approximation, because it changes as other jobs finish or are submitted:

your job can start sooner if a running job finishes early
your job can be delayed if a higher-priority job, or a smaller job, is submitted

To make typing that command easier, store it in a Bash alias:

$ alias q="squeue --format='%.8j|%i|%P|%Y|%S|%l|%r' | column -s'|' -t"
$ q
NAME      JOBID    PARTITION   SCHEDNODES  START_TIME           TIME_LIMIT  REASON
mut5      xxxxxxx  long        (null)      2025-09-09T12:42:57  1-00:00:00  None
...

Managing congestion

We have added custom traffic management to Slurm scheduling to ensure jobs are on the appropriate queue, and to allow a controlled number of short jobs to used spare resources in other queues. This means you may see your jobs running on a queue different to originally requested.

appropriate queuing

short-time jobs (< 1 day) on the long queue are moved to the short queue, where they should start sooner
jobs on GPU queues that did not specify number of GPUs are moved to short queue

spare resources

A controlled number of short-time jobs may use spare resources on other queues:

if the next long job start is >1 day in future, then short jobs may use spare resources in long queue
if gpu-ada servers have spare resources after GPU reservations, then short CPU jobs may use it. We reserve 1TB memory and ~50 CPUs for GPUs, leaving a similar amount free. This is practical because the gpu-ada servers are very capable, similar to the systems in other queues (ignoring the GPUs of course).

Slurm Job script

Create

Each job that you want to run is described in a job script. This defines what resources to reserve for job, then what to do when it runs. For example:

I want: to run STAR on this data
Deliver it to: 4 CPUs and 10GB memory for 2 hours

The job script might look like this:

#!/bin/bash

# Resources:
#SBATCH --time=0-02:00:00  # DAYS-HOURS:MINUTES:SECONDS
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G
#SBATCH --partition=short

# Environment:
#SBATCH --export=NONE

# Email when job completes or fails
#SBATCH --mail-type=END,FAIL

# What to run:
module load rna-star

STAR --genomeDir /databank/igenomes/Drosophila_melanogaster/UCSC/dm3/Sequence/STAR/ \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesIn C1_R1_1.fq.gz C1_R1_2.fq.gz \
     --readFilesCommand zcat \
     --outSAMattributes All \
     --outFileNamePrefix C1_R1_ \
     --limitBAMsortRAM 7000000000 \
     --runThreadN 4

The top 5 lines tell Slurm what resources we think the job needs - time, CPU cores, memory, then which queue to submit to. If these are not specified then Slurm uses default values. Slurm will first look for default values in our system settings, such as queue times. If not defined in system, then Slurm will use the defaults built into sbatch. For more information on estimating the resource requirements of a job, please refer to profiling.

The 6th #SBATCH line overrides the default value of --export=ALL, so that your current Bash environment is not exported to your job. The 7th #SBATCH line, optional, sends you an email when specified job events occurred, in this case when job stops. Other event triggers are available.

The remainder of script is just list of commands to run when the job starts. Note that the backslash \ character is Bash's line continuation character, used to split long lines for legibility.

Requested resources are reserved for your running job regardless of what it actually uses, and Slurm only considers the requested resources in scheduling decisions, so think carefully what your requirements are and consider running a test job first to measure them. Requesting fewer resources will generally mean less time waiting in queue.

Ensure you regularly review your jobs resource efficiency

Submit

Basic job submission is with the command sbatch, so a simple minimal job submission could be just:

sbatch ./jobscript.sh

but you can also specify a partition (queue), number of nodes, and amount of memory (if not already in script) like so:

sbatch --partition=short --ntasks=1 --mem=10G ./jobscript.sh

Interactive job

There may be times when you really need to interactively run some code with access to more CPU and memory, for example when testing a new pipeline. You achieve this with the srun command:

srun --partition=short --cpus-per-task=4 --mem=32G --pty bash -i

Please note this reserves resources like a batch job, so avoid requesting more than necessary. Further, interactive jobs are only permitted on the test, short and gpu queues. Sessions left idle for extended periods will be terminated without notice.

Cancel

To stop a job, use scancel and the relevant job ID, for example: scancel 342552

GPU job

If you are new to GPUs, please complete our GPU tutorial. This will remove your GPU restrictions, becoming able to use the modern GPUs in the gpu-ada queue.

Responsible use

GPUs are a popular but limited resource, so please take extra care to ensure you use them responsibly and efficiently.

1) Be sure your program is using the GPU resources. Test in an interactive job, monitoring with nvidia-smi program. We have plans for batch jobs to also report GPU utilisation, but for now only nvidia-smi can.

2) Request an appropriate amount of GPU resources. Assess how much GPU memory your work needs - our gpu-turing GPUs have half memory of gpu-ada and usually less busy.

To submit a job to our GPU node you need to specify two additional parameters - the gpu partition and the number of GPUs you need using the --gpus argument. Confirm your job has been given gpu devices with scontrol:

$ scontrol show job $JOBID
   ...
   TresPerJob=gres:gpu:1

CUDA software

CUDA driver 12.2 is preinstalled in each GPU node. To load CUDA libraries and compilers then simply load one of our CUDA modules: module load cuda

Note: our CUDA module is a renamed NVIDIA HPC SDK toolkit. This is: CUDA library + NVIDIA compilers + math libs + comm libs. These NVIDIA compilers may not compile your particular software. The clue will be errors like:

nvc-Error-Unknown switch: -fwrapv

To revert to normal compilers:

unset CC CXX FC F90 F77 CPP

For all our NVIDIA modules, run module load NVIDIA-all. This loads: CUDA, cuDNN, NCCL, and RAPIDS.

Troubleshooting

requeued in held state

If we need to urgently reboot a cluster node where you have a job running, as this will interrupt your job then we attempt to requeue it. And to give you a chance to review input/output files before job restarts, we requeue into a held state:

$ squeue --format="%j, %i, %T, %r" | column -s, -t
NAME        JOBID     STATE     REASON
norm_ctcf   3065092   PENDING   job requeued in held state

To remove the holding state, just run: scontrol release 3065092

Performance

CPUs

Take care to ensure that the number of threads/processes your software launches does not exceed the requested CPUs. Nothing prevents this, and what will happen is your threads/processes compete for the insufficient CPUs allocated, causing severe performance degradation as they continously interrupt each other. Your job script can easily discover how many CPUs it has with nproc e.g.:

$ nproc
4

$ MY_NCPUS=$(nproc)
$ echo "My job has been given $MY_NCPUS CPUs"
My job has been given 4 CPUs

Storage

In addition to Ceph storage, each Slurm job has access to 30TB of local storage within each CPU node. This is called scratch as it is intended for temporary data, so ideal for your pipeline to store intermediate files.

Moving your intermediate files to scratch provides two advantages. Project quotas do not apply to scratch, just be aware you may be sharing the 30TB with other jobs. And the cache layer means for most cases, file operations will complete much faster than Ceph. Moving to scratch also reduces congestion in Ceph. Just remember to store your final data in Ceph before job ends.

Scratch access is managed by Slurm - each Slurm job is given a private folder for the duration of the job. We use Linux & Slurm features to map this private folder to /tmp, so that your jobs simply have to access folder /tmp instead of Ceph. (Before we enabled the remap, you had to reference variable $TMPDIR which pointed to somewhere in /var/scratch/$user/)

To help explain, consider this example:

what your job sees:

echo "hello world" > /tmp/hello-world.txt
$ ls /tmp
hello-world.txt
$ echo $TMPDIR
/tmp/

where file really is:

$ sudo ls /var/scratch/slurm-tmpfs/imm-wn7/3088789/.3088789
hello-world.txt

When your job ends, Slurm automatically deletes the folder.

Job profiling

You have two tools for reviewing how efficiently your jobs use the requested resources.

reportseff provides a quick broad summary:

JobID      State     Elapsed    NCPUS   CPUEff    ReqMem  MemEff 
xxxxxxx  COMPLETED   00:16:19    10     19.7%       100G  58.0%
...

charts in job logs showing resource use over time

                    CPU (cores)
  8 +----------------------------------------------------+
  6 |-+             ########                           +-|
  5 |-+       ######        #####                      +-|
  4 |-+     ##                   ###########           +-|
  3 |-+   ##                                ########   +-|
  2 |-+  #                                          #  +-|
  1 |-+ #                                            # +-|
    |#     +    +   +    +    +    +    +     +    + #   |
  0 +----------------------------------------------------+
    0      10   20  30   40   50   60   70    80   90  100
                               Job profiling step

Please review our in-depth guide to job profiling.

Advanced

Job arrays

If you find yourself in the situation of running the same processing on many data files, then consider a Slurm job array. This is a single Slurm script that launches many jobs, and the new Bash variable SLURM_ARRAY_TASK_ID helps you direct each to its data. This helps you organise, and helps Slurm manage its queue.

#!/bin/bash
#SBATCH --array=0-9%2      # 10 jobs total, 2 can run concurrently
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2

#SBATCH --mem-per-cpu=10G  # 10G memory per CPU = 20G total for job

# New variable '$SLURM_ARRAY_TASK_ID' is number 0 -> 9
process ./input_$SLURM_ARRAY_TASK_ID --output ./out_$SLURM_ARRAY_TASK_ID

If you know that different jobs in array have different resource requirements, and which they are, then you can fine-tune resources when submitting. Example:

give first two jobs more memory (overrides contents of my-array.sbatch):
```
sbatch --array=0-1%2 --mem-per-cpu=40G my-array.sbatch
```
launch the rest of array:
```
sbatch --array=2-9%2 my-array.sbatch
```

Our Slurm is configured with two limits that you may encounter when submitting 1000s of jobs:

MaxArraySize = 50,000 - maximum size of a single job array
MaxJobCount = 100,000 - maximum number of jobs that can exist in Slurm's 24-hour memory

Slurm's memory is not just active + pending jobs in the queue, but also recently-completed jobs. This gives Slurm the information it needs to prioritise light users over heavy users. If you submit many jobs that would exceed MaxJobCount then you will see this message:

sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying

The solution is to submit fewer, or to wait for Slurm to flush old entries from its 24-hour memory.

Job dependencies

If you have a pipeline with very different resource requirements over its lifetime - e.g. many CPUs at start then lots of memory at end - you can use a Slurm job dependency to split your one job into several linked sub-jobs. Here is a very simple example:

Pipeline stage 1 job

#!/bin/bash
#SBATCH --job-name=stage1
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=20
#SBATCH --mem=20G
#SBATCH --partition=short
echo "First job started running at `date`"
sleep 20

Pipeline stage 2 job

#!/bin/bash
#SBATCH --job-name=stage2
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=100G
#SBATCH --partition=short
echo "Second job started running at `date`"
sleep 5

Submit the first job and capture the assigned job ID. Then pass the job ID to Slurm when submitting the second job. The second job will not begin until after the first has finished.

jid=$(sbatch job1.sbatch | awk '{print $4}')
sbatch --dependency=$jid job2.sbatch
squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1924250     short   stage2 aowenson PD       0:00      1 (Dependency)
           1924249     short   stage1 aowenson  R       0:03      1 imm-wn7

As each job can run on a different cluster node, do not rely on transferring data via TMPDIR because it is not synchronised between nodes (and also it is automatically deleted at end of job).