Slurm Compute Cluster

Slurm is a job scheduling system that allows the available compute resources to be used efficiently. Please note the MaintenanceWindow.

Usage Instructions

Important: The Slurm cluster is restarted on the first Wednesday of every odd month. This will kill all running jobs. Post-reboot, Slurm will attempt to automatically restart these jobs. Please keep this in mind when designing your jobs.

  1. Connect to a Slurm login node:
    user@host > getserver -sb
  2. Have a look at the Quick Start Guide and our FAQs to learn more about Slurm.
  3. Use the sbatch or srun command to start a job (see the examples below).
  4. Follow this guide to optimize your resource usage. To be clear: This is in your interest!
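
For a quick first test, a job can be started directly from a login node. This is a minimal sketch; the script path is just a placeholder:

# Run a short test interactively; srun blocks until the command has finished
srun --mem 1G --time 10 hostname

# Submit a batch script; sbatch returns immediately and prints the assigned job ID
sbatch --mem 1G --time 10 /path/to/my_batch_script.sh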

Note: For very large computations, larger Slurm clusters are available at the MPCDF and the GWDG.

Request access to the Slurm cluster of the MPCDF

  1. Go to https://selfservice.mpcdf.mpg.de/register/antrag.php?inst=MNPF&lang=en and select the person who should approve your request.
  2. The selected person must then approve the request.

Optimizing resource usage

To avoid wasting resources and to give others a chance to run their jobs in a timely manner, it's important to request realistic amounts of resources. Additionally, optimizing the requested resources can decrease the time your job has to wait to be scheduled.

  1. Run your job with a generous initial resource request so that it completes successfully.
  2. Check the resources it actually used, as described in the FAQ section "How to get stats about used resources?".
  3. Estimate the required resources (including some buffer) for future jobs, as sketched below.
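
As a hypothetical example: if seff reports that a job used only about 5 GB of the requested 48 GB, the memory request of future runs can be reduced to the observed peak plus some buffer (the values here are purely illustrative):

# First run: generous guess
#SBATCH --mem 48G

# Future runs: observed peak usage plus some buffer
#SBATCH --mem 8G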

FAQs


What are the specs of the compute nodes?


At the moment there are 7 compute nodes with identical specifications available in the "all" partition:
CPU: 4 x Intel Xeon Gold 6148 (20 cores, 2.4 GHz)
Cores: 80 in total
Memory: 3 TB
GPU: 1 x NVIDIA RTX A4000
Network: 10 Gbit/s
Local temp storage: 8 TB


Back to FAQ start


What does a basic batch script look like?


To effectively schedule jobs, it's recommended to use the sbatch command because it queues the job and immediately returns to the shell. In contrast, srun waits until the resources are allocated and the job has completed, which risks job failure if the shell or SSH session is closed; it is, however, suitable for testing. In batch scripts, it is common practice to include the Slurm parameters, which simplifies starting jobs with identical resources.

#!/bin/bash

# This script demonstrates how to submit a job using Slurm.

# Slurm directives are prefixed with #SBATCH. They define job resource requirements
# and settings. It's more convenient to include these directives in the script
# rather than passing them as command-line arguments to sbatch every time.

# Always specify the amount of memory and the runtime limit to prevent wasting resources.
# The number of CPUs is optional; if not specified, it defaults to 1.

#SBATCH -c 2                            # Request 2 CPU cores.
#SBATCH --mem 5G                        # Request 5 GB of memory.
#SBATCH --time 60                       # Set a runtime limit of 60 minutes.

# Replace 'my_command' and 'my_command2' with actual commands or scripts to run.
my_command --param1
my_command2 --paramX
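
Assuming the script above has been saved as my_batch_script.sh (the name is arbitrary), it could be submitted and monitored like this:

# Submit the script; sbatch prints the assigned job ID
sbatch my_batch_script.sh

# List your own pending and running jobs
squeue -u $USER

# By default, stdout and stderr are written to slurm-<jobid>.out in the directory
# the job was submitted from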


Back to FAQ start


How can I process multiple datasets in parallel?


To process multiple datasets in parallel, consider setting up individual jobs for each dataset to execute these jobs across various compute nodes. This approach involves creating a batch script that accepts a dataset identifier as an input parameter and a launcher script containing a for-loop to start a job for each dataset.

Launcher script for starting jobs from a given list of datasets:
#!/bin/bash

# Run this script on the submit server.

# List of datasets to be processed
datasets=("dataset1" "dataset2" "dataset3")

# Loop through each dataset and submit a job
for ds in "${datasets[@]}"
do
  # Submit the job, passing the dataset name to the batch script
  sbatch my_batch_script.sh "$ds"
done

my_batch_script.sh:
#!/bin/bash
#
#SBATCH --time 30                       # Request a maximum runtime of 30 minutes
#SBATCH -c 2                            # Request 2 CPU cores
#SBATCH --mem 10G                       # Request 10 GB of memory

# Capture the dataset name from the first command-line argument
dataset="$1"

# Execute the computation with the specified dataset
some_command --input "$dataset"
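
As an optional variation, each job can be given its own log file by passing -o to sbatch in the launcher loop. The logs directory used here is just an example and has to exist beforehand:

# Inside the for-loop of the launcher script
sbatch -o "logs/${ds}.out" my_batch_script.sh "$ds"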


Back to FAQ start


How to get stats about used resources?


To get a rough overview of the used resources of a finished job, you can use the seff command:

user@host:/ > seff $JOBID
Job ID: 6944
Cluster: mpicbs
User/Group: thenmarkus/users
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:30:24 core-walltime
Job Wall-clock time: 00:00:38
Memory Utilized: 19.22 MB
Memory Efficiency: 0.04% of 48.00 GB

Unfortunately, this information is only sampled every 30 seconds while the job is running. This does not matter much for CPU usage, but short memory peaks may be missed. To get the real maximum memory usage, simply add the following command to the end of your batch script:

check_ComputeClusterSlurm_memory-usage

The command will simply print the peak memory usage to the output file of your job.
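
For illustration, the end of such a batch script could then look like this (the command name is taken verbatim from above; the computation is a placeholder):

# ... your actual computation ...
some_command --input mydata

# Print the peak memory usage of this job to the job's output file
check_ComputeClusterSlurm_memory-usage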

Back to FAQ start


How can I run multiple job steps inside a single job?


Slurm allows for the execution of job steps within a single job allocation, facilitating complex computations efficiently. However, for tasks like processing multiple independent datasets, scheduling separate jobs is often more practical.

Here is a simple example for using job steps:

#!/bin/bash

#SBATCH --ntasks 2                      # Request 2 tasks/job steps to run in parallel
#SBATCH -c 12                           # Each task requires 12 cores (24 in total)
#SBATCH --mem-per-cpu 1G                # Allocate 1GB of memory per CPU (12GB per task)
#SBATCH --time 60                       # Set a runtime limit of 60 minutes
#SBATCH -o /data/pt_12345/%j.out        # Redirect stdout to a file named after the job ID
#SBATCH -e /data/pt_12345/%j.err        # Redirect stderr to a file named after the job ID
#
# NOTE: Use %j as a placeholder for the job ID. Store log files in /data to avoid local
# storage in /tmp on compute nodes.

# Display the number of nodes allocated for the job
echo "Number of nodes: $SLURM_JOB_NUM_NODES"

# Run the hostname command for each allocated task to identify the node
for x in $(seq 1 $SLURM_NTASKS)
  do
    srun -n 1 hostname &
  done
wait

# Start 4 job steps. With 2 tasks requested and each job step requiring one task (-n 1),
# 2 job steps will run in parallel, and the others will queue until resources are available.
#
# IMPORTANT: Ensure job steps do not exhaust specific resources to enable parallel execution.
# Each job step is allocated 12 cores and 12 GB of memory by specifying -n 1. Note that if you
# had used --mem instead of --mem-per-cpu in the job definition, you would also have to specify
# the required amount of memory for each job step.

for x in part1 part2 part3 part4
  do
    srun -n 1 ./my_script $x &
  done

# Wait for all background jobs to complete
wait
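
After the job has finished, the individual job steps and their resource usage can be inspected with sacct (replace <jobid> with your job ID):

# List all steps of the job together with their runtime, peak memory and state
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State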


Back to FAQ start


How can I use the software environments of the institute?


The MPI CBS offers software environments for accessing specific scientific software versions (e.g., Matlab), always named in uppercase (e.g., MATLAB). Available Linux software is listed here, and more about environments can be found here.

Initialize environments within your job, as Slurm jobs do not run on the submit server. There are two methods:

Option 1: In your batch script, prefix each command that needs a specific environment with the environment name. This approach is useful when only a few commands require an environment, eliminating the need for a secondary shell script.

#!/bin/bash

#SBATCH --time 30                       # Request 30 minutes runtime
#SBATCH -c 2                            # Request 2 CPU cores
#SBATCH --mem 10G                       # Request 10 GB memory

# Command without using a software environment
echo "Job has started"

# Command using the MATLAB environment
MATLAB --version 9.10 matlab -nodesktop --some-parameter

# Command using the FREESURFER environment
FREESURFER freesurfer --some-parameter

# Another command using the MATLAB environment
MATLAB --version 9.10 matlab -nodesktop --some-other-parameter

# Combining SPM and MATLAB environments
SPM MATLAB some_command

Option 2: Call another script from your batch script and wrap it with the required environments. This is efficient for complex pipelines.

#!/bin/bash

#SBATCH --time 30                       # Request 30 minutes
#SBATCH -c 2                            # Request 2 CPU cores
#SBATCH --mem 10G                       # Request 10 GB memory

# Execute another script and wrap it with all required environments
# All executables and libraries of the given environments can then be used in the script
SPM MATLAB --version 9.10 FREESURFER /path/to/my_script.sh

Contents of my_script.sh:

#!/bin/bash

# Commands and libraries utilizing initialized environments

matlab -nodesktop --some-parameter
freesurfer --some-parameter
matlab -nodesktop --some-other-parameter


Back to FAQ start


How can I request GPUs?


Our cluster offers several GPUs. You can specify the GPU count, the minimum GPU memory size, and optionally the GPU architecture and/or CUDA version.

To check available GPUs:

sinfo -N -o "%.12P %.12N %.4c %.10m %.30G %.20f %N"

Partial GPU requests are not supported. Examples for requesting GPUs at the institute:

# Please note that the following parameters work for sbatch, srun and salloc in the same way
# and that you will have to add generic parameters like --mem, --time, etc.

# Request a single GPU (on nodes with multiple GPUs, this restricts access to just one GPU)
sbatch --gpus 1 /path/to/myscript.sh

# Request two GPUs on the same node
sbatch --gpus 2 /path/to/myscript.sh

# Request a GPU and filter for servers having CUDA 12.0 installed
# HINTS:
#  - The CUDA software environment can be used to initialize CUDA in your job.
#  - Have a look at the "Use Software Environments" section on this page.
sbatch --gpus 1 --constraint cuda12.0 /path/to/myscript.sh

# Request a GPU which has at least 20,000 MB of memory
# HINTS: 
#  - The --gpus parameter is essential to actually allocate a GPU; the gpu_mem filter merely specifies your memory requirement.
#  - The filter does not limit the available amount of memory. If you get a GPU with 40 GB, you can use all of it.
sbatch --gpus 1 --gres gpu_mem:20000M /path/to/myscript.sh

# Request a GPU with a specific architecture and a minimum of memory
# HINTS:
#  - The GPUs are categorized by their architecture (turing, ampere, etc.)
#  - You cannot filter for a specific GPU model (e.g. A40, RTX 2080, etc.)
#  - Find more information about the --gres parameter here: https://slurm.schedmd.com/srun.html#OPT_gres
sbatch --gpus ampere:1 --gres gpu_mem:20000M /path/to/myscript.sh
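
To verify inside a job which GPUs were actually allocated, you can print the CUDA_VISIBLE_DEVICES variable that Slurm sets for GPU jobs and list the visible devices with nvidia-smi (a minimal sketch):

# Add these lines to your batch script
echo "Allocated GPU IDs: $CUDA_VISIBLE_DEVICES"
nvidia-smi -L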


Back to FAQ start


How can I use servers owned by my or another department/group?


In addition to the main IT servers, the Slurm cluster includes group servers purchased by specific departments or groups. To prioritize these groups, their servers are placed in distinct partitions. By default, jobs are not scheduled on group servers, since the default partition contains only central IT servers.

Each group server belongs to two extra partitions: one for its owning department (e.g. gr_weiskopf) and the group_servers partition, which includes all group servers. To prioritize your jobs and minimize waiting time for resources, schedule your jobs on servers owned by your group using the corresponding partition. For even higher priority, include both the default (all) and group_servers partitions in your job definition. Slurm will then allocate resources from the quickest available partition. If there is no partition for your group, use the all and group_servers partitions to access group servers when available.

Note that jobs submitted through the "group_servers" partition have lower priority and time limits, but Slurm ensures rapid resource allocation across the requested partitions.

Usage examples:

# List available partitions
sinfo

# Submit job to run on a central IT server by default as no other
# partitions are requested
sbatch /path/to/myscript.sh

# Submit job to any server in the cluster, including group-specific servers.
sbatch -p all,group_servers /path/to/myscript.sh

# Submit job with priority on department's nodes (e.g. gr_weiskopf) and access to central servers if available.
sbatch -p all,group_servers,gr_weiskopf /path/to/myscript.sh
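
To check which partition and node a pending or running job has been assigned to, squeue can be used, for example:

# Show your own jobs with job ID, partition, state and node (or pending reason)
squeue -u $USER -o "%.10i %.15P %.10T %R"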


Back to FAQ start


How can I monitor the GPU utilization of my job?


In certain situations, it's beneficial to track the GPU utilization of your job, such as monitoring the GPU memory usage or identifying performance bottlenecks. This can be achieved using the nvidia-smi tool. You should start it at the beginning of your batch script and kill it upon completion. It will continuously log some metrics of the GPU to a given file while your job is running.

This is what your batch script could look like:

#!/bin/bash

#SBATCH --time 60                            # Request 1 hour of runtime
#SBATCH --mem 1G                             # Request 1 GB memory
#SBATCH --gpus 1                             # Request a GPU

# Start nvidia-smi to monitor the GPU in the background and store the process ID to kill it later
# Please change the path "/path/to/nvidia_smi_output.csv"
# The sampling interval is set to 1 second via the parameter "-l 1"; depending on the duration of your task, you may want to increase it.
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,temperature.gpu,memory.free,memory.used --format=csv -l 1 > /path/to/nvidia_smi_output.csv &
NVIDIASMI_PID=$!

# DO YOUR COMPUTATIONS HERE #

# Terminate nvidia-smi
kill $NVIDIASMI_PID


Back to FAQ start


How to use temporary local storage?


Depending on your setup, using local temporary storage on a node can be beneficial. It offers ample space, high performance, and ease of use. However, remember that this temporary space is shared across all jobs on a node. When allocating resources, you can only request a node with a minimum amount of total temporary space, as the scheduler cannot track the current usage. Therefore, it's crucial to clean up your temporary data after your job completes, as Slurm does not automatically do this.

Here is a batch script example that requests a node with at least 1 TB of temporary space, creates a job-specific subfolder for the current job ID, performs tasks, and then cleans up:

#!/bin/bash

#SBATCH --tmp 1024G                           # Request a node with at least 1024 GB of temporary storage
#SBATCH -c 1                                  # Request 1 CPU core
#SBATCH --mem 1G                              # Request 1GB of RAM
#SBATCH --time 60                             # Set maximum job runtime to 60 minutes

# Slurm stores the path to the temporary storage in an environment variable
# Create a new directory for the current job within the temporary storage.
job_tmp_dir=$TMPDIR/$SLURM_JOB_ID
mkdir $job_tmp_dir

# Execute some work, passing the temporary directory as an argument.
srun ./my_script --tmp $job_tmp_dir

# Clean up the temporary data at the end of the job.
rm -fr $job_tmp_dir
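
As an optional hardening of the script above, the cleanup can be registered with a trap right after creating the directory, so it also runs if the script exits early due to an error:

job_tmp_dir=$TMPDIR/$SLURM_JOB_ID
mkdir "$job_tmp_dir"

# Remove the temporary data when the script exits, no matter how it exits
trap 'rm -rf "$job_tmp_dir"' EXIT

srun ./my_script --tmp "$job_tmp_dir"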


Back to FAQ start

