Connect to a submit server:

```
user@host > getserver -sb
```

Then use the `sbatch` or `srun` command to start a job.
Hardware of the `all` partition:

| Component | Specification |
|---|---|
| CPU | 4 x Intel Xeon Gold 6148 (20 cores, 2.4 GHz) |
| Cores | 80 in total |
| Memory | 3 TB |
| GPU | 1 x NVIDIA RTX A4000 |
| Network | 10 Gbit/s |
| Local temp storage | 8 TB |
The recommended command is `sbatch`, because it queues the job and immediately returns to the shell. In contrast, `srun` waits until the resources are allocated and the job has completed; this risks job failure if the shell or SSH session is closed, but makes it suitable for testing. In batch scripts, it is common practice to include the Slurm parameters as `#SBATCH` directives, which simplifies starting jobs with identical resources:
```bash
#!/bin/bash
# This script demonstrates how to submit a job using Slurm.
# Slurm directives are prefixed with #SBATCH. They define job resource requirements
# and settings. It's more convenient to include these directives in the script
# rather than passing them as command-line arguments to sbatch every time.
# Always specify the amount of memory and the runtime limit to prevent wasting resources.
# The number of CPUs is optional; if not specified, it defaults to 1.
#SBATCH -c 2        # Request 2 CPU cores.
#SBATCH --mem 5G    # Request 5 GB of memory.
#SBATCH --time 60   # Set a runtime limit of 60 minutes.

# Replace 'my_command' and 'my_command2' with actual commands or scripts to run.
my_command --param1
my_command2 --paramX
```
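To see the difference between the two submission commands in practice (a minimal sketch; the resource values and script name are placeholders):

```bash
# sbatch queues the script, prints the job ID, and returns immediately
sbatch my_batch_script.sh

# srun blocks until the resources are allocated and the command has finished
srun -c 2 --mem 5G --time 10 hostname
```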
To run the same computation on several datasets, you can submit one job per dataset from a small loop script:

```bash
#!/bin/bash
# Run this script on the submit server.

# List of datasets to be processed
datasets=("dataset1" "dataset2" "dataset3")

# Loop through each dataset and submit a job
for ds in "${datasets[@]}"
do
    # Submit the job, passing the dataset name to the batch script
    sbatch my_batch_script.sh "$ds"
done
```

`my_batch_script.sh`:
```bash
#!/bin/bash
#
#SBATCH --time 30   # Request a maximum runtime of 30 minutes
#SBATCH -c 2        # Request 2 CPU cores
#SBATCH --mem 10G   # Request 10 GB of memory

# Capture the dataset name from the command line argument
dataset="$1"

# Execute the computation with the specified dataset
some_command --input "$dataset"
```
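The same pattern can also be expressed with a standard Slurm job array, which avoids the submit loop entirely (a sketch; whether job arrays are the preferred pattern on this cluster is not covered here):

```bash
#!/bin/bash
#SBATCH --time 30    # Request a maximum runtime of 30 minutes
#SBATCH -c 2         # Request 2 CPU cores
#SBATCH --mem 10G    # Request 10 GB of memory
#SBATCH --array 0-2  # Spawn one array task per dataset

# Map the array index (0, 1, 2) to a dataset name
datasets=("dataset1" "dataset2" "dataset3")
some_command --input "${datasets[$SLURM_ARRAY_TASK_ID]}"
```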
You can check how efficiently a finished job used its allocated resources with the `seff` command:

```
user@host:/ > seff $JOBID
Job ID: 6944
Cluster: mpicbs
User/Group: thenmarkus/users
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 24
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:30:24 core-walltime
Job Wall-clock time: 00:00:38
Memory Utilized: 19.22 MB
Memory Efficiency: 0.04% of 48.00 GB
```

Unfortunately, this information is only gathered every 30 seconds while the job is running. This doesn't matter for CPU usage, but it may miss the peak memory usage. To get the real maximum memory usage, simply add the following command to the end of your batch script:
```
check_ComputeClusterSlurm_memory-usage
```

The command will simply print the peak memory usage to the output file of your job.
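For example, placed at the end of a batch script (`my_command` and its input are placeholders):

```bash
#!/bin/bash
#SBATCH --mem 10G   # Request 10 GB of memory
#SBATCH --time 30   # Request 30 minutes runtime

my_command --input mydata

# Last step: print the job's peak memory usage to the job output file
check_ComputeClusterSlurm_memory-usage
```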
The following example shows how to run multiple job steps in parallel within a single job:

```bash
#!/bin/bash
#SBATCH --ntasks 2          # Request 2 tasks/job steps to run in parallel
#SBATCH -c 12               # Each main task requires 12 cores (24 in total)
#SBATCH --mem-per-cpu 1G    # Allocate 1 GB of memory per CPU (12 GB per task)
#SBATCH --time 60           # Set a runtime limit of 60 minutes
#SBATCH -o /data/pt_12345/%j.out   # Redirect stdout to a file named after the job ID
#SBATCH -e /data/pt_12345/%j.err   # Redirect stderr to a file named after the job ID
#
# NOTE: Use %j as a placeholder for the job ID. Store log files in /data to avoid local
# storage in /tmp on compute nodes.

# Display the number of nodes allocated for the job
echo "Number of nodes: $SLURM_JOB_NUM_NODES"

# Run the hostname command for each allocated task to identify the node
for x in $(seq 1 $SLURM_NTASKS)
do
    srun -n 1 hostname &
done
wait

# Start 4 job steps. With 2 tasks requested and each job step requiring one task (-n 1),
# 2 job steps will run in parallel, and the others will queue until resources are available.
#
# IMPORTANT: Ensure job steps do not exhaust specific resources to enable parallel execution.
# Each job step is allocated 12 cores and 12 GB of memory by specifying -n 1. Mind that if you
# used --mem instead of --mem-per-cpu in the job definition, you would also have to specify
# the required amount of memory for each step.
for x in part1 part2 part3 part4
do
    srun -n 1 ./my_script "$x" &
done

# Wait for all background jobs to complete
wait
```
Option 1: Wrap each command with the required environments directly in your batch script:

```bash
#!/bin/bash
#SBATCH --time 30   # Request 30 minutes runtime
#SBATCH -c 2        # Request 2 CPU cores
#SBATCH --mem 10G   # Request 10 GB memory

# Command without using a software environment
echo "Job has started"

# Command using the MATLAB environment
MATLAB --version 9.10 matlab -nodesktop --some-parameter

# Command using the FREESURFER environment
FREESURFER freesurfer --some-parameter

# Another command using the MATLAB environment
MATLAB --version 9.10 matlab -nodesktop --some-other-parameter

# Combining SPM and MATLAB environments
SPM MATLAB some_command
```

Option 2: Call another script from your batch script and wrap it with the required environments. This is efficient for complex pipelines:
```bash
#!/bin/bash
#SBATCH --time 30   # Request 30 minutes
#SBATCH -c 2        # Request 2 CPU cores
#SBATCH --mem 10G   # Request 10 GB memory

# Execute another script and wrap it with all required environments.
# All executables and libraries of the given environments can then be used in the script.
SPM MATLAB --version 9.10 FREESURFER /path/to/my_script.sh
```

Contents of `my_script.sh`:

```bash
#!/bin/bash
# Commands and libraries utilizing the initialized environments
matlab -nodesktop --some-parameter
freesurfer --some-parameter
matlab -nodesktop --some-other-parameter
```
You can list the partitions, nodes, and their GPU resources and features with:

```
sinfo -N -o "%.12P %.12N %.4c %.10m %.30G %.20f %N"
```

Partial GPU requests are not supported. Examples for requesting GPUs at the institute:
```bash
# Please mind that the following parameters work for sbatch, srun, and salloc in the same way,
# and you will have to add generic parameters like --mem, --time, etc.

# Request a single GPU (on nodes with multiple GPUs, this restricts access to just one GPU)
sbatch --gpus 1 /path/to/myscript.sh

# Request two GPUs on the same node
sbatch --gpus 2 /path/to/myscript.sh

# Request a GPU and filter for servers having CUDA 12.0 installed
# HINTS:
# - The CUDA software environment can be used to initialize CUDA in your job.
# - Have a look at the "Use Software Environments" section on this page.
sbatch --gpus 1 --constraint cuda12.0 /path/to/myscript.sh

# Request a GPU which has at least 20,000 MB of memory
# HINTS:
# - The --gpus parameter is essential to actually allocate a GPU; the gpu_mem filter merely
#   specifies your memory requirement.
# - The filter does not limit the available amount of memory. If you get a GPU with 40 GB,
#   you can use all of it.
sbatch --gpus 1 --gres gpu_mem:20000M /path/to/myscript.sh

# Request a GPU with a specific architecture and a minimum amount of memory
# HINTS:
# - The GPUs are categorized by their architecture (turing, ampere, etc.).
# - You cannot filter for a specific GPU model (e.g. A40, RTX 2080, etc.).
# - Find more information about the --gres parameter here:
#   https://slurm.schedmd.com/srun.html#OPT_gres
sbatch --gpus ampere:1 --gres gpu_mem:20000M /path/to/myscript.sh
```
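Inside the job, you can verify which GPU was allocated. Slurm restricts the job to the requested GPUs, which is typically reflected in the `CUDA_VISIBLE_DEVICES` variable (a minimal sketch; the resource values are placeholders):

```bash
#!/bin/bash
#SBATCH --gpus 1    # Request a single GPU
#SBATCH --mem 1G    # Request 1 GB of memory
#SBATCH --time 10   # Set a runtime limit of 10 minutes

# Show which GPU index Slurm assigned to this job
echo "Visible GPUs: $CUDA_VISIBLE_DEVICES"

# List the GPU(s) visible to this job
nvidia-smi -L
```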
Groups with their own servers have a dedicated partition (e.g. `gr_weiskopf`), and the `group_servers` partition includes all group servers. To prioritize your jobs and minimize waiting time for resources, schedule your jobs on servers owned by your group using the corresponding partition. For even higher priority, include both the default (`all`) and `group_servers` partitions in your job definition; Slurm will then allocate resources from the quickest available partition. If there is no partition for your group, use the `all` and `group_servers` partitions to access group servers when available.

Note that jobs submitted through the `group_servers` partition have lower priority and time limits, but Slurm ensures rapid resource allocation across the requested partitions.
Usage examples:
```bash
# List available partitions
sinfo

# Submit job to run on a central IT server by default, as no other
# partitions are requested
sbatch /path/to/myscript.sh

# Submit job to any server in the cluster, including group-specific servers
sbatch -p all,group_servers /path/to/myscript.sh

# Submit job with priority on the department's nodes (e.g. gr_weiskopf) and
# access to central servers if available
sbatch -p all,group_servers,gr_weiskopf /path/to/myscript.sh
```
To monitor the GPU usage of a job, use the `nvidia-smi` tool. Start it at the beginning of your batch script and kill it upon completion; it will continuously log GPU metrics to a given file while your job is running.

This is how your batch script should look:
```bash
#!/bin/bash
#SBATCH --time 60   # Request 1 hour of runtime
#SBATCH --mem 1G    # Request 1 GB memory
#SBATCH --gpus 1    # Request a GPU

# Start nvidia-smi in the background to monitor the GPU, and store its process ID to kill it later.
# Please change the path "/path/to/nvidia_smi_output.csv".
# Depending on the duration of your task, you may want to adjust the sampling interval of 1 s
# (set with the parameter "-l 1").
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,temperature.gpu,memory.free,memory.used --format=csv -l 1 > /path/to/nvidia_smi_output.csv &
NVIDIASMI_PID=$!

#
# DO YOUR COMPUTATIONS HERE
#

# Terminate nvidia-smi
kill $NVIDIASMI_PID
```
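Afterwards you can, for instance, extract the peak GPU memory usage from the log (a sketch; it assumes the column order of the query above, where `memory.used` is the sixth field, and that all values share the same unit, MiB):

```bash
# Skip the CSV header, sort numerically by the memory.used column, and print the peak row
tail -n +2 /path/to/nvidia_smi_output.csv | sort -t, -k6 -n | tail -n 1
```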
```bash
#!/bin/bash
#SBATCH --tmp 1024G  # Request a temporary partition of minimum size 1024 GB
#SBATCH -c 1         # Request 1 CPU core
#SBATCH --mem 1G     # Request 1 GB of RAM
#SBATCH --time 60    # Set maximum job runtime to 60 minutes

# Slurm stores the path to the temporary storage in the environment variable $TMPDIR.
# Create a new directory for the current job within the temporary storage.
job_tmp_dir=$TMPDIR/$SLURM_JOB_ID
mkdir "$job_tmp_dir"

# Execute some work, passing the temporary directory as an argument.
srun ./my_script --tmp "$job_tmp_dir"

# Clean up the temporary data at the end of the job.
rm -fr "$job_tmp_dir"
```
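A common use of this local scratch space is to stage data: copy the input to the fast local disk, compute there, and copy only the results back to network storage before the job ends. A hypothetical `my_script` might look like this (the paths, file names, and the `process` command are illustrative, not from this page):

```bash
#!/bin/bash
# Hypothetical contents of my_script, invoked as: my_script --tmp <dir>
tmp_dir="$2"   # the directory passed after --tmp

# Stage the input on the local disk
cp /data/pt_12345/large_input.dat "$tmp_dir/"

# Compute on the local copy (process is a placeholder)
./process "$tmp_dir/large_input.dat" -o "$tmp_dir/result.dat"

# Copy only the result back to network storage
cp "$tmp_dir/result.dat" /data/pt_12345/
```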