Queueing System

Programs on Soroban are not usually started interactively, but rather sent as a job to a so-called queueing system. The system used here is Slurm (Simple Linux Utility for Resource Management). The queueing system is configured so that resources are shared fairly amongst all users. This means that jobs belonging to users who have not consumed much CPU time in the recent past will tend to start before jobs belonging to users who have been more active. This approach is known as fairshare scheduling.

Basic Commands

The most important commands are:
sbatch         submit a job script, e.g. sbatch myscript.sh
scancel        cancel a job, e.g. scancel 123
sinfo          show available partitions and nodes, e.g. sinfo -l
squeue / smap  show jobs in the queue, e.g. squeue -u myusername
You can find more information via the corresponding man pages, e.g. man sbatch. Note that a job script must be executable - this can be achieved with chmod u+x myscript.sh.
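
A typical session might look like this (the job ID, partition, and node name shown are purely illustrative):

$ sbatch myscript.sh
Submitted batch job 123456
$ squeue -u myusername
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 123456      main my_seria myuserna  R   0:42      1 node005
$ scancel 123456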

Example Batch Scripts

General comments

Work should generally be done on the scratch file system. Only the results that need to be kept after completion should be created on or copied to the home file system. More information on the file systems can be found here.

Lines which begin with #SBATCH contain control information for the queueing system. For the shell these are just comments and are ignored. You can also use an arbitrary scripting language (csh, ksh, Python, Perl, ...) as long as a line beginning with #SBATCH is a valid comment in that language.

Simple serial job

This is a very simple example of a serial job. The job name given appears in the NAME field in the output of squeue. An email will be sent to the address given when the job completes.

#!/bin/bash

#SBATCH --mail-user=username@zedat.fu-berlin.de # replace with your own address
#SBATCH --job-name=my_serial_job                # replace the name 
#SBATCH --mail-type=end
#SBATCH --mem=2048                              # replace with amount suitable for your job
#SBATCH --time=08:00:00                         # replace with amount suitable for your job

cd /scratch/username                            # replace with your own directory
serial_prog > serial.out                        # replace with your own program

The option --mem specifies the maximum amount of memory required in MB per node. This should always be given, since the default value is only 1 MB per CPU, which will not be enough for many jobs.

The option --time specifies the maximum amount of wall-clock time required by the job. If this is not given, the default value for the partition will apply and the job will not be able to benefit from backfilling.
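
After a job has finished, the resources it actually consumed can be checked with the accounting command sacct, provided job accounting is enabled; this helps refine the --mem and --time values for future runs (replace 123456 with your own job ID):

$ sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,State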

Simple MPI parallel job

This is a very simple example of an MPI parallel job. The number of tasks to be run must be given via --ntasks. In addition, the module corresponding to the MPI implementation used to compile the program must be loaded.

#!/bin/bash

#SBATCH --mail-user=username@zedat.fu-berlin.de # replace with your own address
#SBATCH --job-name=my_parallel_job              # replace name
#SBATCH --mail-type=end
#SBATCH --ntasks=32                             # replace with amount suitable for your job
#SBATCH --mem-per-cpu=4096                      # replace with amount suitable for your job
#SBATCH --time=08:00:00                         # replace with amount suitable for your job

module load gcc openmpi/gcc                     # replace with module suitable for your job

cd /scratch/username                                  # replace with your directory
mpirun -np $SLURM_NTASKS parallel_prog > parallel.out # replace with your program

The memory requirement is normally given via --mem-per-cpu. It is also possible to specify the maximum amount of memory required per node in MB via the option --mem. The memory needed should always be given, as otherwise the default value will come into effect, as described above. The variable $SLURM_NTASKS automatically contains the value specified by --ntasks.
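
Instead of specifying only the total number of tasks, the layout across nodes can also be given explicitly. For example, the following (equivalent, illustrative) combination requests 32 tasks spread over 4 nodes with 8 tasks on each node:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8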

Simple SMP parallel job

This is a very simple example of an SMP parallel job. The number of tasks required for the job must be given.

#!/bin/bash

#SBATCH --mail-user=username@zedat.fu-berlin.de # replace with your address
#SBATCH --job-name=my_smp_parallel_job          # replace name
#SBATCH --mail-type=end
#SBATCH --ntasks=8                              # replace with amount suitable for your job
#SBATCH --nodes=1-1
#SBATCH --mem-per-cpu=4096                      # replace with amount suitable for your job
#SBATCH --time=08:00:00                         # replace with amount suitable for your job

cd /scratch/dummyuser                           # replace with your directory
smp_prog > parallel.out                         # replace with your program

Note that the number of nodes required is explicitly set to one via --nodes=1-1. Otherwise the cores assigned to the job may be on multiple nodes and thus not all accessible to the job.
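
For programs parallelised with threads (e.g. OpenMP) rather than separate processes, an alternative is to request a single task with several CPUs and pass the CPU count on to the program. A minimal sketch (the program name is a placeholder):

#!/bin/bash

#SBATCH --job-name=my_openmp_job                # replace name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8                       # replace with amount suitable for your job
#SBATCH --mem-per-cpu=4096                      # replace with amount suitable for your job
#SBATCH --time=08:00:00                         # replace with amount suitable for your job

cd /scratch/username                            # replace with your directory
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK     # tell OpenMP how many threads to use
openmp_prog > parallel.out                      # replace with your program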

Job array

For a large number of jobs which can be parameterised by a single integer, job arrays can be used:

#!/bin/bash

#SBATCH --mail-user=username@zedat.fu-berlin.de # replace with your own address
#SBATCH --job-name=my_array_job                 # replace the name 
#SBATCH --mail-type=end
#SBATCH --mem=2048                              # replace with amount suitable for your job
#SBATCH --time=08:00:00                         # replace with amount suitable for your job

cd /scratch/username                            # replace with your own directory
my_prog ${SLURM_ARRAY_TASK_ID} > my_prog.out    # replace with your own program

A job array is submitted in the following manner:

sbatch --array=1-20 my_script.sh

For each value specified by the --array option, the script is executed once with the variable ${SLURM_ARRAY_TASK_ID} set to that value. The maximum array size is 3001.
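
The task ID can also be used to select, for example, an input file. A minimal sketch, assuming a file input_files.txt whose n-th line contains the name of the n-th input file:

cd /scratch/username                                       # replace with your own directory
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input_files.txt)  # pick line number SLURM_ARRAY_TASK_ID
my_prog "$INPUT" > "my_prog_${SLURM_ARRAY_TASK_ID}.out"    # one output file per array task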

Multistep serial job

Sometimes a job will consist of several steps which have to be carried out one after another in a chain. The following Perl script can be used for this case.

#!/usr/bin/perl

use strict;
use warnings;

my $script_file = '~/chain/script.sh';           # Define Slurm script to be
my $log_file    = '~/chain/out.log';             # run and common output file

my @parameter_sets = ("first $log_file",         # Define the parameter sets to
                      "second $log_file",        # be passed for each step
                      "third $log_file");

my $job_id   = -1;                               # Initialize job ID variable

# Main loop over parameter sets
#
for (my $i = 0; $i < scalar(@parameter_sets); $i++) {

    my $cmd = 'sbatch';                          # Construct call to sbatch
    if ($job_id > 0) {                           # and add dependency
        $cmd .= " --dependency=afterOK:$job_id";
    }
    $cmd .= " $script_file $parameter_sets[$i] $i 2>&1";

    my $output = `$cmd`;                         # Submit job and abort if no
    ($output =~ /^Submitted batch job (\d+)/)    # job ID is returned
      or die "Error executing $cmd";

    $job_id = $1;
}

The above Perl script is executed interactively; it submits the individual steps as jobs to the queueing system, which then runs them one after another.
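
The same chain can also be built directly in the shell by extracting the job ID from the "Submitted batch job" line that sbatch prints. A minimal sketch, reusing the script and log file names from the Perl example above:

#!/bin/bash

JOB_ID=$(sbatch ~/chain/script.sh first ~/chain/out.log 0 | awk '{print $4}')
JOB_ID=$(sbatch --dependency=afterok:$JOB_ID ~/chain/script.sh second ~/chain/out.log 1 | awk '{print $4}')
JOB_ID=$(sbatch --dependency=afterok:$JOB_ID ~/chain/script.sh third ~/chain/out.log 2 | awk '{print $4}')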

Memory

As mentioned above, the amount of memory a job requires must be given via the --mem-per-cpu (memory per CPU) or --mem (memory per node) option. The total memory per node is given in the table below:

Nodes                              Number of nodes   Memory per node
node003-node042                    40                24 GB
node001-node002, node043-node100   60                48 GB
node101-node112                    12                96 GB

However, because the nodes are diskless, the entire operating system has to be held in memory, so the actual amount of memory available to users is less. This value is displayed as RealMemory by the command scontrol show node <node name>, e.g.

$ scontrol show node node003
NodeName=node003 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=7 CPUErr=0 CPUTot=12 Features=(null)
   Gres=(null)
   OS=Linux RealMemory=18000 Sockets=2
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2013-07-19T03:21:29 SlurmdStartTime=2013-07-19T03:39:31
   Reason=(null)
$ scontrol show node node101
NodeName=node101 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=11 CPUErr=0 CPUTot=12 Features=bigmem
   Gres=(null)
   OS=Linux RealMemory=90000 Sockets=2
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=2
   BootTime=2013-06-28T04:47:24 SlurmdStartTime=2013-06-28T04:57:41
   Reason=(null)
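
The nodes with 96 GB of memory carry the feature bigmem (see the Features field in the output above). A job which needs one of these nodes can request the feature explicitly via the --constraint option, e.g.

#SBATCH --constraint=bigmem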

The importance of estimating memory requirements accurately is explained here.

Partitions

The cluster is divided into several so-called partitions. A partition is a group of nodes characterised by their hardware. The available partitions on Soroban can be listed via the command sinfo.

The partition can be set by using the following in the submit script:

#SBATCH --partition=gpu
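
The output of sinfo can be customised via the -o option; for example, the following call lists each partition together with its time limit and number of nodes:

$ sinfo -o "%P %l %D"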

Quality of Service

A quality of service, or QOS, is a set of parameters which are associated with a job. The QOSs available, together with the properties they define, can be seen via the command sqos.

A QOS can increase the priority of a job, but this will be offset by a reduction in the maximum run-time of the job. In addition, only a certain number of jobs per user can be run or submitted in a given QOS.

The QOS can be set by using the following in the submit script:

#SBATCH --qos=medium

Thus it is possible, for example, to start a test job which runs on a large number of nodes in the partition 'main' with the QOS 'short'. Please note that a limit, such as TimeLimit, set by the QOS will override that set by the partition.
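
For the example just mentioned, the corresponding lines in the job script would be:

#SBATCH --partition=main
#SBATCH --qos=short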

Interactive Jobs

Interactive jobs within the queueing system can be run using the command srun; for example, a shell can be started with the following command:

$ srun --ntasks=1 --time=00:30:00 --mem=1000 bash

A console-based MATLAB session can be started with the following commands:

$ module add matlab
$ srun --licenses=matlab_MATLAB --partition=interactive --qos=short --reservation=licenses_matlab --mem=4000 --pty matlab -nodesktop -nosplash

For graphical applications, the option --x11 must be used, e.g. an interactive job using the MATLAB graphical interface can be started with the following commands:

$ srun --partition=interactive --qos=short --reservation=licenses_matlab --licenses=matlab_MATLAB --ntasks=1 --cpus-per-task=1 --time=00:30:00 --mem=4000 --pty --x11 bash
$ module add matlab
$ matlab

To run multiple programs in parallel, the resources must first be reserved using the command salloc, e.g.

$ salloc --partition=interactive --qos=short --ntasks=4 --nodes=1-1 --time=00:30:00 --mem=4000

Then, say, two jobs, job01.sh and job02.sh, requiring one and three cores, respectively, can be started in parallel in the following manner:

$ srun --partition=interactive --qos=short --ntasks=1 job01.sh &
$ srun --partition=interactive --qos=short --ntasks=3 job02.sh &