The Queueing System

Introduction

The queueing system used on Abacus4 is IBM LoadLeveler. A program is submitted as a job to a queue and is started by LoadLeveler when enough resources become available. Abacus4 has several queues, which LoadLeveler calls classes. The class must be specified in the job control file (see below). Resources are shared among users via a mechanism known as fairshare scheduling. Intermediate steps of a calculation can be saved by means of checkpointing.

Starting Jobs

Most jobs are submitted to the batch system with the command llsubmit.

The only exception is jobs that use the computational chemistry program GAUSSIAN. These are started with the command subg09. More information about GAUSSIAN can be found here.

In order to run a batch job, a control script must be created which contains all the necessary control information:
    #!/usr/bin/tcsh
    # @ job_type = serial
    # @ arguments = 
    # @ input = 
    # @ output = /work/$(user)/ll$(jobid).stout
    # @ error = /work/$(user)/ll$(jobid).sterr
    # @ initialdir = /work/$(user)
    # @ notify_user = MY_E-MAIL@zedat.fu-berlin.de
    # @ class = medium
    # @ shell = /bin/tcsh
    # @ step_name = step1
    # @ notification = complete
    # @ checkpoint = yes
    # @ ckpt_dir = /scratch/$(user)/CPR
    # @ node_usage = shared
    # @ large_page = N
    # @ bulkxfer = no
    # @ smt = as_is
    # @ resources = ConsumableCpus(1) ConsumableMemory(1024 mb)
    # @ queue
    cd /scratch/$USER
    /work/$USER/my_prg  < my_input > my_output

The shell used in this file could also be ksh or perl. The shell parser ignores all lines which begin with # @, whereas LoadLeveler ignores all other lines.
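
The split between the two readers of the file can be seen by filtering a job control file. The following sketch, written in POSIX shell (which also runs under ksh), writes a small sample file and separates the lines LoadLeveler reads from the ones the shell executes. The helper names ll_directives and shell_commands are made up for illustration, and the patterns only approximate LoadLeveler's own parsing (for example, continuation lines are not handled).

```shell
#!/bin/sh
# Illustration only: approximate which lines each reader of a job file sees.

# Lines LoadLeveler interprets: "#", optional blanks, then "@".
ll_directives() {
    grep -E '^[[:space:]]*#[[:space:]]*@' "$1"
}

# Lines the shell executes: everything that is neither blank nor a comment.
shell_commands() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Write a shortened sample job file to a temporary location.
jobfile=$(mktemp)
cat > "$jobfile" <<'EOF'
#!/usr/bin/tcsh
# @ job_type = serial
# @ class = medium
# @ queue
cd /scratch/$USER
/work/$USER/my_prg < my_input > my_output
EOF

echo "LoadLeveler sees:"
ll_directives "$jobfile"
echo "The shell runs:"
shell_commands "$jobfile"
rm -f "$jobfile"
```

Running ll_directives on your own job file before submitting it is a quick way to spot a keyword line that was accidentally written without the leading # @.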

Please write your user name explicitly in the notify_user line, because $(user)@zedat.fu-berlin.de does not work.

The parameters ConsumableCpus and ConsumableMemory should be chosen carefully for the specific problem.
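
When choosing these values it helps to know what a node has to provide in total. Assuming that the resources keyword is counted per task (this is how LoadLeveler is generally documented, but it is an assumption here and not stated on this page), the request per node is the per-task value multiplied by the number of tasks on that node. A small sketch of that arithmetic, with node_request as a hypothetical helper name:

```shell
#!/bin/sh
# Assumption: ConsumableCpus and ConsumableMemory are counted per task.
# Usage: node_request CPUS_PER_TASK MB_PER_TASK TASKS_PER_NODE
node_request() {
    cpus=$(( $1 * $3 ))
    mem=$(( $2 * $3 ))
    echo "per node: ${cpus} CPUs, ${mem} mb"
}

# Serial job as in the example above: a single task.
node_request 1 1024 1
# A parallel job with tasks_per_node = 4 and the same per-task values.
node_request 1 1024 4
```

Under this assumption, four tasks with ConsumableMemory(1024 mb) each need 4096 mb free on the node before the job can start.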

Users who wish to run parallel programs should proceed as follows (e.g. for 4 processes):
   #!/usr/bin/tcsh
   # @ job_type = parallel
   # @ arguments = $(jobid)
   # @ output = /work/$(user)/ll$(jobid).stout
   # @ error = /work/$(user)/ll$(jobid).sterr
   # @ initialdir = /work/$(user)/
   # @ notify_user = MY_E-MAIL@zedat.fu-berlin.de
   # @ class = medium
   # @ shell = /bin/tcsh
   # @ notification = complete
   # @ checkpoint = yes
   # @ ckpt_dir = /scratch/$(user)/CPR
   # @ node = 1
   # @ tasks_per_node = 4
   # @ node_usage = shared
   # @ network.mpi = sn_all,,us
   # @ smt = as_is
   # @ resources = ConsumableCpus(1) ConsumableMemory(1024 mb)
   # @ environment = MEMORY_AFFINITY=MCM; MP_DEVTYPE=ib; \
   #                 MP_SHARED_MEMORY=yes; MP_WAIT_MODE=poll; \
   #                 MP_SINGLE_THREAD=yes; MP_TASK_AFFINITY=MCM
   # @ queue
   cd /scratch/$USER
   cp $LOADL_HOSTFILE hostfile.$1
   setenv MYPROCS `wc -l < hostfile.$1`
   unsetenv LOADLBATCH
   
   poe /work/$USER/my_mpi_prg -procs $MYPROCS -hostfile hostfile.$1
   rm hostfile.$1
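
The number of processes handed to poe is taken from the host file, which the script treats as holding one line per task. As a sketch of that counting step in POSIX shell (the temporary file and the host name a41 are stand-ins, not real LoadLeveler output): reading the file through input redirection makes wc print only the number, whereas naming the file as an argument makes it print the file name as well.

```shell
#!/bin/sh
# Stand-in for $LOADL_HOSTFILE: four tasks on one (made-up) host.
hostfile=$(mktemp)
printf 'a41\na41\na41\na41\n' > "$hostfile"

# count_tasks FILE: print the number of lines, without file name or padding.
count_tasks() {
    wc -l < "$1" | tr -d '[:space:]'
}

nprocs=$(count_tasks "$hostfile")
echo "procs: $nprocs"
rm -f "$hostfile"
```

The tr call removes the leading blanks that some wc implementations add, so the result can be used directly as a numeric argument.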

Terminating Jobs

To kill a job use llcancel, e.g.
   llcancel a41.12345.0

Monitoring Jobs

You can inspect your jobs, the amount of CPU time consumed and the resources used, such as memory, by using the command
   llq
or
   llq -l a41.12345
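
The identifiers in these examples appear to follow the pattern host.job.step: llq -l is given a41.12345, while llcancel is given a41.12345.0 with a trailing step number. Under that assumption, the following sketch splits an identifier into its parts. split_jobid is a hypothetical helper, not a LoadLeveler command, and it assumes the host part contains no dots.

```shell
#!/bin/sh
# Assumption: identifiers look like HOST.JOB or HOST.JOB.STEP, HOST without dots.
split_jobid() {
    id=$1
    host=${id%%.*}        # text before the first dot
    rest=${id#*.}         # text after the first dot
    job=${rest%%.*}       # job number
    case $rest in
        *.*) step=${rest#*.} ;;      # explicit step number
        *)   step="not given" ;;     # identifier names the job only
    esac
    echo "host=$host job=$job step=$step"
}

split_jobid a41.12345.0
split_jobid a41.12345
```

Such a helper can be handy in wrapper scripts that need the job number alone, for instance to build the names of the ll$(jobid) output files.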