Computer centres and batch systems

On Cosmos, there are 24+24 cores on each node, with 5.3 GB of memory per core.
ThinLinc login node: cosmos-dt.lunarc.lu.se
On Aurora, there are 20 cores on each node, with 62 GB of memory in total (3.1 GB/core).

On Abisko, cores instead come in multiples of 6. NB: You cannot run with fewer than 6 cores - if you run single-processor jobs, you will still be charged for 6 cores.
On Kebnekaise, there are 28 cores on each node.
However, in the large-memory queue, cores instead come in batches of 4 x 18 (72 per node). There is 42 GB of memory per core, 3072 GB in total.
The disk space on the nodes is very limited. On the largemem nodes, it is nominally around 391 GB in total, but in practice it seems to be only about 200 GB. This is shared between all cores on the node, without any rules, so the amount you can actually use is unpredictable.

On Tetralith, there are 32 (2 x 16) cores on each node.
Thin nodes have 3 GB of memory per core and around 210 GB of local disk in total.
Fat nodes have 12 GB of memory per core and around 874 GB of local disk in total.
There are also GPU nodes with 2 TB of local disk.
For compiling Fortran, load the buildenv-gcc module, but only when compiling, not when running.
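A minimal sketch of that workflow (the compiler flags and file name are placeholders; check the available versions with module spider buildenv-gcc):

module add buildenv-gcc             # only needed while compiling
gfortran -O2 -o myprog myprog.f90   # myprog.f90 is a placeholder source file
module purge                        # unload again before running or submitting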

On Dardel, there are 128 cores (2 x 64) on each node.
Thin nodes have 2 GB of memory per core (256 GB in total).
There are also large, huge and giant nodes with 4, 8 and 16 GB per core.
There are several partitions. To get less than a full node, use
#SBATCH -p shared
The maximum length of a job is 24 h, except in the long partition, where it is 168 h (but then you need to use a full node).
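A minimal sketch of a job-script header for the shared partition (the project name and resource numbers are placeholders; adjust them to your allocation):

#!/bin/sh
#SBATCH -A naiss202X-X-XXX      # placeholder allocation name
#SBATCH -p shared               # less than a full node
#SBATCH -n 16                   # number of cores
#SBATCH --mem-per-cpu=2000      # MB per core (thin nodes have about 2 GB/core)
#SBATCH -t 24:00:00             # 24 h is the maximum outside the long partition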

You log in by
kinit -f <user_name>@NADA.KTH.SE
ssh <user_name>@dardel.pdc.kth.se

If you have problems with similar jobs taking very different amounts of time, try to use
#SBATCH --exclusive
Then you get a node of your own (and you are charged for all the cores, i.e. 16 on Alarik).

snicquota - show your disk quota on Lunarc
quota - the corresponding command at HPC2N

kinit - get a new Kerberos ticket at HPC2N

Update Pocket Pass: https://lunarc-documentation.readthedocs.io/en/latest/authenticator_howto/#checking-the-validity-of-your-token
Note that you log in with your Lunarc user name and password, as when you log in to Aurora.

Go to: https://phenix3.lunarc.lu.se/selfservice/authenticate/unpwotp
Click on Tokens
Click on More
Click on Activate - it says success, but the status does not change from Expired.
I tried Activate Phenix Pocket Pass, but then a new token was installed.
After one day, some of the tokens worked; I do not know which.


How to start jobs in the queue

sbatch - submit a job
squeue - See queued jobs
scancel - kill a job
scontrol - get lots of information about a job
scontrol show jobid -dd <jobid>
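For example (the script name and job id are just placeholders):

sbatch job.sh                      # submit; prints the job id
squeue -u $USER                    # list your queued and running jobs
scontrol show jobid -dd 3216884    # detailed information about one job
scancel 3216884                    # kill it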

$SNIC_TMP - your folder on the temporary disk on the remote node
$SLURM_SUBMIT_DIR - the directory from which you submitted the job

scontrol show jobid 3216884 - gives estimated time for job to start

alias scjob="scontrol show jobid -dd"
cdjob() { 
        # Changes to the submission directory of a running job.
        # Arguments:
        #     $1: job number in SLURM queue
        direc=$(scontrol show jobid -dd "$1" | grep WorkDir | sed "s/^ *WorkDir=//")
        cd "${direc}"
}
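Usage (the job id is a placeholder):

cdjob 3216884    # cd to the WorkDir of that job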

Template sbatch file on Lunarc

#!/bin/sh
#SBATCH -n 1
#SBATCH -t 168:00:00

module add intel

export AMBERHOME=/lunarc/nobackup/projects/bio/Amber12
export TURBODIR=/lunarc/nobackup/projects/bio/TURBO/Turbo6.5
export CNS_SOLVE=/sw/pkg/bio/CNS/cns_solve_1.21
PATH=$AMBERHOME/exe:$PATH
PATH=$TURBODIR/scripts:$TURBODIR/bin/x86_64-unknown-linux-gnu:$PATH
PATH=$PATH:/sw/pkg/bio/Bin/Gfortran:/sw/pkg/bio/Bin:$HOME/Bin
export PATH

cd $SNIC_TMP                      # run on the node-local scratch disk
/bin/rm -r *                      # make sure the scratch directory is empty
cp -p $SLURM_SUBMIT_DIR/* .       # copy the input files there
jobex -backup -ri -c 800          # Turbomole geometry optimisation (RI, max 800 cycles)
cp -pu energy $SLURM_SUBMIT_DIR   # copy the energy file back to the submission directory


The same, with a memory request

#!/bin/sh
#SBATCH -n 1
#SBATCH -t 168:00:00
#SBATCH --mem-per-cpu 3900

module add intel

export AMBERHOME=/lunarc/nobackup/projects/bio/Amber12
export TURBODIR=/lunarc/nobackup/projects/bio/TURBO/Turbo6.5
export CNS_SOLVE=/sw/pkg/bio/CNS/cns_solve_1.21
PATH=$AMBERHOME/exe:$PATH
PATH=$TURBODIR/scripts:$TURBODIR/bin/x86_64-unknown-linux-gnu:$PATH
PATH=$PATH:/sw/pkg/bio/Bin/Gfortran:/sw/pkg/bio/Bin:$HOME/Bin
export PATH

cd $SNIC_TMP
/bin/rm -r *
cp -p $SLURM_SUBMIT_DIR/* .
for x in 1 2 3 4 5 6 7 8 9 10 ; do
 ln -fs coord-c"$x" coord                     # point coord to the x:th structure
 kdg restart                                  # remove the $restart data group from control
 ridft > $SLURM_SUBMIT_DIR/logd"$x"           # single-point RI-DFT calculation
 cp -p out.ccf $SLURM_SUBMIT_DIR/out"$x".ccf  # copy the output file back under a unique name
done
cp -pu energy $SLURM_SUBMIT_DIR


Large memory queue on Kebnekaise

#!/bin/sh
#SBATCH -t 168:00:00
#SBATCH -p largemem
#SBATCH -A SNIC2016-34-18


GPU jobs on Kebnekaise

#SBATCH --gres=gpu:k80:x

with x=1, 2, or 4.
For x=2, you are charged for 14 cores; for x=4, for 28 cores.

#SBATCH --gres=gpu:k80:x,mps
The mps option enables the Nvidia Multi-Process Service (MPS).
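A minimal sketch of a GPU job header (the project name is taken from the Kebnekaise example above and may need updating):

#!/bin/sh
#SBATCH -A SNIC2016-34-18      # project/allocation name
#SBATCH -t 24:00:00
#SBATCH --gres=gpu:k80:2       # two K80 GPUs; charged as 14 cores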

Modules

module add <module> - add module
module spider <module> - find out how to load a certain module with its dependencies
module purge  - remove all loaded modules
module help - get some help
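For example (the module name and version are hypothetical; spider tells you which prerequisite modules to load first):

module spider Amber              # list the available versions
module spider Amber/20           # show which modules must be loaded before this version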


Saving local files

If a job is terminated prematurely, for example if it exceeds the requested walltime, the files on the local disk (in $SNIC_TMP) will be lost. Files that would still be useful can be listed in a special file, $SNIC_TMP/pbs_save_files. Filenames are assumed to be relative to $SNIC_TMP and should be separated by spaces or listed on separate lines. These files will be copied to $PBS_O_WORKDIR regardless of whether the job ends as planned or is deleted, unless there is a problem with the disk or node itself. For parallel jobs, only files on the master node will be copied. Note that this feature is unique to Lunarc.
On Alarik, the corresponding file should instead be called $SNIC_TMP/slurm_save_files.
(MUL 12/9-12)
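A minimal sketch of how this could be used in a job script on Alarik (the file names follow the Turbomole examples above):

cd $SNIC_TMP
cp -p $SLURM_SUBMIT_DIR/* .
# Ask the system to rescue these files even if the job is killed at the walltime:
echo "energy out.ccf" > $SNIC_TMP/slurm_save_files
jobex -backup -ri -c 800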


Avoiding automatic restart of jobs in the queuing system

Some of you have noted that the queuing system on Aurora sometimes seemingly randomly restarts running jobs from the beginning again.

I just talked to Magnus about this and he said that you can avoid this behaviour by setting in the sbatch file 


#SBATCH --no-requeue


Then the job will instead die (he said the restarts are caused by communication problems with a certain node) and you have to restart it by hand, but sometimes, especially for long MD simulations, this is strongly preferred.

UR 28/10-16


Information on CPU usage
alarik: projinfo -y (last year); if projinfo (last month) shows that you are over the allocation, you get a low priority
platon: projinfo (but only for the last month)


checkjob


Problems with emacs fonts when running on Lunarc:

add (on your local machine)
emacs*font: 7x14
to ~/.Xdefaults
and execute
xrdb -merge ~/.Xdefaults

Valera 23/10-14


Interactive jobs on Akka
salloc -n 1 -t 1:00:00
(one core for one hour)
When it replies that a node is granted, you can start jobs on that node with srun:
srun command


How to avoid using a one-time password every time you log in to Aurora

add to .ssh/config:
host *
  ControlMaster auto
  ControlPath ~/.ssh/ssh_mux_%h_%p_%r

Then ssh to aurora, using your password and one-time password.
Do not close that shell; open another konsole, and ssh or scp will reuse the existing connection.

VV 15/11-17




Obsolete

Platon:
qsub
qstat
qdel
$PBS_O_LOCAL
$PBS_O_WORKDIR

On Alarik, there are 16 cores on each node (8 on each processor)


Template qsub file on Platon
#!/bin/sh
#PBS -l nodes=1
#PBS -l walltime=70:00:00
#PBS -j oe

. use_modules
#module add intel/11.1
module add intel
#module add openmpi/1.4.1/intel/11.1

export AMBERHOME=/sw/pkg/bio/Amber10
export TURBOMOLE_SYSNAME=x86_64-unknown-linux-gnu
export TURBODIR=/sw/pkg/bio/TURBO/Turbo6.5
export CNS_SOLVE=/sw/pkg/bio/CNS/cns_solve_1.21
PATH=$AMBERHOME/exe:$PATH
PATH=$TURBODIR/scripts:$TURBODIR/bin/x86_64-unknown-linux-gnu:$PATH
PATH=$PATH:/sw/pkg/bio/Bin/Gfortran:/sw/pkg/bio/Bin:$HOME/Bin
export PATH

cd $PBS_O_LOCAL
#/bin/rm -r *
cp -p $PBS_O_WORKDIR/* .
jobex -backup -ri -c 800
cp -pu * $PBS_O_WORKDIR




Set up a new user on the computer centres

  1. New user: Go to  https://supr.naiss.se and register new person.

  2. Ulf: Log in to SUPR and add user to the project.

  3. New user: Fill in the form (in SUPR) and send it in with a copy of your passport.

  4. New user: Go to SUPR and apply for accounts at Lunarc and the other computer centres.