The Holland Computing Center (HCC) provides the primary high-performance computing resources for the J. Yang lab. This document serves as a reference for the most commonly used commands and workflows for the SLURM workload manager. Much of the text is borrowed from the Ross-Ibarra Lab wiki and blended with HCC documentation.

Quick Introduction to Computer Clusters

The HCC cluster we use, Crane, has a head node, which controls the cluster, and compute nodes, which are where the action happens. Crane runs a cluster workload management system called Slurm. For the most part, you interact with Crane through scripts that launch jobs on the compute nodes; you DO NOT run processes on the head node. The only tasks that are acceptable on the head node are:

  • Downloading and transferring files (with wget or scp)
  • Compiling/building files
  • Installing R packages
  • Submitting or checking on jobs

File systems on Crane

[Figure: overview of the Crane file systems. Key: User, Group, and Entire system.]

$HOME:

  • $HOME directories are backed up daily.
  • You can read and write.
  • But the size is small (20GB per user).
  • Normally used for configuration files, user-defined functions, and user-installed software packages.

$WORK:

  • $WORK is large, 50TB per user.
  • NOT backed up
  • But a purge policy exists on /work; see the HCC purge policy documentation.
  • Use it for computing and active work, but DO NOT use it to store raw data.

$COMMON:

  • New file system. 1 TB per user for free.
  • No purge policy.
  • No backups are made! Don’t be silly!
  • Used to store things (e.g., code and git repos) that are routinely needed on multiple clusters.

Note: To gain access to this path on the worker nodes, a job must request the common license, either as a batch directive or as an srun flag:

#SBATCH --licenses=common
srun --licenses=common
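
For example, a batch job that reads data from $COMMON might start like this (a minimal sketch; the job name, time limit, and project path are placeholders):

#!/bin/bash -l
#SBATCH -J common-demo          # job name (placeholder)
#SBATCH -t 1:00:00
#SBATCH --licenses=common       # required to see /common on the compute node

ls $COMMON/your-cool-project/   # hypothetical project directory under $COMMON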

Connecting to Crane

  • Get an account here; ask Jinliang for our group ID.
  • Set up Duo two-factor authentication.
  • Connect to Crane using SSH.

SSH Config

Make your life a little easier by adding the following to ~/.ssh/config:

Host crane
    HostName crane.unl.edu
    User username

Replace username with your actual username. This will allow you to SSH to Crane in the future with just ssh crane.
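
The alias also works for file transfers with scp, which is handy for moving data onto the cluster (the file name and destination path below are placeholders; point it at your own $WORK directory):

$ scp data.txt crane:/work/jyanglab/username/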

Getting to Know Slurm

Slurm is a job management system: you submit jobs via batch scripts. These batch scripts have common headers; we will see an example below.

First, we can get a sense of our lovely cluster with sinfo:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
hi           up   infinite      4 drain* c8-[25,39-40],c9-35
hi           up   infinite     64    mix c8-[22-24,30-38,45,54-58,60-61,84-87]
hi           up   infinite      6   idle c8-[42-44,59],c9-[94,97]
serial       up   infinite      2  down* c10-[12,40]
serial       up   infinite     15    mix c10-[8-9,11,13-22,41-42]
serial       up   infinite     18   idle c10-[10,23-39]
bigmeml      up   infinite      1  mixed bigmem2
bigmeml      up   infinite      5    mix bigmem[1,3-6]
bigmemm      up   infinite      1  mixed bigmem2
bigmemm      up   infinite      5    mix bigmem[1,3-6]
bigmemh      up   infinite      1  mixed bigmem2
bigmemh      up   infinite      5    mix bigmem[1,3-6]
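
sinfo can also be filtered to a single partition, which is handy when you only care about our own nodes (jyanglab is used here as an example, matching the partition mentioned later in this document):

$ sinfo -p jyanglab            # show only the jyanglab partition
$ sinfo -p jyanglab -t idle    # show only its idle nodes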

Here we see our PARTITION and TIME LIMIT, and all of their inferior but still useful friends.

Note that there is a STATE column, which indicates the state of each machine. A better way of looking at what's going on on each machine is squeue, which shows the job queue.

$ squeue
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       1541913   bigmemh some_job someone  R 6-04:27:02      1 bigmem6
       1613530   bigmemm some_job someone PD       0:00      1 (Resources)
       1544863   bigmeml some_job someone  R 5-10:59:32      1 bigmem2
       1472908        hi  some_job someone  R 5-22:16:23      1 c8-22
       1477386    serial some_job someone  R 14-03:19:29      1 c10-11

This shows each job ID (very important), the partition the job is running on, and the user running it. Also note TIME, which is how long a job has been running.

This queue is very important: it can tell us who is running what where, and how long it’s been running. Also, if we realize that we’re accidentally doing something silly like mapping maize reads to the human genome, we can use squeue to find the job ID, allowing us to cancel a job with scancel. Let’s kill jyang21’s job:

$ scancel JOBID
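
To find the job ID in the first place, you can restrict squeue to a single user and then cancel by ID (the job ID below is a placeholder):

$ squeue -u jyang21      # list only jyang21's jobs
$ scancel 1234567        # cancel the offending job by its ID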

It’s that easy! Slurm is pretty boring so far; all we can do is look at the cluster and try to kill jobs. Let’s see how to submit jobs.

An Example Slurm Batch Script Header

We wrap our jobs in little batch scripts, which is nice because these also help make steps reproducible. We'll see how to write batch scripts for Slurm in the next section, but suppose we have one called steve.sh. To keep your directory organized, I usually keep a scripts/ directory (or even a slurm-scripts/ directory if you have lots of other little scripts).

In each project directory, I make a directory called slurm-log for Slurm's logs. Tip: use these logs; they are very helpful for debugging. I keep them separate from the rest of my project because they fill up directories rather quickly.

Let’s look at an example batch script header for a job called steve (which is run with script steve.sh) that’s in a project directory named your-cool-project (you’re going to change these parts).

#!/bin/bash -l
#SBATCH -D /home/vince251/projects/your-cool-project/
#SBATCH -o /home/vince251/projects/your-cool-project/slurm-log/steve-stdout-%j.txt
#SBATCH -e /home/vince251/projects/your-cool-project/slurm-log/steve-stderr-%j.txt
#SBATCH -J steve
#SBATCH -t 24:00:00
set -e
set -u

# insert your script here

  • -D sets your project directory.
  • -o sets where standard output (of your batch script) goes.
  • -e sets where standard error (of your batch script) goes.
  • -J sets the job name.
  • -t sets the time limit for the job, 24:00:00 indicates 24 hours.

Note that the programs in your batch script can redirect their output however they like (something you will likely want to do); -o and -e capture the standard output and standard error of the batch script itself.

Also note that these directories must already exist; Slurm will not create them for you. If they don't exist, sbatch will fail silently (since there's no place to write standard error). So if you keep trying something and no errors get logged, make sure all of these directories exist.

As mentioned, the job name is how you distinguish your jobs in squeue. If we ran this, we'd see "steve" in the NAME column.

The time limit for the job should be greater than your estimate of how long it will take; time-and-a-half or twice your estimate are good rules of thumb. If your job reaches this limit, it will be killed, and it's frustrating to lose a job because you underestimated the time. Alternatively, you can set the limit with the --time flag instead of -t, e.g. --time=1-00:00 sets a time limit of one day.
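
Once the header and your actual commands are in place, submitting and checking the job looks like this (assuming the script lives in the scripts/ directory described above):

$ sbatch scripts/steve.sh    # prints "Submitted batch job <jobid>"
$ squeue -u $USER            # "steve" should appear in the NAME column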

An example script

Try running this test script:

#!/bin/bash -l
#SBATCH -D /home/USERNAME
#SBATCH -J bob
#SBATCH -o /home/USERNAME/out-%j.txt
#SBATCH -e /home/USERNAME/error-%j.txt
#SBATCH -t 24:00:00
#SBATCH --array=0-8

bob=( 1 1 1 2 2 2 3 3 3 )
sue=( 1 2 3 1 2 3 1 2 3 )

block=${bob[$SLURM_ARRAY_TASK_ID]}
min=${sue[$SLURM_ARRAY_TASK_ID]}

echo "$block is $min" 

Make sure you substitute your own user name for USERNAME. You should see a bunch of error-*.txt and out-*.txt files show up in your home directory. Try launching it with plain sbatch and again with sbatch -p jyanglab to make sure you have access to both queues.
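
Each array task shows up in squeue with its index appended to the job ID, and tasks can be cancelled individually or all at once (the job ID below is a placeholder):

$ squeue -u $USER       # array tasks appear as 1234567_0, 1234567_1, ...
$ scancel 1234567_3     # cancel only task 3 of the array
$ scancel 1234567       # cancel the whole array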

Using R with Slurm

Often, we need to work with R interactively on a server. To do this, we use srun with the following options:

$ srun --nodes=1 --mem 4G --ntasks=4 --licenses=common --time=8:00:00 --pty bash

This will drop you into an interactive shell on a compute node (add -p/--partition to choose a partition), from which you can load and run R. --pty runs srun in pseudo-terminal mode, which makes the shell (and R) behave as they would on your local machine. With srun you still have to set a time limit.
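
Once the interactive shell starts on the compute node, load R through the module system and launch it there (the exact module name and version may differ on Crane; check with module avail R):

$ module load R    # module name/version may differ
$ R                # starts an interactive R session on the compute node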

If you want to render R plots to your local computer through X11, here is a solution:

$ srunx your-partition

Make sure your X11 server is running before you type the command.

Using ssh -Y with X11 may seem like a good idea, but it does not specify a partition to work interactively in, so you will end up running things on the head node. Bad idea.

Warnings

Do not run anything on the head node except cluster management tools (squeue, sbatch, etc.), compilation tasks (but usually ask CSE help for big apps), or downloading files. If you run anything else on the head node, you will disgrace your lab. How embarrassing is this? Imagine if you had to give your QE dressed up like Richard Simmons. It's that embarrassing.

Monitor your disk space, as it can fill up quickly.
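
A quick way to see how much you are using (these are the environment variables the cluster sets for each file system; the second command sorts your $WORK subdirectories by size):

$ du -sh $HOME $WORK $COMMON    # total usage of each file system
$ du -sh $WORK/* | sort -h      # find the biggest directories in $WORK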