Table of Contents

Quickstart

Job Submission – sequential tasks

$ sbatch sample_script.sh
$ cat sample_script.sh
#!/bin/bash -l
# All options below are recommended
#SBATCH -p general # run on partition general
#SBATCH --cpus-per-task=32 # 32 CPUs per task
#SBATCH --mem=100GB # 100GB of memory per node
#SBATCH --gpus=4 # 4 GPUs
#SBATCH --mail-user=bulls@usf.edu # email for notifications
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE # events for notifications
srun python first_script.py #1st task
srun bash second_script.sh # 2nd task
srun julia something.jl # 3rd task
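sbatch prints the assigned job ID when the job is submitted. By default (unless you override it with -o), the job's stdout and stderr are written to slurm-<jobID>.out in the directory you submitted from, so a quick way to inspect the results is:

$ cat slurm-<jobID>.out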

Note: the -p (partition) and -w (node) options must match the partitions and nodes listed by the sinfo command. E.g., -p Extended -w GPU17 would not work because GPU17 is not in partition Extended.

Important!!!!: Please do not place #SBATCH options after any Linux command (such as cd, srun, etc.). Slurm stops parsing #SBATCH options after it sees the first Linux command.
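For illustration, in the hypothetical script below the --gpus option would be silently ignored because it appears after the cd command:

#!/bin/bash -l
#SBATCH -p general   # parsed: appears before any command
cd $HOME/project     # first Linux command; Slurm stops parsing #SBATCH options here
#SBATCH --gpus=4     # IGNORED: placed after a command
srun python first_script.py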

Job Submission – parallel tasks

$ sbatch sample_script.sh
$ cat sample_script.sh
#!/bin/bash -l
#SBATCH --cpus-per-task=32 # 32 CPUs per task
#SBATCH --mem=100GB # 100GB of memory per node
#SBATCH --gpus-per-task=4 # 4 GPUs per task
#SBATCH --ntasks=2 # specify 2 parallel tasks
srun --ntasks=1 --cpus-per-task=16 --exact python first_script.py &
srun --ntasks=1 --cpus-per-task=16 --exact bash second_script.sh &
wait
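Once the job completes, and if job accounting is enabled on the cluster, you can verify that both steps ran side by side by comparing their elapsed times with sacct (standard Slurm accounting fields shown):

$ sacct -j <jobID> --format=JobID,JobName,Elapsed,State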

Job Submission – Priority and Partitions

Slurm has a notion of “Partitions”, which determine which GAIVI nodes your job can run on and what priority it has. Use “sinfo” to see which partitions you have access to, e.g.

$ sinfo
PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
ScoreLab     up   infinite     1  idle GPU6
general*     up 7-00:00:00     2  idle GPU[6,8]

By default, jobs are submitted to the general partition. To submit to a different partition, include the “-p” option in your submission. For example, in an sbatch script:

#SBATCH -p ScoreLab

(visit the corresponding page of the Job Submission section for more detail on the partitions in this cluster)
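The partition can also be chosen on the command line at submission time, which overrides the value in the script. A minimal example:

$ sbatch -p ScoreLab sample_script.sh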

View pending and running jobs

squeue
squeue -j [jobID]
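squeue also accepts standard filter flags; for example, to list only your own jobs or only jobs in a given partition:

squeue -u $USER      # only your jobs
squeue -p general    # only jobs in the general partition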

Cancel a job

scancel [jobID]
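scancel can also target jobs in bulk using its standard filters, e.g.:

scancel -u $USER                    # cancel all of your jobs
scancel -u $USER --state=PENDING    # cancel only your pending jobs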

Checking compute nodes for all resources

sinfo -N --format="%10N | %10t | %7X | %7Y | %7Z | %9m | %19G"
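For reference, the format specifiers used above correspond to the following standard sinfo fields:

# %N  node name            %t  node state
# %X  sockets per node     %Y  cores per socket
# %Z  threads per core     %m  memory per node (MB)
# %G  generic resources (GRES), e.g. GPUs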

Checking compute nodes for current consumed resources

First create an interactive bash shell to a running job:

srun --pty --jobid <jobID> --interactive /bin/bash

Then run commands such as htop to check for CPU/memory consumption or nvidia-smi to check GPU consumption.
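For example (illustrative commands only):

htop                   # interactive per-process CPU and memory view
nvidia-smi             # one-shot GPU utilization and memory report
watch -n 2 nvidia-smi  # refresh the GPU report every 2 seconds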

Anaconda Manual

Please refer to Anaconda Cheat Sheet.

IMPORTANT!!!!: You cannot install packages to the base environment. Please create your own environment so that you can install your own packages. For example,

conda create --name myEnvironment python=3.5
conda activate myEnvironment
conda install -c anaconda cudatoolkit
conda install -c anaconda tensorflow-gpu
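Once the environment is active, a quick sanity check that TensorFlow can see a GPU on a compute node might look like the following (the exact API depends on the TensorFlow version installed; tf.test.is_gpu_available() works on older releases):

srun --gpus=1 python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"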

Containerized Jobs

Step 1: Build a container image

You will need to build a container image then execute your code on the image. To build an image:

$ singularity pull docker://tensorflow/tensorflow:latest-gpu

Note: This must be executed from an ssh session on the login nodes. singularity pull will not work in a job context, such as a JupyterHub session or under srun.

Alternatively, to build an image from a Dockerfile:

$ container-builder-client Dockerfile tf.sif

Step 2: Run your program on the container

$ srun singularity exec --nv tf.sif python some_script.py

You can also put the command above in an sbatch script.
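A minimal sbatch wrapper for the same command might look like this (the partition and resource values below are placeholders; adjust them to your needs):

#!/bin/bash -l
#SBATCH -p general
#SBATCH --gpus=1
#SBATCH --mem=32GB
srun singularity exec --nv tf.sif python some_script.py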

To run an interactive shell in a container:

srun --pty singularity shell --nv tf.sif

If containerized programs work on your computer, they should work on GAIVI.

Jupyter Lab

Open GAIVI's JupyterHub in your browser. Make sure you are connected to the USF VPN first.

Contact

GAIVIADMIN@usf.edu