Quickstart
Job Submission – sequential tasks
$ sbatch sample_script.sh
$ cat sample_script.sh
#!/bin/bash -l
# All options below are recommended
#SBATCH -p general                          # run on partition general
#SBATCH --cpus-per-task=32                  # 32 CPUs per task
#SBATCH --mem=100GB                         # 100GB per task
#SBATCH --gpus=4                            # 4 GPUs
#SBATCH --mail-user=bulls@usf.edu           # email for notifications
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE  # events for notifications
srun python first_script.py    # 1st task
srun second_script.sh          # 2nd task
srun julia something.jl        # 3rd task
Note: -p and -w must match the nodes listed by the sinfo command. E.g., -p Extended -w GPU17 would not work because GPU17 is not in the Extended partition.
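For example, based on the sinfo output shown in the Partitions subsection below (node GPU8 is in the general partition), a consistent pairing would be:
#SBATCH -p general
#SBATCH -w GPU8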
Important: Please do not place #SBATCH options after any command (such as cd, srun, etc.). Slurm stops parsing for #SBATCH options after it sees the first command.
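As a sketch of what this means, the second #SBATCH line below would be silently ignored because it appears after the first command (the path is just a placeholder):
#!/bin/bash -l
#SBATCH -p general    # parsed normally
cd /some/work/dir     # first command: Slurm stops looking for #SBATCH options here
#SBATCH --gpus=4      # ignored because it appears after a command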
Job Submission – parallel tasks
$ sbatch sample_script.sh
$ cat sample_script.sh
#!/bin/bash -l
#SBATCH --cpus-per-task=32   # 32 CPUs per task
#SBATCH --mem=100GB          # 100GB per task
#SBATCH --gpus-per-task=4    # 4 GPUs per task
#SBATCH --ntasks=2           # specify 2 parallel tasks
srun --ntasks=1 --cpus-per-task=16 --exact python first_script.py &
srun --ntasks=1 --cpus-per-task=16 --exact second_script.sh &
wait
Job Submission – Priority and Partitions
Slurm has a notion of “Partitions”, which determine which GAIVI nodes your job can run on and what priority it has. Use “sinfo” to see which partitions you have access to, e.g.
$ sinfo
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
ScoreLab   up     infinite    1      idle   GPU6
general*   up     7-00:00:00  2      idle   GPU[6,8]
By default, jobs are submitted to the general partition. To submit to an alternate partition, include the “-p” option in your submission. For example, in an SBATCH script
#SBATCH -p ScoreLab
(See the Job Submission section of this documentation for more detail on the partitions in this cluster.)
View pending and running jobs
squeue
squeue -j [jobID]
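To list only your own jobs, you can also filter by user name (replace your_username with your actual user name):
squeue -u your_username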
Cancel a job
scancel [jobID]
Checking compute nodes for all resources
sinfo -N --format="%10N | %10t | %7X | %7Y | %7Z | %9m | %19G"
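For reference, the format fields used above roughly map to the following node attributes (see the sinfo man page for the full list):
- %N: node name
- %t: node state
- %X: sockets per node
- %Y: cores per socket
- %Z: threads per core
- %m: memory per node (MB)
- %G: generic resources (GRES), e.g. GPUs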
Checking compute nodes for current consumed resources
First create an interactive bash shell to a running job:
srun --pty --jobid <jobID> --interactive /bin/bash
Then run commands such as htop to check for CPU/memory consumption or nvidia-smi to check GPU consumption.
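For example, once the shell is attached to the job you might run:
htop          # CPU and memory consumption of the job's processes
nvidia-smi    # GPU utilization and memory
exit          # leave the shell; the job itself keeps running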
Anaconda Manual
Please refer to the Anaconda Cheat Sheet.
IMPORTANT: You cannot install packages into the base environment. Please create your own environment so that you can install your own packages. For example,
conda create --name myEnvironment python=3.5
conda activate myEnvironment
conda install -c anaconda cudatoolkit
conda install -c anaconda tensorflow-gpu
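Once the environment exists, a minimal sbatch script that uses it might look like the following sketch (this assumes conda has been initialized for your login shell; train.py is a placeholder for your own script):
#!/bin/bash -l
#SBATCH -p general
#SBATCH --gpus=1
#SBATCH --mem=16GB
conda activate myEnvironment   # environment created above
srun python train.py           # placeholder for your own script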
Containerized Jobs
Step 1: Build a container image
You will need to build a container image and then execute your code in the container. To build an image:
- Option 1: pull from Docker Hub an image that has everything you need:
$ singularity pull docker://tensorflow/tensorflow:latest-gpu
Note: This must be executed from an ssh session on the login nodes. singularity pull will not work in a job context, such as a JupyterHub session or under srun.
- Option 2: Create a Dockerfile and build it on your own machine. Here is an example of a Dockerfile with TensorFlow.
- Option 3: Build an image using our build service on GAIVI. (We are currently performing maintenance on the container builder service.)
$ container-builder-client Dockerfile tf.sif
Step 2: Run your program on the container
$ srun singularity exec --nv tf.sif python some_script.py
You can also put the command above in an sbatch script.
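For example, a minimal sbatch wrapper around the same command might look like this sketch (the resource options are illustrative; some_script.py is a placeholder for your own script):
#!/bin/bash -l
#SBATCH -p general
#SBATCH --gpus=1
#SBATCH --mem=16GB
srun singularity exec --nv tf.sif python some_script.py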
To run an interactive shell in a container:
srun --pty singularity shell --nv tf.sif /bin/bash
If containerized programs work on your computer, they should work on GAIVI.
Jupyter Lab
Please open GAIVI's JupyterHub in your browser. Make sure you are connected to the USF VPN.
Contact
GAIVIADMIN@usf.edu