====== Quickstart ======

=====Job Submission – sequential tasks=====

<code bash>
$ sbatch sample_script.sh
$ cat sample_script.sh
#!/bin/bash -l
# All options below are recommended
#SBATCH -p general                          # run on partition general
#SBATCH --cpus-per-task=32                  # 32 CPUs per task
#SBATCH --mem=100GB                         # 100GB of memory
#SBATCH --gpus=4                            # 4 GPUs
#SBATCH --mail-user=bulls@usf.edu           # email for notifications
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE  # events for notifications
srun python first_script.py   # 1st task
srun second_script.sh         # 2nd task
srun julia something.jl       # 3rd task
</code>

Note: the ''-p'' (partition) and ''-w'' (node) options must match the nodes listed by the ''sinfo'' command. E.g., ''-p Extended -w GPU17'' would not work because GPU17 is not in the Extended partition.

__**Important!!!!**__: Please do not place ''#SBATCH'' options after any Linux command (such as ''cd'', ''srun'', etc.). Slurm stops parsing ''#SBATCH'' options as soon as it encounters the first Linux command.

=====Job Submission – parallel tasks=====

<code bash>
$ sbatch sample_script.sh
$ cat sample_script.sh
#!/bin/bash -l
#SBATCH --cpus-per-task=32   # 32 CPUs per task
#SBATCH --mem=100GB          # 100GB of memory
#SBATCH --gpus-per-task=4    # 4 GPUs per task
#SBATCH --ntasks=2           # specify 2 parallel tasks
srun --ntasks=1 --cpus-per-task=16 --exact python first_script.py &
srun --ntasks=1 --cpus-per-task=16 --exact second_script.sh &
wait
</code>

=====Job Submission – Priority and Partitions=====

Slurm has a notion of "partitions", which determine which GAIVI nodes your job can run on and what priority it has. Use ''sinfo'' to see which partitions you have access to, e.g.

<code>
$ sinfo
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
ScoreLab   up     infinite    1      idle   GPU6
general*   up     7-00:00:00  2      idle   GPU[6,8]
</code>

By default, jobs are submitted to the general partition. To submit to an alternate partition, include the ''-p'' option in your submission. For example, in an SBATCH script:

<code bash>
#SBATCH -p ScoreLab
</code>

(Visit the corresponding page of the Job Submission section for more detail on the partitions in this cluster.)

=====View pending and running jobs=====

<code bash>
squeue
squeue -j [jobID]
</code>

=====Cancel a job=====

<code bash>
scancel [jobID]
</code>

=====Checking compute nodes for all resources=====

<code bash>
sinfo -N --format="%10N | %10t | %7X | %7Y | %7Z | %9m | %19G"
</code>

=====Checking compute nodes for current consumed resources=====

First create an interactive bash shell attached to a running job:

<code bash>
srun --pty --jobid=[jobID] --interactive /bin/bash
</code>

Then run commands such as ''htop'' to check CPU/memory consumption or ''nvidia-smi'' to check GPU consumption.

=====Anaconda Manual=====

Please refer to the [[https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf|Anaconda Cheat Sheet]].

**__IMPORTANT!!!!__**: You cannot install packages into the base environment. Please create your own environment so that you can install your own packages. For example:

<code bash>
conda create --name myEnvironment python=3.5
conda activate myEnvironment
conda install -c anaconda cudatoolkit
conda install -c anaconda tensorflow-gpu
</code>

=====Containerized Jobs=====

====Step 1: Build a container image====

You will need to build a container image and then execute your code inside that image. To build an image:

  * Option 1: Pull an image from Docker Hub that has everything you need:

<code bash>
$ singularity pull docker://tensorflow/tensorflow:latest-gpu
</code>

**Note**: This must be executed from an ssh session on the login nodes. ''singularity pull'' will not work in a job context, such as a JupyterHub session or under ''srun''.

  * Option 2: Create a Dockerfile and build it on your machine. Here is an example of a Dockerfile with [[https://towardsdatascience.com/how-to-compile-tensorflow-1-12-on-ubuntu-16-04-using-docker-6ca2d60d7567|Tensorflow]].
  * Option 3: Build an image by using our build service on GAIVI (note: we are performing maintenance on the container builder service at the moment):

<code bash>
$ container-builder-client Dockerfile tf.sif
</code>

====Step 2: Run your program on the container====

<code bash>
$ srun singularity exec --nv tf.sif python some_script.py
</code>

You can also put the command above in an sbatch script.
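A minimal sketch of such a script is shown below. It reuses the ''tf.sif'' image and ''some_script.py'' from the examples above; the partition and resource values are placeholders only, so adjust them to your own job.

<code bash>
#!/bin/bash -l
#SBATCH -p general            # partition to run on (check sinfo for access)
#SBATCH --cpus-per-task=8     # placeholder CPU count
#SBATCH --mem=32GB            # placeholder memory request
#SBATCH --gpus=1              # placeholder GPU count

# --nv exposes the host GPUs to the container
srun singularity exec --nv tf.sif python some_script.py
</code>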
To run an interactive shell in a container:

<code bash>
srun --pty singularity shell --nv tf.sif /bin/bash
</code>

If containerized programs work on your computer, they should work on GAIVI.

=====Jupyter Lab=====

Please open [[https://jupyterhub.gaivi.cse.usf.edu/|GAIVI's Jupyter Hub]] in your browser. Make sure you are connected to the USF VPN.

=====Contact=====

GAIVIADMIN@usf.edu