====== Welcome to GAIVI's Documentation! ======

This project brings together investigators with research thrusts in several core disciplines of computer science and engineering: big data management, scientific computing, system security, hardware design, data mining, computer vision, and pattern recognition. GAIVI can accelerate existing research and enable ground-breaking new research that shares the common theme of massively parallel computing at USF.

===== What is on this site =====

  * [[#notice_about_preemption|Notice About Preemption]]
  * [[#request_to_use_gaivi|Request to use GAIVI]]
  * [[#login_information|Login information]]
  * [[#priority_and_contributions|Priority and Contributions]]
  * [[#system_information|System information]]
  * [[main:gaivi:1.manual|User manual]]
    * [[main:gaivi:1.manual:1.quickstart|Quickstart]]
    * [[main:gaivi:1.manual:2.Job Submission|Job Submission]]
    * [[main:gaivi:1.manual:3.Data Storage|Data Storage]]
    * [[main:gaivi:1.manual:4.Working Environment|Working Environment]]
    * [[main:gaivi:1.manual:2.job_submission#jupyter_notebook_jobs|Jupyterhub]]
    * [[main:gaivi:1.manual:5.Workshop Recordings|Workshop Recordings]]
  * [[main:gaivi:2.discussions:start|Discussions]]
    * [[main:gaivi:2.discussions:1.FAQs|Common Bugs]]
    * [[main:gaivi:2.discussions:2.suggestions|Suggestions]]
  * [[main:gaivi:3.user_experience|Share your experience]] (e.g. how to configure IBM Federated Learning on GAIVI)
    * [[main:gaivi:3.user_experience:1.list|List of how-to's from all users]]
    * [[main:gaivi:3.user_experience:2.instructions|Instructions to create a new page]]

===== Notice About Preemption =====

GAIVI has job preemption and restarting enabled. This means that a submitted compute job can be interrupted (effectively, cancelled), returned to the queue, and restarted if the hardware it is using is needed by a higher-priority job. A preempted job has 5 minutes of grace time to clean up. Specifically, the job receives a SIGTERM immediately and a SIGKILL 5 minutes later, so your job should be prepared to exit gracefully within 5 minutes of receiving a SIGTERM at any time. Correspondingly, if you are trying to preempt a job, please be prepared to wait up to 5 minutes for the preemption to take effect. We recommend submitting most jobs as sbatch scripts that checkpoint regularly. If your job cannot work within these restrictions, there is a //nopreempt// partition where jobs are safe from preemption, but it includes only a small handful of nodes.

===== Request to use GAIVI =====

Please create a request by filling out [[https://cseit-usf.atlassian.net/servicedesk/customer/portals|this form]]. A valid [[https://netid.usf.edu/|USF NetID]] is required; it will be your username for login. If you are a USF student, please also include your supervisor in the request; we will need their approval to proceed.

===== Login Information =====

First, connect to the USF VPN. Then connect with any [[https://en.wikipedia.org/wiki/Comparison_of_SSH_clients|SSH client]] you prefer, using the following details:

  * Host name: //gaivi.cse.usf.edu//
  * User name: your NetID
  * Password: your NetID password

===== Priority and Contributions =====

GAIVI offers higher submission priority to teams that have contributed significantly to the development of the cluster. A significant hardware contribution should be worth at least $25k; a standard example is a whole compute node with current NVIDIA GPU(s). The contributing team retains a degree of exclusive priority on a contributed node until one year after the end of the node's original warranty period. They also gain access to the Contributors partition, which can submit to all nodes in the cluster at higher priority than general users. One year after the warranty period ends, the node moves to the general partition for all GAIVI users.
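The SIGTERM-then-SIGKILL sequence described under Notice About Preemption can be handled with a signal trap in an sbatch script. Below is a minimal sketch, not an official template: the job name, checkpoint path, step counter, and ''sleep'' work loop are placeholders to adapt to your own workload.

```shell
#!/bin/bash
#SBATCH --job-name=example      # placeholder job name
#SBATCH --requeue               # let Slurm requeue the job after preemption

CKPT="$HOME/example.step"       # placeholder checkpoint file

# Preemption delivers SIGTERM immediately and SIGKILL 5 minutes later,
# so save state and exit cleanly as soon as SIGTERM arrives.
save_and_exit() {
    echo "$STEP" > "$CKPT"
    echo "Preempted at step $STEP; checkpoint written to $CKPT"
    exit 0
}
trap save_and_exit TERM

# Resume from the last checkpoint if one exists.
STEP=0
[ -f "$CKPT" ] && STEP=$(cat "$CKPT")

while [ "$STEP" -lt 100 ]; do
    STEP=$((STEP + 1))
    sleep 1                     # stand-in for one unit of real work
done
rm -f "$CKPT"                   # finished: clear the checkpoint
```

On requeue, the script reads the checkpoint and resumes where it left off instead of starting over; jobs that checkpoint this way lose at most one unit of work to a preemption.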
For more details on GAIVI's partitions, see [[main:gaivi:1.manual:start#partitions_of_compute_nodes|the user manual]].

===== System Information =====

^ Node name ^ Summary of role ^ CPU cores ^ Processor type and speed ^ Memory ^ Card Info ^ GPU Memory ^
| GAIVI1 | front node of the cluster | 12 | Dual Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz | 64 GB | - | - |
| GAIVI2 | front node of the cluster | 20 | Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz | 128 GB | - | - |
| GPU1 | compute node with GPUs | 96 | AMD EPYC 7352 24-Core | 256 GB | 3 * NVIDIA A100 | 240 GB |
| GPU2 | compute node with GPUs | 96 | AMD EPYC 7352 24-Core | 256 GB | 2 * NVIDIA A100 | 160 GB |
| GPU3 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU4 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU6 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU7 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU8 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU9 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU11 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU12 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU13 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU14 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU15 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 384 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU16 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 384 GB | 4 * TITAN X (Maxwell) | 48 GB |
| GPU17 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 4 * TITAN X (Maxwell) + 4 * TITAN V | 96 GB |
| GPU18 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Pascal) | 96 GB |
| GPU19 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Pascal) | 96 GB |
| GPU21 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * 1080 Ti | 88 GB |
| GPU22 | compute node with GPUs | 20 | Dual Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz | 1024 GB | 8 * 1080 Ti | 88 GB |
| GPU41 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 4 * TITAN X (Maxwell) | 48 GB |
| GPU42 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz | 192 GB | 4 * Titan RTX | 96 GB |
| GPU43 | compute node with GPUs | 64 | AMD EPYC 7662 64-Core | 512 GB | 4 * A40 | 192 GB |
| GPU44 | compute node with GPUs | 32 | AMD EPYC 7532 32-Core | 512 GB | 4 * A40 | 192 GB |
| GPU45 | compute node with GPUs | 24 | AMD EPYC 7413 24-Core | 256 GB | 1 * A100 | 80 GB |
| GPU46 | compute node with GPUs | 96 | AMD EPYC 7352 24-Core | 2 TB | 3 * A100 | 240 GB |
| GPU47 | compute node with GPUs | 32 | AMD EPYC 7513 32-Core | 512 GB | 2 * A100 | 160 GB |
| GPU48 | compute node with GPUs | 128 | AMD EPYC 9554 64-Core | 768 GB | 6 * H100 | 480 GB |
| GPU49 | compute node with GPUs | 32 | Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz | 128 GB | 2 * A100 + 2 * L40S | 256 GB |
| PHI1 | compute node with Intel PHI cards | 32 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 7 * Phi 5110P | 56 GB |
| STORAGE2 | storage node | 24 | Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz | 512 GB | - | - |
| STORAGE3 | storage node | 16 | Dual Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz | 196 GB | - | - |
| STORAGE4 | storage node | 32 | AMD EPYC 7313P 16-Core | 128 GB | - | - |
| STORAGE6 | storage node (NVMe) | 32 | AMD EPYC 9354P 32-Core | 384 GB | - | - |