====== Welcome to GAIVI's Documentation! ======

This project brings together investigators with research thrusts in several core disciplines of computer science and engineering: big data management, scientific computing, system security, hardware design, data mining, computer vision, and pattern recognition. GAIVI can accelerate existing research and enable ground-breaking new research that shares the common theme of massively parallel computing at USF.

===== What is on this site =====

  * [[#notice_about_preemption|Notice About Preemption]]
  * [[#request_to_use_gaivi|Request to use GAIVI]]
  * [[#login_information|Login information]]
  * [[#priority_and_contributions|Priority and Contributions]]
  * [[#system_information|System information]]
  * [[main:gaivi:1.manual|User manual]]
    * [[main:gaivi:1.manual:1.quickstart|Quickstart]]
    * [[main:gaivi:1.manual:2.Job Submission|Job Submission]]
    * [[main:gaivi:1.manual:3.Data Storage|Data Storage]]
    * [[main:gaivi:1.manual:4.Working Environment|Working Environment]]
    * [[main:gaivi:1.manual:2.job_submission#jupyter_notebook_jobs|Jupyterhub]]
    * [[main:gaivi:1.manual:5.Workshop Recordings|Workshop Recordings]]
  * [[main:gaivi:2.discussions:start|Discussions]]
    * [[main:gaivi:2.discussions:1.FAQs|Common Bugs]]
    * [[main:gaivi:2.discussions:2.suggestions|Suggestions]]
  * [[main:gaivi:3.user_experience|Share your experience]] (e.g. how to configure IBM Federated Learning on GAIVI)
    * [[main:gaivi:3.user_experience:1.list|List of how-to's from all users]]
    * [[main:gaivi:3.user_experience:2.instructions|Instructions to create a new page]]

===== Notice About Preemption =====

GAIVI has job preemption and restarting enabled. This means that a submitted compute job can be interrupted (effectively, cancelled), returned to the queue, and restarted if the hardware it is using is needed by a higher-priority job. A preempted job has 5 minutes of grace time to clean up. Specifically, the job receives a SIGTERM immediately and a SIGKILL 5 minutes later, so your job should be prepared to exit gracefully within 5 minutes of receiving a SIGTERM at any time. Correspondingly, if you are trying to preempt a job, please be prepared to wait up to 5 minutes for the preemption to take effect. We recommend submitting most jobs as sbatch scripts that checkpoint regularly. If your job cannot work within these restrictions, there is a //nopreempt// partition where jobs are safe from preemption, but it includes only a small handful of nodes.

===== Request to use GAIVI =====

Please create a request by filling out [[https://cseit-usf.atlassian.net/servicedesk/customer/portals|this form]]. A valid [[https://netid.usf.edu/|USF NetID]] is required; it will be your username for login. If you are a USF student, please also include your supervisor in the request; we will need their approval to proceed.

===== Login Information =====

First, connect to the USF VPN. Then connect with any [[https://en.wikipedia.org/wiki/Comparison_of_SSH_clients|SSH client]] you prefer, using the following details:

  * Host name: //gaivi.cse.usf.edu//
  * User name: your NetID
  * Password: your NetID password

===== Priority and Contributions =====

GAIVI offers higher submission priority to teams that have contributed significantly to the development of the cluster. A significant hardware contribution should be worth at least $25k; a standard example is a whole compute node with current NVIDIA GPU(s). The contributing team retains a degree of exclusive priority on a contributed node until one year after the end of the node's original warranty period. They also gain access to the Contributors partition, which can submit to all nodes in the cluster at higher priority than general users. One year after the warranty period ends, the node moves to the general partition for all GAIVI users.
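The SIGTERM-then-SIGKILL sequence described under Notice About Preemption can be handled with a signal trap in an sbatch script. Below is a minimal sketch, not an official template: the job name, checkpoint path, step counter, and ''sleep'' work loop are placeholders to adapt to your own workload.

```shell
#!/bin/bash
#SBATCH --job-name=example      # placeholder job name
#SBATCH --requeue               # let Slurm requeue the job after preemption

CKPT="$HOME/example.step"       # placeholder checkpoint file

# Preemption delivers SIGTERM immediately and SIGKILL 5 minutes later,
# so save state and exit cleanly as soon as SIGTERM arrives.
save_and_exit() {
    echo "$STEP" > "$CKPT"
    echo "Preempted at step $STEP; checkpoint written to $CKPT"
    exit 0
}
trap save_and_exit TERM

# Resume from the last checkpoint if one exists.
STEP=0
[ -f "$CKPT" ] && STEP=$(cat "$CKPT")

while [ "$STEP" -lt 100 ]; do
    STEP=$((STEP + 1))
    sleep 1                     # stand-in for one unit of real work
done
rm -f "$CKPT"                   # finished: clear the checkpoint
```

On requeue, the script reads the checkpoint and resumes where it left off instead of starting over; jobs that checkpoint this way lose at most one unit of work to a preemption.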
For more details on GAIVI's partitions, see [[main:gaivi:1.manual:start#partitions_of_compute_nodes|the user manual]].

===== System Information =====

^ Node name ^ Summary of role ^ CPU cores ^ Processor type and speed ^ Memory ^ Card Info ^ GPU Memory ^
| GAIVI1 | front node of the cluster | 12 | Dual Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz | 64 GB | - | - |
| GAIVI2 | front node of the cluster | 20 | Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz | 128 GB | - | - |
| GPU1 | compute node with GPUs | 96 | AMD EPYC 7352 24-Core | 256 GB | 3 * NVIDIA A100 | 240 GB |
| GPU2 | compute node with GPUs | 96 | AMD EPYC 7352 24-Core | 256 GB | 2 * NVIDIA A100 | 160 GB |
| GPU3 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU4 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU6 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU7 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU8 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU9 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 128 GB | 4 * 1080 Ti | 44 GB |
| GPU11 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU12 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU13 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU14 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU15 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 384 GB | 8 * TITAN X (Maxwell) | 96 GB |
| GPU16 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 384 GB | 4 * TITAN X (Maxwell) | 48 GB |
| GPU17 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 4 * TITAN X (Maxwell) + 4 * TITAN V | 96 GB |
| GPU18 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Pascal) | 96 GB |
| GPU19 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * TITAN X (Pascal) | 96 GB |
| GPU21 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 8 * 1080 Ti | 88 GB |
| GPU22 | compute node with GPUs | 20 | Dual Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz | 1024 GB | 8 * 1080 Ti | 88 GB |
| GPU41 | compute node with GPUs | 16 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 4 * TITAN X (Maxwell) | 48 GB |
| GPU42 | compute node with GPUs | 32 | Dual Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz | 192 GB | 4 * Titan RTX | 96 GB |
| GPU43 | compute node with GPUs | 64 | AMD EPYC 7662 64-Core | 512 GB | 4 * A40 | 192 GB |
| GPU44 | compute node with GPUs | 32 | AMD EPYC 7532 32-Core | 512 GB | 4 * A40 | 192 GB |
| GPU45 | compute node with GPUs | 24 | AMD EPYC 7413 24-Core | 256 GB | 1 * A100 | 80 GB |
| GPU46 | compute node with GPUs | 96 | AMD EPYC 7352 24-Core | 2 TB | 3 * A100 | 240 GB |
| GPU47 | compute node with GPUs | 32 | AMD EPYC 7513 32-Core | 512 GB | 2 * A100 | 160 GB |
| GPU48 | compute node with GPUs | 128 | AMD EPYC 9554 64-Core | 768 GB | 6 * H100 | 480 GB |
| GPU49 | compute node with GPUs | 32 | Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz | 128 GB | 2 * A100 + 2 * L40S | 256 GB |
| PHI1 | compute node with Intel PHI cards | 32 | Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | 128 GB | 7 * Phi 5110P | 56 GB |
| STORAGE2 | storage node | 24 | Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz | 512 GB | - | - |
| STORAGE3 | storage node | 16 | Dual Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz | 196 GB | - | - |
| STORAGE4 | storage node | 32 | AMD EPYC 7313P 16-Core | 128 GB | - | - |
| STORAGE6 | storage node (NVMe) | 32 | AMD EPYC 9354P 32-Core | 384 GB | - | - |