GPU Resources

The HPC cluster has NVIDIA L40S GPUs and an NVIDIA A30 spread across various hosts and resources.

Interactive Apps

To access GPU resources via the Open OnDemand web UI, use the GPU options at the bottom of the interactive session form.

There are two GPU request modes:

Full GPU: reserves an entire GPU for your job. Use this for large training jobs, GPU-heavy applications, or jobs that need most or all GPU memory.

Shared GPU shards: requests a portion of a GPU. Use this for light interactive GPU work, testing, notebooks, MATLAB GPU checks, or jobs that do not need an entire GPU.

GPU shards allow multiple jobs to share the same physical GPU. Shards are scheduled by Slurm, but they are not the same as NVIDIA MIG and do not provide hard GPU memory isolation. If your job may use a large amount of GPU memory, request a full GPU instead.

SLURM CLI

GPU resources can also be accessed via the SLURM CLI. Below are some examples.

Request a full exclusive GPU:

#!/bin/bash
#SBATCH -J gpu-l40s-test
#SBATCH -p grit_nodes
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH -t 01:00:00

nvidia-smi
<your command here>

Full GPU one-liner:

srun -p grit_nodes --gres=gpu:1 --cpus-per-task=4 --mem=16G --pty <your command here>

Request shared GPU shards:

#!/bin/bash
#SBATCH -J gpu-shard-test
#SBATCH -p grit_nodes
#SBATCH --gres=shard:4
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH -t 01:00:00

nvidia-smi
<your command here>

Shared GPU shard one-liner:

srun -p grit_nodes --gres=shard:4 --cpus-per-task=4 --mem=16G --pty <your command here>

Notes

GPU resources work differently from CPU and RAM resources. A request such as --gres=gpu:1 reserves a full GPU for the job. A request such as --gres=shard:4 requests shared GPU capacity and allows multiple jobs to use the same physical GPU.

Use --gres=gpu:1 when you need exclusive access to a GPU or expect heavy GPU memory usage. Use --gres=shard:<number> for lighter workloads that can share a GPU with other jobs.
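
The choice between the two request types can be wrapped in a small shell helper. This is only a sketch: gres_flag is a hypothetical function name, not part of Slurm, and "bash" stands in for your own command. It prints the srun commands rather than running them, so it is safe to try on a login node.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): build a --gres flag
# from a mode ("full" or "shared") and a count (defaults to 1).
gres_flag() {
    local mode="$1" count="${2:-1}"
    case "$mode" in
        full)   printf -- '--gres=gpu:%s\n' "$count" ;;
        shared) printf -- '--gres=shard:%s\n' "$count" ;;
        *)      echo "usage: gres_flag full|shared [count]" >&2; return 1 ;;
    esac
}

# Print the resulting srun commands instead of executing them.
# "bash" is a placeholder for <your command here>.
echo "srun -p grit_nodes $(gres_flag full 1) --cpus-per-task=4 --mem=16G --pty bash"
echo "srun -p grit_nodes $(gres_flag shared 4) --cpus-per-task=4 --mem=16G --pty bash"
```

The partition name and resource values are copied from the examples above; adjust them to your job.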

Shared GPU shards are intended to improve access to limited GPU resources. They are not a guarantee of fixed GPU performance or isolated GPU memory. If another shard job on the same GPU is busy, your job may ~~change as we~~ see ~~increased~~reduced ~~use~~GPU ofperformance.
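
To see how busy a GPU is from inside a job, you can query utilization and memory with nvidia-smi. The sketch below uses standard nvidia-smi query fields; the function name gpu_usage is illustrative. It assumes the reported memory and utilization figures are device-wide, so on a shared GPU they may include usage from other jobs.

```shell
#!/bin/bash
# Report utilization and memory for the GPU(s) visible to this job.
# gpu_usage is an illustrative name; the query fields are nvidia-smi options.
gpu_usage() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; run this inside a GPU job" >&2
        return 1
    fi
    nvidia-smi \
        --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
        --format=csv
}

# Do not abort the surrounding script if the check is unavailable.
gpu_usage || true
```

If memory.used is already close to memory.total before your own work starts, a full GPU request is probably the safer choice.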

You can check which GPU Slurm exposed to your job with:

echo $CUDA_VISIBLE_DEVICES
nvidia-smi
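
In a batch script, the same check can be used as a guard before starting GPU work. This is a minimal sketch; have_gpu is a hypothetical helper name, and it assumes that an empty or unset CUDA_VISIBLE_DEVICES means no GPU was allocated to the job.

```shell
#!/bin/bash
# Return 0 if at least one GPU was exposed to this job, 1 otherwise.
# have_gpu is a hypothetical helper, not a Slurm or CUDA command.
have_gpu() {
    [ -n "${CUDA_VISIBLE_DEVICES:-}" ]
}

if have_gpu; then
    echo "GPU(s) visible to this job: ${CUDA_VISIBLE_DEVICES}"
else
    # In a real batch script you might "exit 1" here to stop early.
    echo "No GPU visible; check the --gres option in your request" >&2
fi
```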