CLI Usage
Getting on the cluster
- SSH: ssh hpc.grit.ucsb.edu
  You’ll land on a compute node inside a Slurm-backed interactive session (the login host forwards you automatically).
- File transfers: scp, rsync, sftp, etc. still work as usual to hpc.grit.ucsb.edu.
Partitions you can use
- grit_nodes (default) – general use; includes CPU nodes and GPU-capable nodes.
- Other partitions exist but are group-restricted.
Resource basics (what Slurm expects)
- CPUs: -c <cores> per task, or -n <ntasks> total tasks.
- Memory:
  - --mem=<MB|GB> = per-node memory, or
  - --mem-per-cpu=<MB|GB> = per allocated CPU.
- Time: -t D-HH:MM:SS (set this realistically; backfill favors shorter jobs).
- Partition: -p grit_nodes (default).
- GPU: use --gres=shard:<number> for shared GPU access. See the GPU section below.
On this cluster the default memory per CPU is 4 GB if you don’t specify otherwise.
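As a rough illustration of how the CPU and memory flags above add up, the sketch below estimates the total memory of a single-task job. The helper name job_mem_gb is hypothetical (it is not a Slurm command), the 4 GB figure is the per-CPU default stated above, and whole-number GB values are assumed.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): estimate total memory in GB
# for a single-task job.
#   $1 = number of CPUs requested with -c
#   $2 = GB per CPU (--mem-per-cpu); falls back to the 4 GB cluster default
job_mem_gb() {
    local cpus=$1
    local per_cpu_gb=${2:-4}
    echo $(( cpus * per_cpu_gb ))
}

job_mem_gb 2      # -c 2 with no memory flag          -> prints 8
job_mem_gb 8 6    # -c 8 --mem-per-cpu=6G             -> prints 48
```

If the estimate is more than the job needs, an explicit --mem or a smaller --mem-per-cpu keeps the request tighter.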
GPU resources
The cluster has NVIDIA GPUs available on selected grit_nodes hosts. GPU access is scheduled through Slurm using GPU shards.
A GPU shard is a scheduled share of a physical GPU. Shards allow multiple jobs to use the same GPU when they do not need the entire device. Shards are useful for Jupyter notebooks, RStudio, MATLAB GPU checks, code-server sessions, small CUDA tests, and lighter interactive GPU work.
Important: GPU shards are not the same as NVIDIA MIG. They do not provide hard GPU memory isolation. If another shard job on the same GPU is busy, your job may see reduced GPU performance or less available GPU memory.
For GPU jobs, request shards with:
--gres=shard:4
Common shard sizes:
--gres=shard:1
--gres=shard:2
--gres=shard:4
--gres=shard:8
--gres=shard:16
--gres=shard:32
--gres=shard:48
Use a larger shard value for heavier GPU workloads. On the 48 GB GPUs, treat shard:48 as the upper bound, roughly a full-device-sized request; even then the GPU is scheduled as shared capacity, without MIG-style isolation.
Inside a GPU job, check what Slurm exposed with:
echo "$CUDA_VISIBLE_DEVICES"
nvidia-smi
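A job script can make the same check automatically before starting real work. The sketch below is one possible guard; the helper name count_visible_gpus is hypothetical, and it assumes that CUDA_VISIBLE_DEVICES holds a comma-separated device list when a GPU is exposed and is unset or empty otherwise.

```shell
#!/bin/bash
# Hypothetical helper: count the entries in CUDA_VISIBLE_DEVICES.
# Prints 0 when the variable is unset or empty (no GPU exposed to the job).
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return
    fi
    # Entries are comma-separated, so the field count is the device count.
    echo "$CUDA_VISIBLE_DEVICES" | awk -F, '{ print NF }'
}

# Warn (on stderr) if the job cannot see any GPU.
if [ "$(count_visible_gpus)" -eq 0 ]; then
    echo "No GPU visible; was the job submitted with --gres=shard:<number>?" >&2
fi
```

The count is the number of devices exposed, not the number of shards requested.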
See what’s available / what you’re running
sinfo -p grit_nodes -Nel # nodes, CPUs, memory, state
sinfo -N -p grit_nodes -o "%20N %10T %14C %10m %20G %R" # include GRES/GPU info
squeue -u $USER # your jobs
squeue --start # scheduler’s predicted start times
To see GPU/shard requests in the queue:
squeue -o "%.18i %.10P %.20u %.30j %.8T %.10M %.20b %R"
One-off noninteractive command
srun -p grit_nodes -c 2 --mem=8G -t 30:00 myprog --arg foo
One-off command with shared GPU shards:
srun -p grit_nodes -c 4 --mem=16G --gres=shard:4 -t 30:00 nvidia-smi
Batch jobs (recommended for longer runs)
Create a script job.sh:
#!/bin/bash
#SBATCH -p grit_nodes
#SBATCH -c 8
#SBATCH --mem=64G
#SBATCH -t 12:00:00
#SBATCH -J myjob
#SBATCH -o slurm-%j.out
module load mytool # if you use environment modules
python train.py --epochs 10
Example batch job with shared GPU shards:
#!/bin/bash
#SBATCH -p grit_nodes
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH --gres=shard:4
#SBATCH -t 01:00:00
#SBATCH -J gpu-test
#SBATCH -o slurm-%j.out
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi
python gpu_test.py
Submit + check:
sbatch job.sh
squeue -u $USER
tail -f slurm-<jobid>.out
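To avoid copying the job ID by hand, the submit and follow steps can be chained. The sketch below relies on sbatch --parsable, which prints only the job ID (optionally followed by a semicolon and the cluster name). The function names are hypothetical, and it assumes the script logs to slurm-<jobid>.out in the current directory, as job.sh above does.

```shell
#!/bin/bash
# Hypothetical helper: strip the optional ";cluster" suffix that
# "sbatch --parsable" may append to the job ID.
jobid_only() {
    echo "${1%%;*}"
}

# Hypothetical wrapper: submit a script, then follow its log file.
# Assumes the log is written to slurm-<jobid>.out in the current directory.
submit_and_follow() {
    local out jobid
    out=$(sbatch --parsable "$1") || return 1   # stop if submission fails
    jobid=$(jobid_only "$out")
    echo "Submitted job $jobid"
    # -F keeps retrying until the log file appears.
    tail -F "slurm-${jobid}.out"
}

# Usage: submit_and_follow job.sh
```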
Job arrays (many similar runs)
sbatch --array=0-99 -p grit_nodes -c 2 --mem=8G -t 1:00:00 job.sh
Inside job.sh use $SLURM_ARRAY_TASK_ID to index your inputs.
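One common way to do that indexing is to keep one input per line in a text file and let each array task pick its own line. This is a minimal sketch; the helper name input_for_task and the file name inputs.txt are placeholders, and it assumes a zero-based array such as --array=0-99.

```shell
#!/bin/bash
# Hypothetical helper: print line (index + 1) of a list file, so that
# array task 0 maps to the first line.
#   $1 = zero-based index (normally $SLURM_ARRAY_TASK_ID)
#   $2 = path to a text file with one input per line
input_for_task() {
    local index=$1 list=$2
    sed -n "$(( index + 1 ))p" "$list"
}

# Inside job.sh (inputs.txt and myprog are placeholders):
#   input=$(input_for_task "$SLURM_ARRAY_TASK_ID" inputs.txt)
#   myprog --arg "$input"
```

For a one-based array such as --array=1-50, pass the task ID minus one, or drop the "+ 1" in the helper.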
Cancel / modify
scancel <jobid> # cancel one
scancel -u $USER # cancel all yours
scontrol update JobId=<jobid> TimeLimit=02:00:00 # shorten time limit
Accounting & live stats
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,Timelimit,MaxRSS,ReqMem,AllocCPUS,ReqTRES,AllocTRES
sstat -j <jobid>.batch --format=AveCPU,AveRSS,MaxRSS,MaxVMSize,MinCPU
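Comparing Elapsed with Timelimit from sacct is a quick way to judge whether a time limit was realistic. The sketch below converts both to seconds so they can be compared numerically. The helper name to_seconds is hypothetical, and it assumes values in HH:MM:SS or D-HH:MM:SS form; special values such as UNLIMITED are not handled.

```shell
#!/bin/bash
# Hypothetical helper: convert HH:MM:SS or D-HH:MM:SS to seconds.
# Splitting on both "-" and ":" gives 4 fields when a day count is present.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# Example: a job that ran 00:45:00 against a 12:00:00 limit.
elapsed=$(to_seconds 00:45:00)
limit=$(to_seconds 12:00:00)
echo "used $(( 100 * elapsed / limit ))% of the time limit"
```

A job that consistently uses only a small fraction of its limit is a candidate for a shorter -t, which can help it start sooner under backfill.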
Common “why is my job pending?” reasons
- (Resources): not enough free CPUs, memory, GPU shards, or a compatible node right now. Try a shorter -t, fewer CPUs, less --mem, or fewer GPU shards.
- (Priority): your job is eligible, but other pending jobs currently have higher Slurm priority.
- (BeginTime): the job's earliest start time has not been reached yet. Run squeue --start to see the ETA; a shorter -t or a smaller resource request can help a job start sooner.
- Constraints or node eligibility: very large per-node requests (CPUs, --mem, or GPU shards) may only fit on the biggest nodes, which can lengthen wait time.
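To see which of these reasons applies to your own jobs, squeue can print just the reason field (%r) for pending jobs. The sketch below tallies those reasons; both function names are hypothetical, and the wrapper only works where squeue is available.

```shell
#!/bin/bash
# Hypothetical helper: tally reason strings read from stdin (one per line)
# and print "<count> <reason>", most frequent first.
count_reasons() {
    awk '{ count[$0]++ } END { for (r in count) print count[r], r }' | sort -rn
}

# Hypothetical wrapper: summarize why your pending jobs are waiting.
# -h drops the header, -t PENDING keeps pending jobs, %r prints the reason.
my_pending_reasons() {
    squeue -h -u "$USER" -t PENDING -o "%r" | count_reasons
}

# Usage on the cluster: my_pending_reasons
```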
Good citizenship / performance tips
- Prefer multiple smaller tasks over one huge single-node grab when you can.
- Keep single-node requests well under a node’s total RAM/cores unless you truly need them.
- Use smaller GPU shard requests for light interactive work. Do not request large shard counts unless your workload really needs them.
- Set realistic time limits; the backfill scheduler starts shorter jobs sooner.
Examples you can paste
# 1) CPU-only batch job with array:
sbatch -p grit_nodes -c 2 --mem=8G -t 2:00:00 --array=1-50 run_sim.sh
# 2) Memory-per-CPU style (8 CPUs × 6 GB each = 48 GB/node):
srun -p grit_nodes -c 8 --mem-per-cpu=6G -t 1:00:00 --pty bash -l
# 3) Light shared-GPU interactive test:
srun -p grit_nodes -c 4 --mem=16G --gres=shard:4 -t 1:00:00 --pty bash -l
# 4) Heavier shared-GPU request:
sbatch -p grit_nodes -c 8 --mem=64G --gres=shard:32 -t 4:00:00 gpu_job.sh
# 5) Check predicted start times:
squeue --start
FAQ for this cluster
- Do I need to salloc first? No. SSH gives you a Slurm-backed shell. Use srun for bigger interactive bursts, or sbatch for long runs.
- VS Code / PyCharm remote? Not supported on the login host; use terminal + Slurm (srun/sbatch) instead, or use the Open OnDemand code-server app.
- Which partition do I use? grit_nodes unless you were explicitly added to a project-specific partition.
- How do I request a GPU? Use --gres=shard:<number>, for example --gres=shard:4.
- Are GPU shards exclusive? No. Shards are shared GPU scheduling units. They help share limited GPU resources, but they do not provide hard GPU memory isolation.