FAQ

Selecting Job Resources

When starting a job, you are given options for the amount of CPU and RAM that will be allotted to the job. Selecting too few resources can cause the job to fail, and selecting too many can leave the job waiting in the queue for resources to become available. We provide a dashboard that shows what resources your currently running jobs have consumed so far: https://hpc.grit.ucsb.edu/pun/sys/job-efficiency/. Use this page for guidance on what resources to select for the next run of your job.

Jupyterhub

Python / Jupyterhub jobs are generally single-threaded and use only a single CPU core unless otherwise specified. Common libraries that can enable multi-threading include the following:

NumPy
SciPy
NumExpr
Numba
TensorFlow
PyTorch

If you use one or more of these libraries, please do a trial run with 2-4 cores and verify that the cores are being utilized with the job resource utilization analyzer.
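Before a trial run, it can help to confirm how many cores the session actually received and to cap library thread pools to that number. The following is a minimal sketch, not a site-provided tool: the helper names `available_cores` and `limit_threads` are hypothetical, and it assumes the scheduler may export `SLURM_CPUS_PER_TASK` (it is only set when the job requested CPUs per task).

```python
import os

# Thread-count variables read by common OpenMP/BLAS backends.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def available_cores() -> int:
    """Return the number of CPU cores this process may use (best effort)."""
    # Assumption: Slurm exported the per-task CPU count for this job.
    slurm = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if slurm.isdigit() and int(slurm) > 0:
        return int(slurm)
    # On Linux the affinity mask reflects CPU pinning applied to the process.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def limit_threads(n: int) -> None:
    """Cap library thread pools; call this before importing NumPy etc."""
    for var in THREAD_VARS:
        os.environ[var] = str(n)
```

Calling `limit_threads(available_cores())` at the top of a notebook, before any numerical imports, keeps the thread count in line with the allocation.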

R Studio

R Studio jobs are generally single-threaded unless a specific library or function is called. Some common settings and packages that enable or control multi-threading are listed below:

OMP_NUM_THREADS
MKL_NUM_THREADS
OPENBLAS_NUM_THREADS
RhpcBLASctl
data.table::setDTthreads
future
parallel
foreach; control with worker count (plan(multisession, workers=…), makeCluster(n)).

Cluster Scratch Directory

There is a scratch directory at /home/hpc-scratch. Note that this is not backed up and data that has not been accessed for 30 days will be AUTOMATICALLY DELETED.
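To see which of your scratch files are at risk, you can list files whose last access time is older than the 30-day window. This is a minimal sketch: the function name `stale_files` is hypothetical, and it assumes the cleanup is based on file access time (atime), which the cluster's actual purge job may or may not use.

```python
import os
import time
from typing import List, Optional

SECONDS_PER_DAY = 86400


def stale_files(root: str, days: float = 30,
                now: Optional[float] = None) -> List[str]:
    """List files under root whose last access is older than `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Pass the path of your own directory under /home/hpc-scratch as `root`; a smaller `days` value (for example 25) gives advance warning.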

Checking Past Job Usage on the Cluster

Command

sacct -j <jobid> --units=G \
      --format=JobID,User,State,Elapsed,ReqMem,MaxRSS,MaxVMSize,AllocCPUS,TotalCPU

This command shows details about previous Slurm jobs, including memory usage, runtime, and CPU allocation. You can get the JobID from the Job Resource Utilization window, or from the Active Jobs window (under ID).

Sample output (from a run with a different field list and without --units=G, so MaxRSS is shown in kilobytes):

JobID             User      State    Elapsed  AllocCPUS     ReqMem     MaxRSS     AveCPU ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- --------
6091           bmemery  COMPLETED   06:51:08          1        32G                            0:0
6091.batch              COMPLETED   06:51:08          1              5331960K   05:36:20      0:0

In this sample, 32G was requested and the job used 5331960K (kilobytes), which is about 5G.
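If you want to process these numbers in a script rather than read them by eye, sacct can emit `|`-separated output with its `--parsable2` option. The sketch below is one possible wrapper, not an official tool: the function names are hypothetical, and `job_usage` only works on a machine where `sacct` is installed.

```python
import subprocess
from typing import Dict, List

FIELDS = ["JobID", "User", "State", "Elapsed", "ReqMem",
          "MaxRSS", "MaxVMSize", "AllocCPUS", "TotalCPU"]


def sacct_command(job_id: str) -> List[str]:
    """Build the sacct argument list; --parsable2 gives '|'-separated fields."""
    return ["sacct", "-j", job_id, "--units=G", "--parsable2",
            "--format=" + ",".join(FIELDS)]


def parse_sacct(text: str) -> List[Dict[str, str]]:
    """Turn --parsable2 output (header line plus rows) into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def job_usage(job_id: str) -> List[Dict[str, str]]:
    """Run sacct for one job; requires access to the Slurm accounting tools."""
    out = subprocess.run(sacct_command(job_id), check=True,
                         capture_output=True, text=True).stdout
    return parse_sacct(out)
```

Each returned dict maps a column name to its string value, with empty strings where sacct left a field blank.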

Column Descriptions

JobID

The unique identifier for the job.

You may see additional job step entries:

12345 — the main job

12345.batch — internal batch step

12345.0 — a job step launched by srun

Most users only need to pay attention to the main job ID.
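When handling sacct rows in a script, the step suffix can be separated from the main job ID with a simple split. A minimal sketch follows; the helper name `split_job_id` is hypothetical, and array-job IDs such as 12345_1 are not treated specially.

```python
from typing import Optional, Tuple


def split_job_id(job_id: str) -> Tuple[str, Optional[str]]:
    """Split '12345.batch' into ('12345', 'batch'); main jobs get None."""
    base, sep, step = job_id.partition(".")
    return base, (step if sep else None)
```

Note that memory fields such as MaxRSS are typically reported on the step lines (for example the .batch line), so a script usually has to read those rows as well as the main job row.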

User

The username that submitted the job.

State

The final status of the job.

Common values include:

State	Meaning
COMPLETED	Job finished normally
FAILED	Job exited with an error code
CANCELLED	User or admin cancelled the job
TIMEOUT	Job exceeded its wall time
OUT_OF_MEMORY	Job exceeded its memory request and was terminated

Elapsed

The total wall-clock runtime of the job.
The format may be HH:MM:SS or D-HH:MM:SS.

Examples:

02:34:10 → 2 hours, 34 minutes, 10 seconds

1-00:00:00 → 1 day
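For scripted comparisons it is convenient to convert these strings to seconds. The following is a small sketch for the two formats listed above; the function name is hypothetical, and values with fractional seconds (which some sacct CPU-time fields can contain) are not handled.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert HH:MM:SS or D-HH:MM:SS to whole seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```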

ReqMem

The amount of memory you requested when submitting the job.
This is what Slurm reserved for your job.

Examples:

4G — requested 4 gigabytes

0 — no memory explicitly requested

If your job frequently runs out of memory, increase this value in future submissions.

MaxRSS

The maximum actual memory used by the job, shown in gigabytes (because of --units=G).

Examples:

1.5G — peak memory use was 1.5 GB

12G — peak use was 12 GB

0.00G — minimal usage or incomplete accounting information

Use this value to compare actual usage with your requested memory.
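The comparison can be scripted by converting both values to the same unit. This is a sketch under stated assumptions: the helper names are hypothetical, suffixes are treated as powers of 1024, a value with no suffix is assumed to be in bytes, and a trailing n or c (appended to ReqMem by some older Slurm releases) is ignored.

```python
from typing import Optional

_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def mem_to_gib(value: str) -> float:
    """Convert a Slurm memory string such as '5331960K' or '32G' to GiB."""
    text = value.strip()
    # Older Slurm appends 'n' (per node) or 'c' (per core) to ReqMem.
    if text[-1:] in ("n", "c"):
        text = text[:-1]
    if not text:
        return 0.0
    unit = text[-1].upper()
    if unit in _UNITS:
        return float(text[:-1]) * _UNITS[unit] / _UNITS["G"]
    return float(text) / _UNITS["G"]  # assumption: no suffix means bytes


def memory_fraction(max_rss: str, req_mem: str) -> Optional[float]:
    """Return peak usage as a fraction of the request, or None if no request."""
    requested = mem_to_gib(req_mem)
    if requested <= 0:
        return None
    return mem_to_gib(max_rss) / requested
```

A fraction well below 1 suggests the request could be lowered; a fraction near 1 suggests leaving headroom or raising it.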

MaxVMSize

The maximum virtual memory used.
This includes allocated but not necessarily resident memory.

This is mostly useful for debugging highly memory-intensive applications.
Most users do not need to focus on this field.

AllocCPUS

The number of CPU cores allocated to your job.

Examples:

1 — single-core job

8 — job received 8 cores

This corresponds to values requested via --cpus-per-task, --ntasks, or submission defaults.

TotalCPU

The total CPU time used across all allocated cores.

For example, if a job runs for 1 hour on 4 cores and keeps all cores fully busy:

TotalCPU = 4:00:00

Interpreting this value:

If TotalCPU is close to AllocCPUS × Elapsed, the job is CPU-bound.

If TotalCPU is much smaller, the job spent time idle, waiting for I/O, or was lightly loaded.
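The comparison between TotalCPU and AllocCPUS × Elapsed can be expressed as a single ratio. A minimal sketch, with a hypothetical function name and with all times already converted to seconds:

```python
def cpu_efficiency(total_cpu_s: float, alloc_cpus: int,
                   elapsed_s: float) -> float:
    """Fraction of the allocated core-time that the job actually used."""
    available = alloc_cpus * elapsed_s
    if available <= 0:
        return 0.0  # nothing was allocated or the job never ran
    return total_cpu_s / available
```

For the example above (1 hour on 4 fully busy cores), `cpu_efficiency(4 * 3600, 4, 3600)` gives 1.0; the same job using only one of its four cores would give 0.25.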

Example: Show Jobs From the Last 7 Days

sacct --starttime=$(date -d '-7 days' +%Y-%m-%d) \
      --units=G \
      --format=JobID,User,State,Elapsed,ReqMem,MaxRSS,MaxVMSize,AllocCPUS,TotalCPU
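The same query can be assembled in a script, which avoids depending on the GNU-specific `date -d` syntax. This is a sketch only: the function name `sacct_since` is hypothetical, and the resulting list still has to be passed to something like `subprocess.run` on a machine with sacct installed.

```python
from datetime import date, timedelta
from typing import List, Optional


def sacct_since(days: int = 7, today: Optional[date] = None) -> List[str]:
    """Build sacct arguments listing jobs started in the last `days` days."""
    start = (today or date.today()) - timedelta(days=days)
    return ["sacct", f"--starttime={start:%Y-%m-%d}", "--units=G",
            "--format=JobID,User,State,Elapsed,ReqMem,MaxRSS,"
            "MaxVMSize,AllocCPUS,TotalCPU"]
```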