Slurm Usage
[[Category:HPC]] [[Category:UserDocs]]
== Quick Introduction ==
Our more administration-oriented docs are at [[Slurm]].
A queue in Slurm is called a partition. User commands are prefixed with '''s'''.
=== Useful Commands ===
- sacct, sbatch, sinfo, sprio, squeue, srun, sshare, sstat, etc.
 sbatch               # sends jobs to the Slurm queue
 sinfo                # general info about Slurm
 squeue               # inspect the queue
 sinfo -lNe           # more detailed info, long format, nodes listed individually
 scancel 22           # cancel job 22
 scontrol show job 2  # show control info on job 2
Examples:
 # find the queue (partition) names:
 [user@computer ~]$ sinfo
 PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
 basic*    up    infinite  1     idle
 # test a job submission (don't run)
 sbatch --test-only slurm_test.sh
 # run a job
 sbatch slurm_test.sh
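If a script needs the partition names, they can be pulled out of the sinfo listing. The following is only a sketch: partition_names is a hypothetical helper name, and the sample text (including the node name node01) is made up for illustration.

```shell
# partition_names: hypothetical helper. Reads sinfo-style text on stdin
# and prints each partition name once, without the '*' default marker.
partition_names() {
    awk 'NR > 1 { sub(/\*$/, "", $1); print $1 }' | sort -u
}

# Made-up sample in the same layout as the sinfo listing;
# node01 is an assumed node name.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
basic*    up    infinite  1     idle  node01'

printf '%s\n' "$sample" | partition_names

# On a machine with Slurm installed, the helper can read live output.
if command -v sinfo >/dev/null 2>&1; then
    sinfo | partition_names
fi
```

The name printed is the value to pass to sbatch with --partition.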
=== Example Slurm job file ===
 #!/bin/bash
 ## SLURM REQUIRED SETTINGS
 #SBATCH --partition=basic
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=1
 ## SLURM reads %x as the job name and %j as the job ID
 #SBATCH --output=%x-%j.out
 #SBATCH --error=%x-%j.err
 # Output some basic info with job
 pwd; hostname; date;
 # requires ED2_HOME env var to be set
 cd $ED2_HOME/run
 # Job to run
 ./ed2
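To see which file names the %x and %j patterns produce, the substitution can be imitated in plain shell. This is an illustration only: Slurm does the expansion itself, expand_pattern is a hypothetical helper, and the job name and job ID below are made-up values.

```shell
# expand_pattern: hypothetical helper that mimics how Slurm fills in
# %x (job name) and %j (job ID) in --output/--error file names.
# It is a sketch and does not handle names containing '/' or '&'.
expand_pattern() {
    pattern=$1 job_name=$2 job_id=$3
    printf '%s\n' "$pattern" | sed -e "s/%x/$job_name/g" -e "s/%j/$job_id/g"
}

# Made-up job name and job ID.
expand_pattern '%x-%j.out' hello-python 1234   # prints hello-python-1234.out
expand_pattern '%x-%j.err' hello-python 1234   # prints hello-python-1234.err
```

When no --job-name is given, Slurm uses the batch script's file name as the job name, so that is what %x would expand to.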
Another Example:
 #!/bin/bash
 #
 #SBATCH -p basic             # partition name (aka queue)
 #SBATCH -c 1                 # number of cores
 #SBATCH --mem 100            # memory pool for all cores (MB)
 #SBATCH -t 0-2:00            # time (D-HH:MM)
 #SBATCH -o slurm.%N.%j.out   # STDOUT
 #SBATCH -e slurm.%N.%j.err   # STDERR
 # code or script to run
 for i in {1..100000}; do
   echo $RANDOM >> SomeRandomNumbers.txt
 done
 sort SomeRandomNumbers.txt
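The D-HH:MM time format is easy to misread, so it can help to convert a limit to minutes before submitting. A minimal sketch, assuming the limit is always written with all three parts (time_to_minutes is a hypothetical helper name):

```shell
# time_to_minutes: hypothetical helper converting a D-HH:MM time limit
# into minutes. awk splits on '-' and ':' and reads "00" as decimal 0.
time_to_minutes() {
    printf '%s\n' "$1" | awk -F'[-:]' '{ print $1 * 1440 + $2 * 60 + $3 }'
}

time_to_minutes 0-2:00    # prints 120 (the two-hour limit used in the job file)
time_to_minutes 1-12:30   # prints 2190
```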
==== Python Example ====
The output goes to a file named hello-python-*.out in the directory you submitted the job from (your home directory if you submit from there). It should contain a message from Python.
 #!/bin/bash
 ## SLURM REQUIRED SETTINGS
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=1
 ## SLURM reads %x as the job name and %j as the job ID
 #SBATCH --output=%x-%j.out
 #SBATCH --error=%x-%j.err
 #SBATCH --job-name=hello-python   # create a short name for your job
 #SBATCH --time=00:01:00           # total run time limit (HH:MM:SS)
 ## Example use of Conda:
 # first source bashrc (with conda.sh), then conda can be used
 source ~/.bashrc
 # make sure conda base is activated
 conda activate
 # Other conda commands go here
 ## run python
 python hello.py
hello.py should be something like this:
 print('Hello from python!')
=== Computer Facts ===
Find out facts about the computer to use in the job file:
 # number of cores?
 grep 'cpu cores' /proc/cpuinfo | uniq
 # memory
 [emery@bellows ~]$ free -h
               total        used        free      shared  buff/cache   available
 Mem:          1.5Ti       780Gi       721Gi       1.5Gi       8.6Gi       721Gi
 Swap:          31Gi          0B        31Gi
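Since --mem takes megabytes by default, it can be handy to read the total memory as a plain number of MB. A minimal sketch, assuming a Linux machine with /proc/meminfo (mem_total_mb is a hypothetical helper name):

```shell
# mem_total_mb: hypothetical helper that prints total memory in MB
# (the kB value divided by 1024), read from a meminfo-style file.
# With no argument it reads /proc/meminfo.
mem_total_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Only runs where /proc/meminfo exists (Linux).
if [ -r /proc/meminfo ]; then
    mem_total_mb
fi
```

A job should request less than this total, because the operating system needs some memory for itself.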
=== nodes vs tasks vs cpus vs cores ===
Here's a very good writeup: https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis. For most of our use cases, one node and one task is all that is needed. More than this requires special code such as mpi4py (MPI = Message Passing Interface).
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=N
is the correct way to request N cores for a job. Just replace N in that config with the number of cores you need.
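Requesting N cores only helps if the program actually uses them. Inside a job, Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given, so a job file can pass that number on. The sketch below assumes an OpenMP-style program that honours OMP_NUM_THREADS, which may not apply to your code; threads_for_job is a hypothetical helper name.

```shell
# threads_for_job: hypothetical helper. Prints the core count Slurm
# granted via --cpus-per-task, or 1 when run outside a Slurm job.
threads_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Assumption: the program reads OMP_NUM_THREADS (true for OpenMP codes).
OMP_NUM_THREADS=$(threads_for_job)
export OMP_NUM_THREADS
echo "running with $OMP_NUM_THREADS thread(s)"
```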
To get the max value for N for a computer:
 scontrol show node | grep CPU
The output includes a CPUTot field, which is the total number of CPUs on that node.
Quoting directly from: https://login.scg.stanford.edu/faqs/cores/

Also useful: https://stackoverflow.com/questions/65603381/slurm-nodes-tasks-cores-and-cpus
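To pick the CPUTot number out of the scontrol show node output without reading it by eye, something like the following could be used. It is a sketch: max_cpus is a hypothetical helper name, and the sample line is made up in the style of scontrol output, whose exact fields vary between Slurm versions.

```shell
# max_cpus: hypothetical helper. Reads `scontrol show node` text on
# stdin and prints the largest CPUTot value found across all nodes.
max_cpus() {
    grep -o 'CPUTot=[0-9]*' | cut -d= -f2 | sort -n | tail -n 1
}

# Made-up sample line; real output has more fields.
sample='   CPUAlloc=0 CPUTot=64 CPULoad=0.01'
printf '%s\n' "$sample" | max_cpus   # prints 64

# On a machine with Slurm installed, read the live node list.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show node | max_cpus
fi
```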
=== See Also ===
https://www.carc.usc.edu/user-information/user-guides/hpc-basics/slurm-templates
https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/
https://csc.cnsi.ucsb.edu/docs/slurm-job-scheduler
Python: https://rcpedia.stanford.edu/topicGuides/jobArrayPythonExample.html
=== Finding info for slurm.conf ===
To find the number of CPUs, SocketsPerBoard, and CoresPerSocket on Ubuntu 20, you can use the following commands:

- To find the number of CPUs:
 grep -c ^processor /proc/cpuinfo
- To find the number of sockets per board:
 sudo dmidecode -t 4 | grep "Socket Designation" | awk -F: '{print $2}' | uniq | wc -l
This command outputs the number of unique socket designations found in the dmidecode output, which should correspond to the number of physical sockets on your motherboard.

- To find the number of cores per socket:
 lscpu | grep "Core(s) per socket" | awk '{print $4}'
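Once the three numbers are known, they can be assembled into a node definition line for slurm.conf. The sketch below is an assumption-laden illustration: node_line is a hypothetical helper, node01 and the numbers are made up, the machine is assumed to have a single board, and ThreadsPerCore (not covered by the commands here) is derived as CPUs divided by total cores.

```shell
# node_line: hypothetical helper that assembles a slurm.conf NodeName line.
# Arguments: node name, total CPUs, sockets per board, cores per socket.
node_line() {
    name=$1 cpus=$2 sockets=$3 cores=$4
    # Logical CPUs = sockets * cores per socket * threads per core,
    # so threads per core is recovered by integer division.
    threads=$(( cpus / (sockets * cores) ))
    echo "NodeName=$name CPUs=$cpus SocketsPerBoard=$sockets CoresPerSocket=$cores ThreadsPerCore=$threads"
}

# Made-up values: 32 logical CPUs on 2 sockets with 8 cores each.
node_line node01 32 2 8
```

Check the resulting line against the slurm.conf documentation for your Slurm version before using it.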