Slurm Usage
[[Category:HPC]]
[[Category:UserDocs]]
== Quick Introduction ==
SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler that we use on GRIT HPC systems to allocate resources efficiently. This guide provides some basic information on how to use Slurm and create scripts for submitting jobs to the Slurm queue. Our more administratively oriented docs are at: [[Slurm]].

A queue in Slurm is called a partition. User commands are prefixed with '''s'''.
=== Useful Commands ===
sacct, sbatch, sinfo, sprio, squeue, srun, sshare, sstat, etc.

 sbatch               # sends jobs to the Slurm queue
 sinfo                # general info about Slurm
 squeue               # inspect the queue
 sinfo -lNe           # more detailed info reporting, with long format and nodes listed individually
 scancel 22           # cancel job 22
 scontrol show job 2  # show control info on job 2

Examples:
 # find the queue names:
 [user@computer ~]$ sinfo
 PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
 basic*       up  infinite     1  idle
 # test a job submission (don't run)
 sbatch --test-only slurm_test.sh
 # run a job
 sbatch slurm_test.sh

=== Typical Workflow ===
# Develop your program (e.g. on your computer and a subset of data)
# Update your program for use on HPC (e.g. change data paths if needed, etc.)
# Create a Slurm job file (see below)
# Submit your job to the queue
# Monitor the job status, wait for completion

Steps 3-5 are detailed below.
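As a quick preview of steps 4 and 5, an end-to-end session might look like this (a minimal sketch: the job file name slurm_test.sh and the job ID are just examples, and the log file name depends on your #SBATCH --output setting):

 # step 4: submit the job file
 [user@computer ~]$ sbatch slurm_test.sh
 Submitted batch job 166627
 # step 5: check the job status
 [user@computer ~]$ squeue
 # step 5: watch the job's log file as it runs
 [user@computer ~]$ tail -f my_job-166627.out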
=== Example Slurm job files ===
Slurm job files are written in bash, which is a Linux shell scripting language. Here's an example which uses one CPU on one computer to run a simple job, writing any errors or other output to log files in the same directory. Note that on most GRIT HPC systems the main queue (aka partition in Slurm) is called 'basic'.
 #!/bin/bash
 
 ## SLURM REQUIRED SETTINGS  <--- two hashtags are a comment in Slurm
 #SBATCH --partition=basic
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=1
 
 ## SLURM reads %x as the job name and %j as the job ID
 #SBATCH --output=%x-%j.out
 #SBATCH --error=%x-%j.err
 
 # Output some basic info with the job
 pwd; hostname; date;
 
 # Job to run
 ./my_example_code.bash
Another Example:
 #!/bin/bash
 #
 #SBATCH -p basic            # partition name (aka queue)
 #SBATCH -c 1                # number of cores
 #SBATCH --mem 100           # memory pool for all cores (in MB)
 #SBATCH -t 0-2:00           # time (D-HH:MM)
 #SBATCH -o slurm.%N.%j.out  # STDOUT
 #SBATCH -e slurm.%N.%j.err  # STDERR
 
 # code or script to run
 for i in {1..100000}; do
     echo $RANDOM >> SomeRandomNumbers.txt
 done
 sort SomeRandomNumbers.txt
==== Python Example with Conda ====
The output goes to a file in your home directory called hello-python-*.out, which should contain a message from python.
 #!/bin/bash
 
 ## SLURM REQUIRED SETTINGS
 #SBATCH --mem=1G
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=1
 
 ## SLURM reads %x as the job name and %j as the job ID
 #SBATCH --output=%x-%j.out
 #SBATCH --error=%x-%j.err
 #SBATCH --job-name=hello-python  # create a short name for your job
 #SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
 
 ## Example use of Conda:
 # first source bashrc (with conda.sh), then conda can be used
 source ~/.bashrc
 # make sure conda base is activated
 conda activate
 # Other conda commands go here
 
 ## run python
 python hello.py
hello.py should be something like this:
 print('Hello from python!')
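To try it, save the job file above (for example as hello-python.sh; the name is arbitrary), submit it, and read the output file once the job finishes (the job ID below is made up):

 [user@computer ~]$ sbatch hello-python.sh
 Submitted batch job 12345
 [user@computer ~]$ cat hello-python-12345.out
 Hello from python!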
=== Adding details to Slurm job files ===
These examples are all very simple, so here are some useful commands for adding more complexity, such as more memory, more CPUs, etc. Adding these requires finding out facts about the computer for the job file.

==== Computer Facts ====
Find the number of CPU cores on a computer from the command line:
 [user@computer ~]$ grep 'cpu cores' /proc/cpuinfo | uniq
 cpu cores : 48   <---- an example output

Find out how much memory a computer has:
 [user@computer ~]$ free -h
               total        used        free      shared  buff/cache   available
 Mem:          1.5Ti       780Gi       721Gi       1.5Gi       8.6Gi       721Gi
 Swap:          31Gi          0B        31Gi
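Once you know the computer's limits, you can request more resources in the job file. Here's a minimal sketch asking for 8 cores and 16 GB of memory (the numbers are examples; keep them within what the commands above report for your computer):

 #SBATCH --partition=basic
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=8  # must fit within the CPU core count found above
 #SBATCH --mem=16G          # must fit within the memory found above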
=== nodes vs tasks vs cpus vs cores ===
Here's a very good writeup: https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis. For most of our use cases, one node and one task is all that is needed. (More than this requires special code, such as mpi4py (MPI = Message Passing Interface) or a parallel computing toolbox such as MATLAB's, which uses --cpus-per-task.) To request N cores for a job, put the following in the Slurm job file, replacing N with the number of cores you need:

 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=N
To get the max value for N for a computer (quoting directly from https://login.scg.stanford.edu/faqs/cores/):
 [user@computer ~]$ scontrol show node | grep CPU
    CPUAlloc=20 CPUTot=95 CPULoad=1.00
'CPUTot' is the max value for N. Also useful: https://stackoverflow.com/questions/65603381/slurm-nodes-tasks-cores-and-cpus
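Inside a running job, Slurm sets the environment variable SLURM_CPUS_PER_TASK to the number of cores you requested, which you can pass to your program so it uses exactly what it was given. A minimal sketch (my_threaded_program and its --threads flag are hypothetical):

 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=8
 
 # tell the program how many cores Slurm actually allocated
 ./my_threaded_program --threads=$SLURM_CPUS_PER_TASK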
=== Submitting your job to the queue ===
Find the queue names:
 [user@computer ~]$ sinfo
 PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
 basic*       up  infinite     1  idle
In this case the queue name is 'basic', and it's the default, as indicated by the *.
Assuming you have a Slurm job file named slurm_test.sh:
 # test a job submission (don't run)
 [user@computer ~]$ sbatch --test-only slurm_test.sh
 # run a job
 [user@computer ~]$ sbatch slurm_test.sh
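On a successful submission, sbatch prints the job ID, which you'll need for monitoring or cancelling the job (the ID below is just an example):

 [user@computer ~]$ sbatch slurm_test.sh
 Submitted batch job 166627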
=== Monitoring your job ===
Examples of how this is done:
 [user@computer ~]$ squeue
   JOBID PARTITION     NAME     USER ST   TIME NODES NODELIST(REASON)
  166626     basic  my_code username PD   0:00     1 (Resources)
  166627     basic  my_code username  R   3:04     1 anvil
In the above, 'R' denotes that the job is running, 'PD' denotes that Slurm is waiting for resources.
You can also monitor the output by watching the log files from the command line. This will show the last few lines of the log file and update as the log file changes:
 [user@computer ~]$ tail -f log-file-name.txt
Cancel the job if needed:
 [user@computer ~]$ scancel 22  # cancel job 22
You can get the job number from squeue (e.g. JOBID).
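Two more monitoring variants can help: squeue -u limits the list to your own jobs, and sacct shows accounting info even after a job has finished and left the queue (the job ID is an example):

 # show only your own jobs
 [user@computer ~]$ squeue -u $USER
 # accounting info for a running or finished job
 [user@computer ~]$ sacct -j 166627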
=== Other Useful Commands ===
 sinfo                # general info about Slurm
 sinfo -lNe           # more detailed info reporting, with long format and nodes listed individually
 scontrol show job 2  # show control info on job 2
=== Other References ===
* https://www.carc.usc.edu/user-information/user-guides/hpc-basics/slurm-templates
* https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/
* https://csc.cnsi.ucsb.edu/docs/slurm-job-scheduler
* Python: https://rcpedia.stanford.edu/topicGuides/jobArrayPythonExample.html
=== Finding info for slurm.conf ===
To find the number of CPUs, SocketsPerBoard, and CoresPerSocket on Ubuntu 20, you can use the following commands.
To find the number of CPUs:
 grep -c ^processor /proc/cpuinfo
To find the number of sockets per board:
 sudo dmidecode -t 4 | grep "Socket Designation" | awk -F: '{print $2}' | uniq | wc -l
This command will output the number of unique socket designations found in the dmidecode output, which should correspond to the actual number of physical sockets on your motherboard.
To find the number of cores per socket (CoresPerSocket):
 lscpu | grep "Core(s) per socket" | awk '{print $4}'
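These values feed into the node definition line in slurm.conf. As a rough sketch, a node line built from such values might look like this (the node name and numbers are made up; substitute your own output from the commands above):

 NodeName=node01 CPUs=48 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=192000 State=UNKNOWN

Note that CPUs should equal SocketsPerBoard x CoresPerSocket x ThreadsPerCore.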