2. Slurm Job Scheduler

All jobs run on the cluster must be submitted through the Slurm job scheduler.

Slurm’s purpose is to fairly and efficiently allocate the compute resources available on the cluster. It is imperative that you run your jobs on the compute nodes by submitting them to the job scheduler with either sbatch or srun.

Any resource-intensive jobs found running on the login nodes are subject to being killed without warning.

Running Jobs

There are two ways to submit jobs to the cluster:

  • Submitting a script with sbatch
  • Interactively with srun
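
A minimal batch script might look like the following sketch. The script name, partition, and resource values here are example placeholders; adjust them for your own jobs and cluster:

     #!/bin/bash
     #SBATCH --job-name=myjob             # job name shown in squeue
     #SBATCH --partition=<partitionname>  # partition to submit to (see sinfo)
     #SBATCH --nodes=1                    # number of nodes
     #SBATCH --ntasks=1                   # number of tasks
     #SBATCH --cpus-per-task=4            # CPU cores per task
     #SBATCH --mem=8G                     # memory per node
     #SBATCH --time=01:00:00              # walltime limit (HH:MM:SS)
     #SBATCH --output=%x-%j.out           # output file (%x = job name, %j = job ID)

     ./my_program                         # replace with your actual workload

Submit the script with sbatch myjob.sh; Slurm prints the assigned job ID. For an interactive session, a command such as srun --pty -p <partitionname> --time=01:00:00 bash requests a shell on a compute node.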

Checking on Jobs

There are a few commands that will give information on pending, running, and completed jobs.

  • sinfo - view information about Slurm nodes and partitions.

  • scontrol - view Slurm configuration and state.

  • squeue - view information about jobs located in the Slurm scheduling queue.

  • sacct - view accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.

Useful Slurm Commands

  • See which partitions you can submit to

     sinfo 

  • See all of the nodes in a certain partition. Nodes listed as down and/or drained cannot be used to run jobs.

     sinfo -p <partitionname> 

  • See information about a job currently in the queue.

     scontrol show jobid <jobid> 

  • See information about node(s), useful when combined with scontrol show job.

     scontrol show node <nodelist> 
    Note: You can pass either a single node or a nodelist (node[001-010]) to this.

  • See information about a partition, useful for finding out things like the maximum walltime (MaxTime) or the number of nodes in a partition.

     scontrol show partition <partitionname> 

  • Look up the status of a job by job ID

     squeue -j <jobid> 

  • See the status of your jobs in the queue

     squeue -u <username> 

  • See when Slurm estimates your job will start, along with the reason it is waiting

     squeue -u <username> --start

    Possible reasons your job is pending:

    • (Resources) - The job is waiting for compute nodes to become available.
    • (Priority) - Jobs with a higher priority score are ahead of yours in the queue.
    • (ReqNodeNotAvail) - The compute nodes requested by the job are not available. This can happen when the cluster is under maintenance, nodes are down or offline, the scheduler is backed up with too many jobs, or the configuration you requested does not exist.
  • Get some information on a specific job that has completed.

    sacct -j <jobid> --format=JobID,JobName,Elapsed,State

  • Get information on your jobs that have completed today. You can use the -S flag to specify a start time and see stats on jobs from further back.

    sacct -u <username> --format=JobID,JobName,Elapsed,State  
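
    For example, to list your jobs since a given date (the date below is illustrative; -S accepts formats such as YYYY-MM-DD):

    sacct -u <username> -S 2024-06-01 --format=JobID,JobName,Elapsed,State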

Further documentation on Slurm commands can be found in each command's man page (e.g. man sbatch, man squeue).

If you have any questions about using Slurm, please email orcd-help-engaging@mit.edu