Slurm Workload Manager¶
Note
Slurm support is in beta. Feel free to use it, but we suggest contacting our support before putting anything into production.
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system. Slurm consists of various services which are represented by separate Flying Circus roles documented below.
The remainder of this documentation assumes that you are aware of the basics of Slurm and understand the general terminology.
We provide version 23.02.x.x of Slurm.
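You can check which version is actually installed on a machine with one of the standard Slurm client tools, for example:
sinfo --version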
Basic architecture and roles¶
Warning
Keep in mind that Slurm is built to execute arbitrary commands on any Slurm node with the permissions of the user who started the command, possibly from another machine. Keep sensitive data away from Slurm nodes and isolate Slurm as much as possible, for example by using a dedicated resource group without other applications.
You can run one Slurm cluster per resource group. We generally recommend using separate clusters (and thus separate resource groups) for independent projects. This gives you the most flexibility and integrates optimally into our platform, aligning well with topics like access management, monitoring, SLAs, and maintenance.
A resource group with Slurm roles can contain additional machines that provide services needed to run jobs in Slurm. Such machines can also be included in the coordination of automated maintenance, as described later.
Machine authentication is handled by munge, using a shared secret generated by our central management directory. New worker nodes are automatically added to existing clusters.
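If you want to verify that munge works on a machine, the standard munge tools offer a simple local round-trip test (this is generic munge usage, not specific to our platform):
munge -n | unmunge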
slurm-controller¶
Note
For new clusters, it’s recommended to first set up a controller and add nodes after that. The controller service will only start if there’s at least one node.
Note
For autoconfiguration, all Slurm machines must have the same amount of memory and the same number of CPU cores. If that's not the case, memory and CPU cores must be set manually. See the Configuration reference on how to do that.
This role runs slurmctld. We add basic cluster readiness monitoring via Sensu and telemetry via Telegraf, which can be ingested by a Statshost and displayed on a Grafana dashboard.
At the moment, we only support exactly one controller per cluster.
Maintenance of a machine with this role means that all worker nodes are drained and set to down first. Maintenance activities only start when no jobs are running anymore in the whole cluster.
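To see which jobs are still running cluster-wide (and thus blocking the start of maintenance), you can use a standard Slurm command such as:
squeue -t RUNNING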
After finishing a platform management task run (which happens every 10 minutes), the controller sets all nodes back to ready that have been set to down by an automated maintenance, provided that the nodes and all external dependency machines are not in maintenance.
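To see which nodes are currently down or drained, and the reason recorded for them, a standard Slurm command such as the following can be used:
sinfo -R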
slurm-dbdserver¶
Note
At the moment, this role must run on the same machine as slurm-controller.
Runs slurmdbd, which is needed for job accounting. Automatically sets up a MySQL database with our platform defaults and monitoring/telemetry.
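Once accounting is in place, job history can be queried with the standard Slurm accounting tools; the format fields below are just an illustration:
sudo -u slurm sacct -a --format=JobID,JobName,User,State,Elapsed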
slurm-node¶
Runs slurmd, which is responsible for processing jobs. There should be multiple nodes in your cluster for production use, but applying this role to a machine that also runs the controller services is supported for testing purposes.
Nodes must be ready to accept jobs. The corresponding Slurm states are IDLE when the node is not currently processing jobs, and MIXED or ALLOCATED when some or all of its cores are in use.
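To inspect the current state of the nodes in more detail, standard Slurm commands such as the following can be used:
sinfo -N -l
scontrol show node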
Before running maintenance activities, the node is drained and stops accepting new jobs. Nodes don't set themselves back to ready after maintenance. Instead, the controller activates nodes that are no longer in maintenance after its own platform management task run (every 10 minutes).
Warning
Nodes that had an unexpected reboot or have been drained/downed manually are not set to ready automatically by the platform management task. You have to do that manually using one of the ready subcommands described in Managing clusters with the fc-slurm command.
slurm-external-dependency¶
This role does not provide any Slurm services itself, but something that is needed to run jobs via Slurm, for example a database accessed by job scripts. When such machines go into maintenance, all nodes are drained first, like for a controller maintenance. After the external dependency machine has finished maintenance, the next run of the platform management task on the controller will set the nodes back to ready.
Cluster interaction using Slurm commands¶
The usual Slurm commands are installed globally on every Slurm machine.
In general, all users can run Slurm commands on all machines with a slurm-* role. Some commands require sudo -u slurm to run them as the slurm user. This is allowed, without a password, for (human) user accounts with the sudo-srv permission.
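For example, querying the accounting database with sacctmgr is typically run as the slurm user; the exact subcommand here is just an illustration:
sudo -u slurm sacctmgr show cluster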
Use slurm-readme to show dynamically generated documentation specific to this machine.
Managing clusters with the fc-slurm command¶
Use fc-slurm to manage the state of slurm compute nodes and display status information about the cluster.
This command is also used by our platform management task before and after maintenance, as well as to fetch telemetry data from Slurm and to run monitoring checks.
Some subcommands that modify state require sudo. This is allowed, without a password, for (human) user accounts with the sudo-srv permission.
The output and availability of subcommands depend on the role of the machine.
Global Node Management¶
The fc-slurm all-nodes subcommand can be run on every machine with a Slurm role and operates on all nodes in the cluster.
Mark all nodes as ready:
sudo fc-slurm all-nodes ready
This is needed when nodes are out because they had an unexpected reboot or have been drained/downed manually.
Note
all-nodes ready skips nodes that are still in maintenance.
You can restrict the affected nodes by specifying a reason. A node is only set to ready if its reason for being in a down state contains the given string:
sudo fc-slurm all-nodes ready --reason-must-match "my node maintenance"
Drain all nodes (no new jobs allowed) and set them to down afterwards:
sudo fc-slurm all-nodes drain-and-down --reason "my global maintenance"
Dump node state info as JSON:
fc-slurm all-nodes state
Single Node Management¶
Manage the state of nodes individually by running fc-slurm directly on the node:
sudo fc-slurm drain-and-down --reason "my node maintenance"
sudo fc-slurm ready
Check the state of the node, also used by the slurm Sensu check:
fc-slurm check
Controller Management¶
Controllers don't have management commands that affect their state at the moment, but you can run fc-slurm all-nodes on controller machines or look at the check output.
Check the state of the controller and all nodes, also used by the slurm Sensu check:
fc-slurm check
Command Cheat sheet¶
Set all nodes to ready:
sudo fc-slurm all-nodes ready
View the dynamically-generated documentation for a machine:
slurm-readme
Show the current configuration:
slurm-show-configuration
Show running/pending jobs:
squeue
Show partition state:
sinfo
Show node info:
sinfo -N
Find jobs with high RAM consumption for all users (with custom output format):
sudo -u slurm sacct --format="MaxRSS,State,JobID,JobName,User,CPUTime" -a | awk 'NR <= 2; NR > 2 {print $0 | "sort -n"}'
Cancel jobs with a given job name:
sudo -u slurm scancel -n jobname
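Cancel all jobs of a specific user (username is a placeholder here; -u is a standard scancel option):
sudo -u slurm scancel -u username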
Known limitations¶
For autoconfiguration, all nodes and the controller must have the same amount of memory and CPU cores. If that’s not the case, memory and CPU must be set manually via Nix config to the same value on all Slurm machines because Slurm expects the config file to be the same everywhere.
The slurm-dbdserver and slurm-controller roles must be on the same machine.
We support only one slurm-controller per cluster at the moment.
Configuration reference¶
Warning
Memory and CPU cores must be set to the same value on all Slurm machines because Slurm expects the config file to be the same everywhere.
This also applies to machines that don’t have the slurm-node
role
even if the memory and CPU settings have no effect there.
flyingcircus.slurm.realMemory
Memory in MiB used by a slurm compute node.
flyingcircus.slurm.cpus
Number of CPU cores used by a slurm compute node.
flyingcircus.slurm.clusterName
Name of the cluster. Defaults to the name of the resource group.
The cluster name is used in various places like state files or accounting table names and should normally stay unchanged. Changing this requires manual intervention in the state dir or slurmctld will not start anymore!
flyingcircus.slurm.partitionName
Name of the default partition which includes the machines defined via the nodes
option.
Don’t use default
as partition name, it will fail!
flyingcircus.slurm.accountingStorageEnforce
This controls what level of association-based enforcement to impose on job submissions. Valid options are any combination of associations, limits, nojobs, nosteps, qos, safe, and wckeys, or all for all things (except nojobs and nosteps, which must be requested as well). If limits, qos, or wckeys are set, associations will automatically be set.
By setting associations, no new job is allowed to run unless a corresponding association exists in the system. If limits are enforced, users can be limited by association to whatever job size or run time limits are defined.
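If enforcement is enabled, the associations known to the accounting database can be inspected with a standard command such as:
sudo -u slurm sacctmgr show associations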
flyingcircus.slurm.nodes
Names of the nodes that are added to the automatically generated partition. By default, all Slurm nodes in a resource group are part of the partition called partitionName.
services.slurm.extraConfig
Extra configuration options that will be added verbatim at the end of the slurm configuration file.
services.slurm.dbdserver.extraConfig
Extra configuration for slurmdbd.conf.
See also: slurmdbd.conf(8).
Example custom local config¶
{ ... }:
{
  flyingcircus.slurm = {
    accountingStorageEnforce = true;
    partitionName = "processing";
    realMemory = 62000;
    cpus = 16;
  };

  services.slurm.extraConfig = ''
    AccountingStorageEnforce=associations
  '';
}