
'default' should not be allowed as a partition name in slurm.yaml

Open pansapiens opened this issue 6 years ago • 1 comments

Problem Description

If default is used as a single partition name in slurm.yaml (under elastic_partitions:), the slurmctld controller fails to start. /var/log/slurm/slurmctld.log suggests that the PartitionName in slurm.conf is missing/invalid.

It turns out that default is not a valid partition name: it is reserved for setting defaults for all partitions (e.g. https://www.mail-archive.com/[email protected]/msg08392.html).

Batch Shipyard Version

3.9.0

Steps to Reproduce

1. Take a working configuration with a single Batch Pool and a single partition, and rename the partition under elastic_partitions in slurm.yaml to default.
2. Provision the cluster and attempt to run an sbatch job, which should fail.
3. Log in to the controller node (shipyard slurm ssh controller) and confirm that slurmctld isn't running.
4. Check the log with sudo tail /var/log/slurm/slurmctld.log

Expected Results

Expect shipyard cluster create to fail fast during schema validation if there are partitions named default.
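
A minimal sketch of the kind of check this would need, assuming the slurm.yaml contents have already been loaded into a dict that holds elastic_partitions as a mapping keyed by partition name. The function names, the reserved-name set, and the case-insensitive comparison are assumptions for illustration, not Batch Shipyard's actual validation code:

```python
from typing import Any, Dict, List

# Assumption: "default" is treated as reserved regardless of case, because
# slurm.conf uses PartitionName=DEFAULT to set defaults for all partitions.
RESERVED_PARTITION_NAMES = frozenset({"default"})


def find_reserved_partition_names(slurm_options: Dict[str, Any]) -> List[str]:
    """Return the partition names under elastic_partitions that are reserved."""
    partitions = slurm_options.get("elastic_partitions") or {}
    return [
        name for name in partitions
        if str(name).lower() in RESERVED_PARTITION_NAMES
    ]


def validate_partition_names(slurm_options: Dict[str, Any]) -> None:
    """Raise ValueError before provisioning if a reserved name is used."""
    reserved = find_reserved_partition_names(slurm_options)
    if reserved:
        raise ValueError(
            "reserved partition name(s) in elastic_partitions: "
            + ", ".join(reserved)
        )
```

Calling validate_partition_names before any resources are created would surface the problem at configuration time instead of leaving a provisioned cluster where slurmctld cannot start.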

Actual Results

Cluster appears to provision successfully but slurmctld fails to start.

pansapiens, Dec 19 '19 03:12

Thanks for the issue report, this will be fixed in the next release.

alfpark, Jan 06 '20 16:01