aws-eda-slurm-cluster
Documentation corrections required on deploy-parallel-cluster documentation page
On the page https://aws-samples.github.io/aws-eda-slurm-cluster/deploy-parallel-cluster/ I ran into three issues:
- The Create users_groups.json section has a duplicate of the table used later in "Configure submission hosts to use the cluster". It doesn't belong here.
- The Description for the Config Stack Output states that Command01SubmitterMountHeadNode "adds it to /etc/fstab". It does not. The command just mounts the file system (see the verification sketch after this list):

  ```
  head_ip=head_node.<clusterName>.pcluster && sudo mkdir -p /opt/slurm/<clusterName> && sudo mount $head_ip:/opt/slurm /opt/slurm/<clusterName>
  ```

  (I've replaced my cluster name with <clusterName>.)
- After I ran the ansible playbook, I tried to load the module as specified in "Run Your First Job". This did not work:

  ```
  $ module load <clusterName>
  ERROR: Unable to locate a modulefile for '<clusterName>'
  ```

  I had to log out and log back in to get my environment set correctly to allow the module to be loaded and do its magic. (A sketch of a logout-free workaround follows this list.)
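Regarding the second issue, here is a quick way to confirm that the command creates the mount without persisting it. These are standard Linux tools, and <clusterName> is a placeholder:

```
# Show the active NFS mount created by the command
mount | grep /opt/slurm/<clusterName>

# Show that no matching /etc/fstab entry exists yet
grep /opt/slurm/<clusterName> /etc/fstab || echo "no fstab entry"
```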
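Regarding the third issue, a workaround that avoids a full logout is to replace the current shell with a fresh login shell (or re-source your shell config) so the profile scripts that set up MODULEPATH run again. This is a generic sketch, not taken from the project docs:

```
# Start a new login shell so profile scripts re-run and MODULEPATH is refreshed
exec bash -l

# The modulefile should now be found
module load <clusterName>
```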
For item 2 above, I have found that running the ansible playbook adds the mount to my /etc/fstab. So you just have the comment in the wrong Description section...
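For illustration, persisting the mount means the playbook has to add an NFS entry along these lines to /etc/fstab. The exact mount options here are my assumption, not copied from the playbook:

```
# Hypothetical /etc/fstab entry for the head node's /opt/slurm export
head_node.<clusterName>.pcluster:/opt/slurm  /opt/slurm/<clusterName>  nfs  defaults  0  0
```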
The tables aren't duplicates; only the first item is the same. However, the name of the command is misleading, so I renamed it from Command01_SubmitterMountHeadNode to Command01_MountHeadNodeNfs.
I updated the description for the 1st and 2nd commands in both tables. The /etc/fstab update occurs in the 2nd step when the ansible playbook is run.
When the new modulefile is created, you need to start a new shell to refresh the environment or source the shell config again.
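To confirm the refreshed environment picked up the new modulefile, you can check MODULEPATH and the module listing; the directory you should expect to see is whatever the playbook configures, which I haven't reproduced here:

```
# MODULEPATH should now include the cluster's modulefile directory
echo $MODULEPATH

# The cluster module should appear in the listing
module avail
```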