Add home path to user creation script
Issue #, if available: Adds the ability to specify the user home path in the lifecycle script
Description of changes: In our use case we want the user home directories to live on the shared /fsx volume. This change adapts the script to support that. The goal is an automated version of the procedure described in https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/01-multi-user
This code change does the following (an illustrative sketch follows the list):
- updates the create user script to allow specifying a custom home path (other than /home/user)
- adds the user to the docker group
- adds the user to the sudo group
- creates an SSH key pair in the home directory if there isn't already one
- changes the order in `lifecycle_script.py` such that user creation happens AFTER mounting of FSx and installation of Docker
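For illustration, a minimal shell sketch of what the adapted user-creation logic could look like. The variable names and the `/fsx/home` default are placeholders for this sketch, not the exact interface of the script in this PR:

```bash
#!/bin/bash
# Hypothetical sketch of the adapted user-creation step; variable names and the /fsx/home
# default are assumptions for illustration, not the script's actual interface.
USERNAME="$1"
USER_ID="$2"
USER_HOME="${3:-/fsx/home/$USERNAME}"   # custom home path, defaulting to the shared FSx volume

# Create the user with the custom home directory (-m creates it if missing).
sudo useradd -m -d "$USER_HOME" -u "$USER_ID" -s /bin/bash "$USERNAME"

# Convenience group memberships (the part offered for removal in the discussion below).
sudo usermod -aG docker "$USERNAME"
sudo usermod -aG sudo "$USERNAME"

# Create an SSH key pair in the home directory only if there isn't one already.
if [ ! -f "$USER_HOME/.ssh/id_rsa" ]; then
    sudo -u "$USERNAME" mkdir -p "$USER_HOME/.ssh"
    sudo -u "$USERNAME" ssh-keygen -t rsa -q -f "$USER_HOME/.ssh/id_rsa" -N ""
    sudo -u "$USERNAME" bash -c "cat '$USER_HOME/.ssh/id_rsa.pub' >> '$USER_HOME/.ssh/authorized_keys'"
    sudo -u "$USERNAME" chmod 700 "$USER_HOME/.ssh"
    sudo -u "$USERNAME" chmod 600 "$USER_HOME/.ssh/authorized_keys"
fi
```

Because the home directory lives on /fsx and the user is added to the docker group, this step only makes sense after FSx is mounted and Docker is installed, which is the reason for reordering `lifecycle_script.py`.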
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Approaching 2 months; shall we close this, @bkulnik-auvaria?
@perifaws this is an important patch, so if @bkulnik-auvaria can't finish it, we need to take it over and update it.
Hey @sean-smith, I can remove all modifications of the groups (e.g. docker, sudo). It might be convenient to have this automatic, but it's not crucial. Let me know if I should remove these lines or close the PR.
I actually identified another issue: the home directory of the ubuntu user is not moved automatically. This means that SSH stops working and the new worker nodes can only be reached via srun (which is not ideal, because it is used by jobs most of the time).
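For context, a hedged sketch of what "moving" the existing ubuntu user's home directory onto the shared volume could look like; the target path and the branching are assumptions for illustration, not part of this PR:

```bash
# Hypothetical sketch: relocate the existing 'ubuntu' user's home onto the shared FSx volume
# so that ~/.ssh (and thus SSH access) is available consistently. Not part of this PR.
NEW_HOME=/fsx/home/ubuntu   # assumed target path on the shared volume
sudo mkdir -p /fsx/home

if [ -d "$NEW_HOME" ]; then
    # Home already exists on the shared volume (e.g. created from another node): just repoint it.
    sudo usermod -d "$NEW_HOME" ubuntu
else
    # First node: move the current home contents (including ~/.ssh/authorized_keys) to /fsx.
    # usermod refuses to do this while the user is logged in.
    sudo usermod -m -d "$NEW_HOME" ubuntu
fi
```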
We fix this as one of the first steps in the workshop: https://catalog.workshops.aws/sagemaker-hyperpod/en-US/01-cluster/08-ssh-compute
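Something in the spirit of that workshop step, sketched here as a hedged illustration (not the workshop's exact commands): generate a key on the controller and distribute the public key to the compute nodes via srun.

```bash
# Hedged sketch, assuming a single Slurm partition and the same user on all nodes.

# Generate a key pair for the current user if one does not exist yet.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -q -f ~/.ssh/id_rsa -N ""

PUBKEY=$(cat ~/.ssh/id_rsa.pub)
NUM_NODES=$(sinfo -h -o %D | head -n 1)   # node count of the (assumed single) partition

# Append the public key to authorized_keys on every compute node. Because the home directory
# was not moved to the shared volume, this has to run once per node.
srun -N "$NUM_NODES" bash -c \
  "mkdir -p ~/.ssh && chmod 700 ~/.ssh && echo '$PUBKEY' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
```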
Here the issue is that it is manual, and in case there are already long-running jobs in the queue when the cluster is scaled up, the new nodes are immediately blocked (however, I'm not really familiar with srun, so there might be other ways to still do it).
That's true, however it should be the first step in cluster setup, before running jobs, because job output won't be synced and SSH and a few other things won't work until this is complete.
Addressed in https://github.com/aws-samples/awsome-distributed-training/pull/159