awsome-distributed-training

Add home path to user creation script

Open · bkulnik-auvaria opened this pull request 2 years ago • 6 comments

Issue #, if available: Adds the ability to specify the user home path in the user creation lifecycle script

Description of changes: In our use case we want the user home directories to live on the shared /fsx volume. This change adapts the script accordingly. The goal is to have an automated version of the procedure described in https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/01-multi-user

This code change does the following (a rough, hedged sketch of the resulting script follows the list):

  • updates the create user script to allow specifying a custom home path (other than /home/user)
  • adds the user to the docker group
  • adds the user to the sudo group
  • creates an SSH key pair in the user's home directory if one doesn't already exist
  • changes the order in lifecycle_script.py so that user creation happens AFTER FSx is mounted and Docker is installed
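
Roughly, the intent of those changes can be sketched as follows. This is a hedged illustration, not the actual PR diff; the variable names (USERNAME, HOME_BASE) and the /fsx/home base path are assumptions:

```bash
#!/bin/bash
# Hedged sketch of the intended create-user behaviour; not the actual PR code.
# USERNAME and HOME_BASE are illustrative. Assumes Docker is already installed
# and the shared volume is mounted, hence the reordering in lifecycle_script.py.
USERNAME="$1"
HOME_BASE="${2:-/fsx/home}"            # custom base path instead of /home
HOME_DIR="${HOME_BASE}/${USERNAME}"

# Create the user with a home directory on the shared volume.
useradd --create-home --home-dir "${HOME_DIR}" --shell /bin/bash "${USERNAME}"

# Add the user to the docker and sudo groups.
usermod -aG docker,sudo "${USERNAME}"

# Create an SSH key pair if one does not already exist.
if [ ! -f "${HOME_DIR}/.ssh/id_rsa" ]; then
  sudo -u "${USERNAME}" mkdir -p -m 700 "${HOME_DIR}/.ssh"
  sudo -u "${USERNAME}" ssh-keygen -t rsa -q -N "" -f "${HOME_DIR}/.ssh/id_rsa"
  sudo -u "${USERNAME}" sh -c "cat '${HOME_DIR}/.ssh/id_rsa.pub' >> '${HOME_DIR}/.ssh/authorized_keys'"
  chmod 600 "${HOME_DIR}/.ssh/authorized_keys"
fi
```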

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

bkulnik-auvaria · Dec 14 '23

Approaching 2 months; shall we close, @bkulnik-auvaria?

perifaws · Feb 06 '24

@perifaws this is an important patch, so if @bkulnik-auvaria can't finish it, we need to take it over and update it.

sean-smith · Feb 06 '24

Hey @sean-smith, I can remove all modifications of the groups (e.g. docker, sudo). Even though it might be convenient to have this automated, it's not crucial. Let me know if I should remove these lines or close the PR.

I actually identified another issue: the home directory of the ubuntu user is not moved automatically. This means that SSH stops working and the new worker nodes can only be reached via srun (which is not ideal, because srun is used by jobs most of the time).
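
For context, moving an existing user's home directory onto the shared volume with standard tooling can look roughly like this (a hedged illustration, not the PR's or the workshop's exact command; the /fsx/ubuntu target path is an assumption):

```bash
# Hedged illustration: relocate the ubuntu user's home onto the shared FSx
# volume so SSH keys and dotfiles are identical on every node.
# Run while the ubuntu user has no active processes on the node.
sudo usermod --move-home --home /fsx/ubuntu ubuntu
```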

bkulnik-auvaria · Feb 08 '24

We fix this as one of the first steps in the workshop: https://catalog.workshops.aws/sagemaker-hyperpod/en-US/01-cluster/08-ssh-compute

sean-smith · Feb 08 '24

Here the issue is that this step is manual, and in case there are already long-running jobs in the queue when the cluster is scaled up, the new nodes are immediately blocked (however, I'm not really familiar with srun, so there might be other ways to still do it).
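
(For reference, reaching a compute node interactively via srun typically looks something like the following; the partition name is purely illustrative:)

```bash
# Hedged example: request an interactive shell on one compute node via Slurm.
srun --partition=dev -N 1 --pty /bin/bash
```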

bkulnik-auvaria · Feb 12 '24

That's true; however, it should be the first step in a cluster setup, before running jobs, because job output won't be synced, and SSH and a few other things won't work until this is complete.

sean-smith · Feb 12 '24

Addressed in https://github.com/aws-samples/awsome-distributed-training/pull/159

sean-smith · May 30 '24