Sean Smith

61 comments by Sean Smith

> > > Hey @sean-smith I can remove all modifications of the groups (e.g. docker, sudo). It might be convenient to have this automatic, but it's not crucial....

Addressed in https://github.com/aws-samples/awsome-distributed-training/pull/159

Does it need to be a PyTorch script or can it just read from `env`? If the latter, we can start from [efa-versions.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/efa-versions.sh) and add recommended versions based on instance...

We can run the command:

```bash
srun -N 16 bash -c 'echo "$(hostname): $(date)"' | sort -k2,3
```

And somehow programmatically make sure the dates are within 1 sec or...
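One way the programmatic check could look, as a rough sketch rather than a settled approach: have each node print epoch seconds (`date +%s`) instead of the human-readable date, then compare the earliest and latest timestamps. The helper name `check_clock_skew` and its threshold argument are hypothetical, not part of any existing script in the repo. Note that the measured spread also includes `srun` task launch jitter, so a small nonzero skew does not necessarily mean the clocks disagree.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): reads "hostname: epoch_seconds"
# lines on stdin and exits non-zero when the spread between the earliest
# and latest timestamp exceeds the allowed skew (default: 1 second).
check_clock_skew() {
  local max_skew="${1:-1}"
  awk -v max="$max_skew" '
    { t = $NF + 0 }                  # last field is the epoch timestamp
    NR == 1 || t < lo { lo = t }     # track earliest timestamp
    NR == 1 || t > hi { hi = t }     # track latest timestamp
    END {
      if (NR == 0) { print "no input"; exit 2 }
      skew = hi - lo
      printf "nodes=%d skew=%ds\n", NR, skew
      exit (skew > max) ? 1 : 0
    }'
}

# Assumed usage on a Slurm cluster (requires srun, so left commented out):
# srun -N 16 bash -c 'echo "$(hostname): $(date +%s)"' | check_clock_skew 1
```

The function returns 0 when all nodes fall within the threshold, 1 when they do not, and 2 when it receives no input, so it can gate a validation script directly.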

I started working on this here: https://gist.github.com/sean-smith/15980ec0a19109e2778f6540005c896c

@m-ali4721 you can't replace the headnode but you can add a login node that'll act as a jump box to the headnode. See instructions on how to do that here:...

This should be:

```bash
PYXIS_VERSION="v0.17.0"
wget https://github.com/NVIDIA/pyxis/archive/refs/tags/${PYXIS_VERSION}.zip
```

Already installed here: https://github.com/aws-samples/aws-parallelcluster-post-install-scripts/blob/31a0d4309c9fa6ffcf7c3a27a354fdc8630f3084/pyxis/postinstall.sh#L84C1-L89

Is this still a draft?