awsome-distributed-training
Enable autoresume for all Slurm examples
We should add the following snippet to all Slurm examples so that, when a script runs on a HyperPod cluster, it automatically passes the --auto-resume=1 flag to srun. This needs to be tested for every example; see https://github.com/aws-samples/awsome-distributed-training/pull/231 for an example.
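# Default to no extra flag; only HyperPod clusters get --auto-resume=1.
AUTO_RESUME=""
# /opt/sagemaker_cluster is used here as the marker for a HyperPod cluster.
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected HyperPod cluster, enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi
# Left unquoted on purpose so that an empty value adds no argument to srun.
srun ${AUTO_RESUME}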
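To make the check easier to test across examples, the detection could be wrapped in a small function. This is only a sketch, not what the linked PR does: the function name auto_resume_flag and its optional directory argument are hypothetical and exist only so the logic can be exercised on a machine that is not a HyperPod cluster. The srun call is printed rather than executed, and the training command is left as a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: prints --auto-resume=1 when the marker directory exists,
# and prints nothing otherwise. The default path comes from the snippet above;
# the optional first argument is an assumption added for testing.
auto_resume_flag() {
    marker_dir="${1:-/opt/sagemaker_cluster}"
    if [ -d "$marker_dir" ]; then
        printf '%s\n' "--auto-resume=1"
    fi
}

AUTO_RESUME="$(auto_resume_flag)"

# Print the command instead of running it; in a real example, replace this
# line with the actual srun invocation and keep ${AUTO_RESUME} unquoted.
echo "would run: srun ${AUTO_RESUME} <training command>"
```

On a non-HyperPod machine the flag stays empty, so the printed command contains no --auto-resume option and existing examples behave as before.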
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.