
Enable autoresume for all Slurm examples

Open sean-smith opened this issue 1 year ago • 1 comments

We should add the following snippet to all Slurm examples so that the --auto-resume=1 flag is passed to srun automatically when the job runs on a HyperPod cluster. The change needs to be tested for every example; see https://github.com/aws-samples/awsome-distributed-training/pull/231 for an example.

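# Default to no extra flag so srun is unchanged on non-HyperPod clusters.
AUTO_RESUME=""
# The presence of /opt/sagemaker_cluster is used to detect a HyperPod cluster.
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected HyperPod cluster, enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi

# Left unquoted on purpose: an empty value expands to no argument at all.
srun ${AUTO_RESUME}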
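
To make the check easier to test across examples, the detection could be wrapped in a small function. This is only a sketch, not what the linked PR does: the function name `auto_resume_flag` and its optional directory argument are hypothetical, and `hostname` is a placeholder for whatever each example actually launches. The command is echoed rather than executed, so the script can be tried on a machine without Slurm.

```shell
#!/bin/sh
# Hypothetical helper: prints the auto-resume flag when the marker directory exists.
# The default path is the one used in the snippet above; the argument exists only
# so the check can be exercised against a temporary directory.
auto_resume_flag() {
    marker_dir="${1:-/opt/sagemaker_cluster}"
    if [ -d "${marker_dir}" ]; then
        echo "--auto-resume=1"
    fi
}

AUTO_RESUME="$(auto_resume_flag)"

# Dry run: print the command instead of launching it.
# ${AUTO_RESUME} is unquoted so an empty value adds no argument.
echo srun ${AUTO_RESUME} hostname
```

Passing a directory explicitly, for example `auto_resume_flag /tmp`, exercises the HyperPod branch on a non-HyperPod machine without creating /opt/sagemaker_cluster.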

sean-smith, Apr 01 '24 01:04

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Jun 30 '24 01:06

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Aug 29 '24 01:08