minGPT example does not verify the GPU count across all nodes
https://github.com/pytorch/examples/blob/acc295dc7b90714f1bf47f06004fc19a7fe235c4/distributed/minGPT-ddp/mingpt/main.py#L54
The line above checks the GPU count only for the current process, so a job on 2 nodes with 1 GPU per node fails to run. In particular, the Slurm launcher script fails:
https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/sbatch_run.sh#L19
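One possible direction, as a rough sketch rather than a tested patch: base the check on the total number of workers in the job instead of the GPUs visible locally. `torchrun` exports `WORLD_SIZE` to every worker process, so the check can read that value. The helper names `total_workers` and `can_run_distributed`, and the `min_workers=2` threshold, are hypothetical and are only my assumption about what the existing check is meant to enforce.

```python
import os


def total_workers(local_gpu_count, env=None):
    """Return the number of worker processes in the whole job.

    torchrun sets WORLD_SIZE for every worker. When it is absent
    (e.g. a plain `python main.py` run), fall back to the local GPU count.
    """
    env = os.environ if env is None else env
    world_size = env.get("WORLD_SIZE")
    if world_size is not None:
        return int(world_size)
    return local_gpu_count


def can_run_distributed(local_gpu_count, env=None, min_workers=2):
    # Hypothetical threshold: assumes the example needs at least 2 workers.
    return total_workers(local_gpu_count, env) >= min_workers
```

In `main.py` this could be called as `can_run_distributed(torch.cuda.device_count())`, which would accept the 2 nodes x 1 GPU case when launched through `torchrun`.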