minGPT example does not verify the GPU count across all nodes

Open Michaelvll opened this issue 3 months ago • 0 comments

https://github.com/pytorch/examples/blob/acc295dc7b90714f1bf47f06004fc19a7fe235c4/distributed/minGPT-ddp/mingpt/main.py#L54

The line above checks the GPU count only for the current process, so a job with 2 nodes and 1 GPU per node fails to run. In particular, the Slurm launcher script fails:

https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/sbatch_run.sh#L19
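One possible fix, sketched below, is to base the check on the job-wide worker count rather than the local GPU count. This is only a sketch under assumptions: it assumes the existing check requires more than one local GPU, and that the job is launched with torchrun, which sets the `WORLD_SIZE` environment variable. The helper names `total_worker_count` and `can_run_distributed` are hypothetical and not part of the example; in `main.py` the `local_gpu_count` argument would come from `torch.cuda.device_count()`.

```python
import os


def total_worker_count(local_gpu_count, env=None):
    """Return the job-wide worker count (hypothetical helper).

    Prefers WORLD_SIZE, which torchrun sets for every worker process,
    and falls back to the local GPU count when it is not set.
    """
    env = os.environ if env is None else env
    world_size = env.get("WORLD_SIZE")
    if world_size is not None:
        return int(world_size)
    return local_gpu_count


def can_run_distributed(local_gpu_count, env=None):
    """Check that this node has a GPU and the whole job has more than one worker."""
    if local_gpu_count < 1:
        return False
    return total_worker_count(local_gpu_count, env) > 1


if __name__ == "__main__":
    # 2 nodes with 1 GPU each: one local GPU, but two workers in total.
    print(can_run_distributed(1, {"WORLD_SIZE": "2"}))
```

With a check like this, the 2-node, 1-GPU-per-node case passes because the job has two workers in total, while a single process with a single GPU is still rejected.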

Michaelvll, Oct 02 '25 23:10