
set `device_id` in torch's `init_process_group`

Open · stas00 opened this issue 7 months ago · 2 comments

This PR overcomes the following warning, which is emitted on any torch.distributed call when using DeepSpeed:

```
[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0
to perform barrier as devices used by this process are currently unknown. This can
potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
barrier() to force use of a particular device, or call init_process_group() with a device_id.
```

by setting `device_id` to the correct device corresponding to the `LOCAL_RANK` env var.

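A minimal sketch of the idea (names and structure are illustrative, not the exact diff in this PR):

```python
import os

import torch
import torch.distributed as dist

# LOCAL_RANK is set per process by launchers such as torchrun or the
# deepspeed launcher.
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Passing device_id tells NCCL up front which GPU this rank owns, so
# collectives such as barrier() no longer have to guess the mapping.
dist.init_process_group(backend="nccl", device_id=device)
```

The alternative the warning suggests, `dist.barrier(device_ids=[local_rank])`, only covers that one barrier call, whereas `device_id` fixes the rank-to-GPU mapping for all collectives on the process group.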

Update: discovered that torch.distributed deadlocks with torch>=2.7.0 when the `device_id` arg is used. Switching to draft for now, as we can't merge this until we know how to work around it.

stas00 avatar Apr 30 '25 19:04 stas00

@loadams?

stas00 avatar May 06 '25 18:05 stas00

> @loadams?

Sorry @stas00, I missed this and will review today.

loadams avatar May 07 '25 14:05 loadams

ok, so now we know that setting `device_id` leads to hanging in 2.6.0 < torch < 2.7.1: https://github.com/pytorch/pytorch/issues/153960

so the PR has been adapted to that.

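A sketch of such a version gate (assuming the hang window quoted above; the helper shape is illustrative, not the PR's actual code):

```python
import os

import torch
import torch.distributed as dist
from packaging.version import Version

def init_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)

    kwargs = {}
    # device_id hangs in 2.6.0 < torch < 2.7.1
    # (https://github.com/pytorch/pytorch/issues/153960), so only pass
    # it outside that window.
    torch_version = Version(torch.__version__.split("+")[0])
    if not (Version("2.6.0") < torch_version < Version("2.7.1")):
        kwargs["device_id"] = device

    dist.init_process_group(backend="nccl", **kwargs)
```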
stas00 avatar Jul 16 '25 00:07 stas00