DeepSpeed
set `device_id` in torch's `init_process_group`
This PR overcomes the following issue, which appears on any `torch.distributed` call with DeepSpeed:
```
[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0
to perform barrier as devices used by this process are currently unknown. This can
potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
barrier() to force use of a particular device, or call init_process_group() with a device_id.
```
It does so by setting `device_id` to the device corresponding to the `LOCAL_RANK` env var.
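For context, the gist of the change is roughly this (a minimal sketch, not DeepSpeed's actual code; `init_distributed` is just an illustrative wrapper):

```python
import os

import torch
import torch.distributed as dist


def init_distributed(backend: str = "nccl") -> None:
    # Bind this process to the GPU given by the launcher's LOCAL_RANK, then
    # pass that device to init_process_group via device_id so collectives
    # like barrier() know the rank-to-GPU mapping up front.
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)
    dist.init_process_group(backend=backend, device_id=device)
```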
Update: discovered that torch.distributed deadlocks with torch>=2.7.0 when using the `device_id` arg - switching to draft for now, as we can't commit this until we know how to work around it.
@loadams?
Sorry @stas00, I missed this and will review today.
ok, so now we know setting `device_id` leads to hanging in 2.6.0 < torch < 2.7.1 (https://github.com/pytorch/pytorch/issues/153960), so I've adapted the PR to work around that.
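Presumably the adaptation gates the argument on the torch version; a rough sketch along those lines (the affected range is taken from the comment above, and `device_id_kwarg` is a hypothetical helper, not the PR's actual code):

```python
import os

import torch
import torch.distributed as dist
from packaging import version


def device_id_kwarg() -> dict:
    # Versions in 2.6.0 < torch < 2.7.1 hang when device_id is passed to
    # init_process_group (pytorch/pytorch#153960), so only pass it on
    # versions outside the affected range.
    v = version.parse(torch.__version__)
    if version.parse("2.6.0") < v < version.parse("2.7.1"):
        return {}
    return {"device_id": torch.device("cuda", int(os.environ["LOCAL_RANK"]))}


dist.init_process_group(backend="nccl", **device_id_kwarg())
```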