
set `device_id` in torch's `init_process_group`

Open · stas00 opened this issue 7 months ago · 2 comments

This PR overcomes the following warning, which is emitted on any torch.distributed call when using DeepSpeed:

```
[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0
to perform barrier as devices used by this process are currently unknown. This can
potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
barrier() to force use of a particular device, or call init_process_group() with a device_id.
```

by setting `device_id` to the correct device corresponding to the `LOCAL_RANK` env var.

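A minimal sketch of the idea (names and structure are illustrative, not the exact diff in this PR):

```python
import os

import torch
import torch.distributed as dist

# LOCAL_RANK is set per process by launchers such as torchrun or the
# deepspeed launcher.
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Passing device_id tells NCCL up front which GPU this rank owns, so
# collectives such as barrier() no longer have to guess the mapping.
dist.init_process_group(backend="nccl", device_id=device)
```

The alternative the warning suggests, `dist.barrier(device_ids=[local_rank])`, only covers that one barrier call, whereas `device_id` fixes the rank-to-GPU mapping for all collectives on the process group.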

Update: discovered that torch.distributed deadlocks with torch>=2.7.0 when the `device_id` arg is used. Switching to draft for now, as we can't merge this until we know how to work around it.

stas00 avatar Apr 30 '25 19:04 stas00

@loadams?

stas00 avatar May 06 '25 18:05 stas00

> @loadams?

Sorry @stas00, I missed this and will review today.

loadams avatar May 07 '25 14:05 loadams

ok, so now we know that setting `device_id` leads to hanging in 2.6.0 < torch < 2.7.1: https://github.com/pytorch/pytorch/issues/153960

so the PR has been adapted to that.

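A sketch of such a version gate (assuming the hang window quoted above; the helper shape is illustrative, not the PR's actual code):

```python
import os

import torch
import torch.distributed as dist
from packaging.version import Version

def init_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)

    kwargs = {}
    # device_id hangs in 2.6.0 < torch < 2.7.1
    # (https://github.com/pytorch/pytorch/issues/153960), so only pass
    # it outside that window.
    torch_version = Version(torch.__version__.split("+")[0])
    if not (Version("2.6.0") < torch_version < Version("2.7.1")):
        kwargs["device_id"] = device

    dist.init_process_group(backend="nccl", **kwargs)
```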
stas00 avatar Jul 16 '25 00:07 stas00