Zhi-Kai Xu

4 comments by Zhi-Kai Xu

@jomayeri Sure. In the 4 A100 setup, the GPUs are interconnected via NVLink. But whether or not NCCL_P2P_DISABLE=1 is set, the hang always occurs. ![topo_on_TWCC](https://github.com/microsoft/DeepSpeed/assets/72068886/07290cba-216f-47ec-bdc3-5fa7aa869f97) Here is another issue. After...

Thanks for your reply! @jomayeri If I run training without DeepSpeed (using 4 V100s, but with only one active at a time), the hang doesn't occur. I was curious about...

No, I haven't. Maybe I'll try it in the next few days.

Is there any update on this feature?