Zhi-Kai Xu
@jomayeri Sure. In the 4x A100 setup, the GPUs are interconnected with NVLink. But whether or not NCCL_P2P_DISABLE=1 is set, the hang always occurs. Here is another issue. After...
Thanks for your reply! @jomayeri If I run training without DeepSpeed (using 4 V100s, but with only one active at a time), the hang doesn't occur. I was curious about...
No, I haven't. Maybe I'll try it in the next few days.
Is there any update on this feature?