Updated example_train.py hangs on CPU training

Open Heasummn opened this issue 1 year ago • 1 comments

Using the new example_train.py and running it with torchrun --nproc-per-node 3 example_train.py results in the example hanging when using CPU devices. I have been able to reproduce this on Windows and Mac OS, running x86 and M2 architectures respectively. I'm not sure how to get output logs, as canceling the training just gives a backtrace for the runner.

Mar 24 '24 04:03 Heasummn

Hi, thanks for reporting it. That's a known issue. The CPU communication backend (Gloo) does not have a good support for batch_isend_irecv, which we recently move to to communication multiple tensors in "one shot" between ranks. If you use GPU, this issue should go away.

Cc @H-Huang can we harden our support for non-GPU devices?

Mar 25 '24 01:03 kwen2501