Multi-node training is much slower than single-node training

Open YingqingHe opened this issue 3 years ago • 1 comments

Hi, when I train models with Tutel, I find that multi-node training takes much longer per step than single-node training: with n nodes, each step takes roughly n times as long as on 1 node. As a result, multi-node training takes even longer than single-node training to finish one epoch. Any suggestions for debugging this? Thanks!

Sep 29 '22 14:09 YingqingHe

Hi, thanks for reporting this issue.

In a low-end distributed environment (e.g. Ethernet with low busbw), cross-node All2All is expected to show a significant drop in bandwidth utilization compared with single-node training, where communication runs entirely over NVLink, unless you have high-end InfiniBand. This issue https://github.com/microsoft/tutel/issues/160 discusses in detail what busbw is required to reach a given training throughput.
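To get a feel for the size of that gap, here is a rough back-of-envelope sketch. It assumes a pure-bandwidth model (time is bytes on the wire divided by bus bandwidth, with no latency term), and the payload size and bandwidth figures are illustrative placeholders, not measurements from Tutel or from any particular cluster.

```python
def all2all_seconds(payload_bytes: float, busbw_gb_per_s: float, world_size: int) -> float:
    """Rough time for one All2All under a pure-bandwidth model (assumption)."""
    # Each rank keeps 1/world_size of its payload locally and sends the rest.
    bytes_on_wire = payload_bytes * (world_size - 1) / world_size
    return bytes_on_wire / (busbw_gb_per_s * 1e9)


# Hypothetical numbers, chosen only to illustrate the gap.
payload = 64 * 1024 * 1024  # 64 MiB exchanged per GPU per All2All
intra = all2all_seconds(payload, busbw_gb_per_s=200.0, world_size=8)   # NVLink-class, 1 node
inter = all2all_seconds(payload, busbw_gb_per_s=1.25, world_size=16)   # ~10 Gb/s Ethernet, 2 nodes
slowdown = inter / intra

print(f"single node: {intra * 1e3:.3f} ms per All2All")
print(f"two nodes:   {inter * 1e3:.3f} ms per All2All")
print(f"slowdown:    {slowdown:.0f}x")
```

Plugging in the busbw actually measured on your own interconnect gives a quick check of whether the slowdown you see is in the range that communication alone would explain.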

The good news is that although throughput drops when you first scale from one node to multiple nodes, adding more nodes after that does not make it significantly worse.

In addition, in a few scenarios you can set --parallel_type=adaptive:0, which does not perform All2All during training, and then check whether the step time improves a little.
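To compare step time between two such runs, a small timing helper like the sketch below may be enough. It is not part of Tutel; `mean_step_seconds` and `dummy_step` are hypothetical names, and `dummy_step` only stands in for your real training step.

```python
import time


def mean_step_seconds(step_fn, warmup=5, iters=20):
    """Average wall-clock time of step_fn over `iters` calls after a warmup."""
    # Warmup calls are excluded so one-time setup cost does not skew the mean.
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters


def dummy_step():
    # Placeholder workload. On GPU, a real step should wait for the device to
    # finish (e.g. torch.cuda.synchronize()) before returning, because kernels
    # are launched asynchronously.
    sum(i * i for i in range(10_000))


avg = mean_step_seconds(dummy_step)
print(f"mean step time: {avg * 1e3:.3f} ms")
```

Running the same measurement once with the default setting and once with --parallel_type=adaptive:0 shows how much of the step time is attributable to All2All.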

Sep 30 '22 03:09 ghostplant