Megatron-LM
optional use of mpi instead of gloo for distributed checkpoint load/save
We've been transiently seeing the error [E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] when running at scales of 10k ranks or more.
(The error seems to occur more frequently as the number of ranks increases: fewer than one failure in 20 runs at 10240 GPUs, but roughly one failure in 3 runs at 10992 GPUs.)
This PR adds an envvar (CPU_COMMS_BACKEND_OVERRIDE) that overrides Megatron Core's backend for CPU comms from the default gloo to mpi.
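For illustration, here is a minimal sketch of how an env-var backend override of this kind can be wired up. The helper names and call site below are assumptions for the example, not the actual Megatron Core change; only the envvar name and the gloo/mpi values come from this PR.

```python
import os

import torch.distributed as dist


def _cpu_comms_backend() -> str:
    # Read the override from the environment; default to the existing
    # gloo behavior when the envvar is unset.
    return os.environ.get("CPU_COMMS_BACKEND_OVERRIDE", "gloo")


def make_cpu_comms_group(ranks):
    # Hypothetical call site: create the CPU process group used for
    # distributed checkpoint load/save with the selected backend.
    # Using "mpi" requires a PyTorch build with MPI support, and an
    # unrecognized backend string should make group creation fail.
    return dist.new_group(ranks=ranks, backend=_cpu_comms_backend())
```

On the launch side this amounts to exporting CPU_COMMS_BACKEND_OVERRIDE=mpi in the job environment; when the envvar is unset, the default gloo path is unchanged.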
I've tested the new option extensively on the MLPerf benchmark at several scales.
I've tested that when CPU_COMMS_BACKEND_OVERRIDE is unset (or set to 'gloo'), the code behaves as it did before, both functionally and performance-wise.
I've tested that setting CPU_COMMS_BACKEND_OVERRIDE=BREAKME (a nonsense value) leads to a failure, confirming the envvar actually changes the backend.
I've tested that with CPU_COMMS_BACKEND_OVERRIDE=mpi, the Gloo connectFullMesh error "goes away" at very large scale. (In my first and only run today at 1380 nodes with CPU_COMMS_BACKEND_OVERRIDE unset, I hit the Gloo connectFullMesh error; I then set CPU_COMMS_BACKEND_OVERRIDE=mpi and did 5 runs in a row that all loaded the checkpoint without error.)