ghostplant
Can you upgrade and keep the NCCL version the same across all environments? Most NCCL timeout issues come from legacy libnccl bugs or from inconsistent NCCL versions. You can also...
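For checking, a small diagnostic you can run on every rank (just an illustrative sketch; it assumes the job is launched with an NCCL-enabled PyTorch build, e.g. via `torchrun`):

```python
# Diagnostic sketch: print the NCCL version bundled with PyTorch on every rank,
# so version mismatches across environments are easy to spot.
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl')   # assumes launch via torchrun or similar
rank = dist.get_rank()
print(f'[rank {rank}] torch={torch.__version__}, nccl={torch.cuda.nccl.version()}')
```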
OK, since `tutel.examples.helloworld` works well, it should be related to inequivalent data sources stored on each GPU, which result in different locally planned iteration counts and thus trigger a different number...
Since `inequivalent_tokens=True` works, it means there is no issue from "inequivalent forwarding counts" (see Case-1). It only helps when, for each iteration, the "tokens per batch" on each device...
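For reference, a sketch of where that flag goes. The constructor values below are placeholders in the style of the helloworld example, and whether `inequivalent_tokens` is accepted at forward time as assumed here should be confirmed against your Tutel version:

```python
# Illustrative only: constructor arguments are placeholders; the point is that
# inequivalent_tokens=True is assumed to be passed when running the MoE layer
# with a per-rank token count that may differ across GPUs.
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=1024,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 2048,
             'activation_fn': lambda x: F.relu(x)},
)

x = torch.randn(4, 16, 1024)               # token count may differ on each rank
y = moe_layer(x, inequivalent_tokens=True)
```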
I think [Swin Transformer](https://github.com/microsoft/Swin-Transformer) would provide such a pretrained MoE model based on Tutel. For other language models over Fairseq, this repo currently only provides [scripts](https://github.com/microsoft/tutel/tree/main/tutel/examples/fairseq_moe) that train models from...
You can use the general way to save & reload models in any kind of distributed setting, with each peer process holding one slice of the checkpoint files. https://github.com/microsoft/tutel/pull/88/files
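Roughly, the idea looks like the sketch below (not the exact code from the PR; the `save_sharded` / `load_sharded` names are just for illustration): each rank saves and reloads its own slice.

```python
# Sketch: each rank saves and reloads only its own shard of the state dict,
# so rank-local expert weights are preserved.
import os
import torch
import torch.distributed as dist

def save_sharded(model, path='checkpoint'):
    os.makedirs(path, exist_ok=True)
    shard = os.path.join(path, f'model.rank{dist.get_rank()}.pt')
    torch.save(model.state_dict(), shard)

def load_sharded(model, path='checkpoint'):
    shard = os.path.join(path, f'model.rank{dist.get_rank()}.pt')
    model.load_state_dict(torch.load(shard, map_location='cpu'))
```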
Do you skip allreduce on expert parameters? If not, their values will become the same.
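For example (a sketch, assuming expert parameters are tagged with a `skip_allreduce` attribute, e.g. via `scan_expert_func` as in the Tutel examples), the manual gradient allreduce step would look like:

```python
# Sketch: average gradients of shared (non-expert) parameters only, leaving
# expert-local parameters untouched so they can stay different on each rank.
import torch.distributed as dist

def allreduce_non_expert_grads(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if getattr(param, 'skip_allreduce', False):   # assumed tag on expert params
            continue
        if param.grad is not None:
            dist.all_reduce(param.grad)
            param.grad.div_(world_size)
```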
Is it related to `moe_layer`? Does this suggestion help? https://github.com/pytorch/pytorch/issues/22436
It should be related to the usage of `torch.nn.parallel.DistributedDataParallel`. If you try `helloworld_ddp.py` with multiple gates, since the model also contains parameters that don't contribute to the loss, it is verified...
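A minimal sketch of the usual PyTorch-side workaround, which is to pass `find_unused_parameters=True` when wrapping the model in DDP (the toy model here is just a placeholder):

```python
# Toy model, for illustration only; the point is the find_unused_parameters flag,
# which lets DDP tolerate parameters that never receive gradients.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')   # assumes launch via torchrun or similar
model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model,
                device_ids=[torch.cuda.current_device()],
                find_unused_parameters=True)
```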
Is your environment Windows OS?
On Linux, you can install Tutel from source with the `python ./setup.py` command in a conda environment. On Windows, Tutel currently doesn't support that. We'll investigate it soon.