ghostplant
Can you upgrade and keep the NCCL version the same across all environments? Most NCCL timeout issues come from legacy libnccl bugs or from inconsistent NCCL versions. You can also...
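For checking, a small diagnostic you can run on every rank (just an illustrative sketch; it assumes the job is launched with an NCCL-enabled PyTorch build, e.g. via `torchrun`):

```python
# Diagnostic sketch: print the NCCL version bundled with PyTorch on every rank,
# so version mismatches across environments are easy to spot.
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl')   # assumes launch via torchrun or similar
rank = dist.get_rank()
print(f'[rank {rank}] torch={torch.__version__}, nccl={torch.cuda.nccl.version()}')
```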
OK, since `tutel.examples.helloworld` works well, it should be related to inequivalent data sources stored on each GPU, which result in different locally planned iteration counts and thus trigger a different number...
Since `inequivalent_tokens=True` works, it means there is no issue from "inequivalent forwarding counts" (see Case-1). It only helps when, for each iteration, the "tokens per batch" on each device...
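For reference, a sketch of where that flag goes. The constructor values below are placeholders in the style of the helloworld example, and whether `inequivalent_tokens` is accepted at forward time as assumed here should be confirmed against your Tutel version:

```python
# Illustrative only: constructor arguments are placeholders; the point is that
# inequivalent_tokens=True is assumed to be passed when running the MoE layer
# with a per-rank token count that may differ across GPUs.
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=1024,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 2048,
             'activation_fn': lambda x: F.relu(x)},
)

x = torch.randn(4, 16, 1024)               # token count may differ on each rank
y = moe_layer(x, inequivalent_tokens=True)
```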
I think [Swin Transformer](https://github.com/microsoft/Swin-Transformer) would provide such a pretrained MoE model based on Tutel. For other language models over Fairseq, this repo currently only provides [scripts](https://github.com/microsoft/tutel/tree/main/tutel/examples/fairseq_moe) that train models from...
You can use the general way to save & reload models in any kind of distributed setting, with each peer process holding one slice of the checkpoint files. https://github.com/microsoft/tutel/pull/88/files
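Roughly, the idea looks like the sketch below (not the exact code from the PR; the `save_sharded` / `load_sharded` names are just for illustration): each rank saves and reloads its own slice.

```python
# Sketch: each rank saves and reloads only its own shard of the state dict,
# so rank-local expert weights are preserved.
import os
import torch
import torch.distributed as dist

def save_sharded(model, path='checkpoint'):
    os.makedirs(path, exist_ok=True)
    shard = os.path.join(path, f'model.rank{dist.get_rank()}.pt')
    torch.save(model.state_dict(), shard)

def load_sharded(model, path='checkpoint'):
    shard = os.path.join(path, f'model.rank{dist.get_rank()}.pt')
    model.load_state_dict(torch.load(shard, map_location='cpu'))
```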
Do you skip allreduce on expert parameters? If not, their values will become the same.
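For example (a sketch, assuming expert parameters are tagged with a `skip_allreduce` attribute, e.g. via `scan_expert_func` as in the Tutel examples), the manual gradient allreduce step would look like:

```python
# Sketch: average gradients of shared (non-expert) parameters only, leaving
# expert-local parameters untouched so they can stay different on each rank.
import torch.distributed as dist

def allreduce_non_expert_grads(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if getattr(param, 'skip_allreduce', False):   # assumed tag on expert params
            continue
        if param.grad is not None:
            dist.all_reduce(param.grad)
            param.grad.div_(world_size)
```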
Is it related to `moe_layer`? Does this suggestion help? https://github.com/pytorch/pytorch/issues/22436
It should be related to the usage of `torch.nn.parallel.DistributedDataParallel`. If you try `helloworld_ddp.py` with multiple gates, since the model also contains parameters that don't contribute to the loss, it is verified...
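A minimal sketch of the usual PyTorch-side workaround, which is to pass `find_unused_parameters=True` when wrapping the model in DDP (the toy model here is just a placeholder):

```python
# Toy model, for illustration only; the point is the find_unused_parameters flag,
# which lets DDP tolerate parameters that never receive gradients.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')   # assumes launch via torchrun or similar
model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model,
                device_ids=[torch.cuda.current_device()],
                find_unused_parameters=True)
```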
Is your environment Windows OS?
On Linux, you can install Tutel from source with the `python ./setup.py` command in a conda environment. On Windows, Tutel currently doesn't support that. We'll investigate it soon.