This is a duplicate request of #177. We are going to add some utility functions to help with this conversion.
Thanks for the information. This is a duplicate of #173. We'll update the fairseq patch to add `inequivalent_tokens=True`, which was recently added to Tutel but is not yet in the fairseq patch. You may...
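For reference, a rough sketch of how the flag would be used — this is not the actual fairseq patch. Assumptions: a distributed environment is already initialized (as in `tutel/examples/helloworld.py`), `inequivalent_tokens` is passed at forward time as in recent Tutel versions, and the gate/expert options below are only illustrative; please check the helloworld examples for the exact keys in your version.

```python
# Illustrative sketch only; constructor options are placeholders.
import torch
from tutel import moe as tutel_moe

moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=1024,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 4096},
).cuda()

x = torch.randn(8, 1024, device='cuda')   # local batch; token counts may differ across ranks
y = moe(x, inequivalent_tokens=True)      # tell the layer that ranks may dispatch unequal token counts
```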
That's interesting. If it's true that one GPU performs 5 forwards and another GPU performs 6 forwards, does traditional data parallelism even work? I think the application itself has to...
OK, this root cause makes sense. The MoE layer within one process does not know the application's intent, i.e., whether other processes are going to forward their MoE layers together with it or not....
@zeliu98 We need to add an assertion message to avoid unknown errors like this. And thanks for the information! @Luodian
We have added a `gate_noise` assertion and a device cast in the latest commit. Thanks for pointing out this bug.
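To illustrate the idea (this is a hypothetical sketch, not the code from the actual commit): an assertion with a readable reason plus an explicit device cast, so a bad `gate_noise` value fails early instead of raising an opaque runtime error.

```python
# Hypothetical illustration of the kind of check described above.
import torch

def noisy_gate_scores(scores: torch.Tensor, gate_noise: float, device: torch.device) -> torch.Tensor:
    # Fail fast with an explanatory message instead of an unclear downstream error.
    assert gate_noise >= 0.0, f"Expected gate_noise >= 0, but got {gate_noise}"
    scores = scores.to(device)                          # explicit device cast
    if gate_noise > 0.0:
        scores = scores + torch.randn_like(scores) * gate_noise
    return scores
```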
It is usually due to environment issues (e.g. **an improper CXX compiler, or missing CUDA/NCCL dependencies**) which make the Tutel installation enable CPU support only, e.g. `python3 -m tutel.examples.helloworld --device cpu`....
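A quick sanity check you can run first (assuming the usual case where a CPU-only install traces back to PyTorch itself not seeing CUDA, or to a missing compiler/NCCL at build time):

```python
# Verify that PyTorch was built with CUDA and can see your GPUs before (re)installing Tutel.
import torch

print("torch version  :", torch.__version__)
print("built with CUDA:", torch.version.cuda)        # None means a CPU-only PyTorch build
print("CUDA available :", torch.cuda.is_available())
print("device count   :", torch.cuda.device_count())
```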
Yes, Tutel is able to support native PyTorch AMP. Please follow this example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_amp.py#L76 Note that you need to use `@autocast()` and `with autocast():` properly according to PyTorch's docs....
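A minimal sketch of the native-AMP pattern, assuming `build_moe_model` and `data_loader` are placeholders for however you construct your Tutel MoE model and data — see `helloworld_amp.py` above for the authoritative example:

```python
# Generic PyTorch AMP loop; model/data construction is a placeholder.
import torch
from torch.cuda.amp import autocast, GradScaler

model = build_moe_model().cuda()               # placeholder: your Tutel MoE model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()

for x, target in data_loader:                  # placeholder data loader
    optimizer.zero_grad()
    with autocast():                           # run the forward pass in mixed precision
        output = model(x.cuda())
        loss = torch.nn.functional.mse_loss(output, target.cuda())
    scaler.scale(loss).backward()              # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```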
If you pickle the model for a single GPU, everything will be fine, because AllToAll is not involved in Tutel's MoE layer in that case. Does that match your expectation? PyTorch's NCCL operations...
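A sketch for the single-GPU case (no NCCL/AllToAll state involved); `model` and `build_moe_model` are placeholders for your own model and its constructor. Saving the `state_dict` is the usual robust route, and pickling the full module object also works here but is more version-fragile:

```python
# Save/restore a single-GPU MoE model via its state_dict.
import torch

torch.save(model.state_dict(), "moe_checkpoint.pt")        # parameters only

restored = build_moe_model().cuda()                         # placeholder: rebuild the same architecture
restored.load_state_dict(torch.load("moe_checkpoint.pt"))
restored.eval()
```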