bp of shared parameters and experts

Open a157801 opened this issue 2 years ago • 7 comments

DDP in PyTorch cannot distinguish experts from other shared parameters, so expert parameters may be updated with the all-reduced shared gradient. TutelDistributedOptimizer seems to be an implementation of ZeRO, which does not affect the gradients. How does Tutel deal with this problem?

a157801 avatar Jun 14 '22 08:06 a157801

Yes, TutelDistributedOptimizer is a replacement for PyTorch DDP in that example (helloworld_ddp_tutel), making the whole model synchronization transparent.

TutelDistributedOptimizer not only implements the ZeRO optimization, but also leverages a built-in mask (_tutel_expert) to distinguish whether a parameter is shared or comes from the creation of tutel.moe.moe_layer.

Note that TutelDistributedOptimizer only treats parameters created by tutel.moe.moe_layer as expert parameters. If the model never uses tutel.moe.moe_layer, there is no difference from PyTorch DDP (except that TutelDistributedOptimizer includes the ZeRO feature).
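
Conceptually, that split means only the gradients of shared parameters are all-reduced across data-parallel ranks, while expert gradients stay local to each rank. A rough sketch of the idea (not the actual Tutel implementation; the _tutel_expert attribute name simply follows the mask mentioned above):

```python
import torch
import torch.distributed as dist

def sync_shared_gradients(model: torch.nn.Module, group=None):
    """All-reduce gradients of shared parameters only (conceptual sketch)."""
    world_size = dist.get_world_size(group=group)
    for param in model.parameters():
        if getattr(param, '_tutel_expert', False):
            continue  # expert parameter: its gradient stays rank-local
        if param.grad is not None:
            # Average the shared gradient across all data-parallel ranks.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
            param.grad.div_(world_size)
```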

ghostplant avatar Jun 14 '22 10:06 ghostplant

Thank you for your answer. I notice that the _tutel_expert flag is used to split the parameters. But it seems that the gradients of expert parameters carrying _tutel_expert will still be all-reduced by DDP. The _tutel_expert flag indicates that these parameters are experts and will not be split across different GPUs, but it does not control the allreduce operation.

a157801 avatar Jun 14 '22 10:06 a157801

To use TutelDistributedOptimizer, which has parameter synchronization included, you should no longer wrap the model with DistributedDataParallel.

ghostplant avatar Jun 15 '22 08:06 ghostplant

I notice the code in the Swin-Transformer repo (https://github.com/microsoft/Swin-Transformer/blob/main/main_moe.py), which uses the standard PyTorch optimizer and DDP to train these MoE models. Maybe there is something wrong there. Thanks a lot.

a157801 avatar Jun 17 '22 05:06 a157801

It is a version that manually distinguishes the parameter types, following helloworld_ddp.py.

ghostplant avatar Jun 17 '22 06:06 ghostplant

Does it work by setting skip_allreduce to true in the scan function?

a157801 avatar Jun 17 '22 07:06 a157801

To use Tutel MoE with the PyTorch DDP backend, you need to not only set skip_allreduce to true in the MoE scan function, but also recollect the parameters carrying those masks and tell DDP to skip synchronizing them, as done at https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L92. Otherwise, PyTorch DDP won't know they are expert parameters, so they will be synchronized unexpectedly.
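
For reference, a minimal sketch of that recipe (build_model is a hypothetical stand-in for a model containing tutel.moe.moe_layer; the skip_allreduce attribute check and the _ddp_params_and_buffers_to_ignore attribute mirror what the linked example line relies on, but exact names may differ across Tutel and PyTorch versions):

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl')
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)

model = build_model().cuda()  # hypothetical: a model that uses tutel.moe.moe_layer

# Collect names of expert parameters. The skip_allreduce attribute is assumed
# to be set on expert parameters by the MoE scan step described above.
expert_param_names = [
    name for name, param in model.named_parameters()
    if getattr(param, 'skip_allreduce', False)
]

# Tell DDP to skip these parameters during gradient synchronization;
# DDP reads this attribute when the wrapper is constructed.
model._ddp_params_and_buffers_to_ignore = expert_param_names

ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

With that in place, DDP synchronizes only the shared parameters, and each rank keeps its own expert gradients for the local optimizer step.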

ghostplant avatar Jun 17 '22 08:06 ghostplant