
How to implement Fairseq-MoE training checkpoint like Swin-MoE?

Open · withinmiaov opened this issue 2 years ago • 1 comment

First, I want to thank the tutel team for open-sourcing this work; it's a very good and practical framework. I want to use tutel's MoE in fairseq NLP tasks, but I ran into a problem: fairseq's original checkpoint logic can't save and load the expert parameters that are distributed across different GPUs. How should I modify the fairseq model to support checkpoints the way Swin-MoE does?
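For context, here is a minimal sketch of the problem, assuming Tutel's `moe_layer` API as shown in its README (the checkpoint file names are illustrative, not from either codebase): with expert parallelism, each rank's `state_dict` holds *different* expert weights, so a single checkpoint written only by rank 0 silently drops every other rank's experts.

```python
import os
import torch
import torch.distributed as dist
from tutel import moe as tutel_moe  # assumes the tutel package is installed

dist.init_process_group(backend='nccl')  # e.g. launched via torchrun
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)

# One local expert per GPU: the expert weights on rank 0 are entirely
# different parameters from the expert weights on rank 1, 2, ...
model = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=512,
    experts={'type': 'ffn', 'count_per_node': 1,
             'hidden_size_per_expert': 2048},
).cuda()

# WRONG for MoE: only rank 0's local experts end up in the file.
if dist.get_rank() == 0:
    torch.save(model.state_dict(), 'checkpoint.pt')

# Swin-MoE-style: every rank saves (and later loads) its own shard.
torch.save(model.state_dict(), f'checkpoint_rank{dist.get_rank()}.pt')
```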

withinmiaov · Nov 10 '23 07:11

Hi, you may need to rename save_dir so that each per-device process saves to a unique destination:

https://github.com/facebookresearch/fairseq/blob/da8fb630880d529ab47e53381c30ddc8ad235216/fairseq/dataclass/configs.py#L645

You can change the default `save_dir` path to `f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}"` or `f"checkpoints-dev{os.environ.get('RANK', 0)}"`.
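To make that concrete, here is a hedged sketch of what the edit to `CheckpointConfig` in fairseq/dataclass/configs.py could look like (the field's exact location and help text may differ across fairseq versions):

```python
import os
from dataclasses import dataclass, field

from fairseq.dataclass import FairseqDataclass


@dataclass
class CheckpointConfig(FairseqDataclass):
    # Give every process its own checkpoint directory, so the expert
    # parameters held locally by that rank are saved to, and restored
    # from, a unique path.
    save_dir: str = field(
        default=f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}",
        metadata={"help": "path to save checkpoints"},
    )
```

Changing the default inside configs.py (rather than passing `--save-dir checkpoints-dev$LOCAL_RANK` on the command line) matters because, with a typical shell invocation, `$LOCAL_RANK` would be expanded once before the launcher forks, whereas each training process evaluates `os.environ.get(...)` itself. Also note that `LOCAL_RANK` is only unique within one node; for multi-node training, `RANK` gives globally unique directories.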

ghostplant · Nov 12 '23 05:11