duanjunwen
1. Add DistributedAdafactor in "./colossalai/nn/optimizer/distributed_adafactor.py"; supported parameter input formats: RowParallel + Zero2 and ColParallel + Zero2. 2. Add test cases in "./tests/test_optimizer/test_distributred_adafactor_optim.py".
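As a rough illustration of the two parameter layouts named above, the sketch below shards a single weight matrix the way Megatron-style ColParallel/RowParallel layers usually do (dim 0 vs. dim 1 of a `(out_features, in_features)` weight); that convention, and the variable names, are assumptions rather than part of the commit.

```python
# Minimal sketch (not the committed code) of the two tensor-parallel layouts
# the optimizer must handle; dim assignment follows the usual Megatron-style
# convention and is an assumption here.
import torch

tp_size, rank = 2, 0            # hypothetical tensor-parallel size and rank
W = torch.randn(8, 16)          # full (unsharded) weight, (out, in)

col_shard = torch.chunk(W, tp_size, dim=0)[rank]   # ColParallel shard: (4, 16)
row_shard = torch.chunk(W, tp_size, dim=1)[rank]   # RowParallel shard: (8, 8)

# Under ZeRO-2 the optimizer states for these shards are further partitioned
# across the data-parallel group, so the distributed optimizer has to rebuild
# Adafactor's per-row / per-column second-moment statistics from local shards.
print(col_shard.shape, row_shard.shape)
```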
1. Update MoeHybridParallelPlugin; 2. Use MoeHybridParallelPlugin to replace MoEManager; 3. Remove test_moe_checkpoint.py's dependency on MoEManager.
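A hedged sketch of the plugin-based path that replaces the global MoEManager: the model factory is hypothetical, and the plugin/launch/checkpoint arguments (`tp_size`, `pp_size`, `ep_size`, `zero_stage`, `shard`) are assumptions to check against the actual signatures.

```python
# Sketch only: run under torchrun with a distributed environment; older
# colossalai versions may require launch_from_torch(config={}).
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import MoeHybridParallelPlugin

colossalai.launch_from_torch()

model = torch.nn.Sequential(torch.nn.Linear(8, 8))   # placeholder; a real MoE model is assumed
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Parallelism is configured on the plugin instead of a global MoEManager.
plugin = MoeHybridParallelPlugin(tp_size=1, pp_size=1, ep_size=2, zero_stage=1)
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)

# Checkpointing also goes through the booster, which is why
# test_moe_checkpoint.py no longer needs MoEManager state (flags are assumptions).
booster.save_model(model, "moe_ckpt", shard=True)
```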
Hi @apachemycat, would you mind sharing the version of flash-attn in your environment? I am using flash-attn==2.5.7 and everything looks good. Also, you can replace dropout_layer_norm with torch.nn.functional.layer_norm &...
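For reference, a rough plain-PyTorch stand-in for flash-attn's fused dropout + residual-add + layer-norm; it does not reproduce all of dropout_layer_norm's options (prenorm, rowscale, etc.), and the function name here is just illustrative.

```python
# Approximate replacement built only from torch.nn.functional ops.
import torch
import torch.nn.functional as F

def dropout_add_layer_norm(x, residual, weight, bias, p, eps, training=True):
    # dropout on x, add the residual, then layer-norm over the last dimension
    out = F.dropout(x, p=p, training=training) + residual
    return F.layer_norm(out, (out.shape[-1],), weight, bias, eps)

x = torch.randn(2, 16, 64)
res = torch.randn(2, 16, 64)
w, b = torch.ones(64), torch.zeros(64)
y = dropout_add_layer_norm(x, res, w, b, p=0.1, eps=1e-5)
print(y.shape)  # torch.Size([2, 16, 64])
```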