Support for MS-AMP in FSDP
What would you like to be added: Support for MS-AMP in FSDP.
Why is this needed: This will help train large models with optimizer state sharding.
Thanks for your interest in our work!
We will add support for MS-AMP in FSDP : )
@wkcn is there a timeline you guys are targeting for FSDP integration?
When applying FP8 to FSDP, there are two problems we need to solve:

1. FSDP requires that all parameters have the same dtype. If we change only some parameters to FP16/FP8, this rule is broken.
2. Each FP8 tensor has a scaling factor. When the optimizer updates a parameter, we need to synchronize the scaling factor across ranks (see the sketch after this list).
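
A minimal sketch of problem 2, assuming FP8 tensors are represented as a quantized payload plus a float32 scaling factor. The class `Fp8Shard` and the helper `sync_scale_` are illustrative names, not part of MS-AMP or FSDP; the sketch also assumes a `torch.distributed` process group is already initialized. For problem 1, one common workaround is to store the FP8 payload as raw bytes (e.g. a uint8 view) so the flat parameter keeps a uniform dtype, but that is not shown here.

```python
import torch
import torch.distributed as dist

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3


class Fp8Shard:
    """One rank's shard of a parameter: quantized payload + scaling factor."""

    def __init__(self, shard_fp32: torch.Tensor):
        # amax is computed from the local shard only, so it can differ per rank.
        self.amax = shard_fp32.abs().max()
        self.scale = FP8_E4M3_MAX / self.amax.clamp(min=1e-12)
        # Real code would cast (shard_fp32 * self.scale) to an FP8 dtype;
        # we keep fp32 here so the sketch runs on any PyTorch build.
        self.payload = shard_fp32 * self.scale

    def sync_scale_(self):
        # Problem 2: after the optimizer updates each local shard, all ranks
        # must agree on one amax/scale; otherwise the gathered full parameter
        # mixes values quantized with incompatible scales.
        dist.all_reduce(self.amax, op=dist.ReduceOp.MAX)
        self.scale = FP8_E4M3_MAX / self.amax.clamp(min=1e-12)
```

In this sketch the scaling factor is synchronized with an all-reduce of the per-shard amax (MAX reduction), which is one plausible way to keep the scales consistent; the actual MS-AMP/FSDP integration may handle this differently.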