JackieWu
Thanks for your attention to our work! We use the codebase https://github.com/NVIDIA/Megatron-LM with the patch https://github.com/Azure/MS-AMP-Examples/blob/main/gpt3/Megatron-LM.patch. Sequence parallelism is enabled, but activation checkpointing is not used.
Hi @rationalism, thanks for your attention to our work! We have not implemented MS-AMP support for DeepSpeed ZeRO 3; MS-AMP support for ZeRO 1 and ZeRO 2 is available.
Hi @yatorho, PyTorch added a new assertion that checks whether each param is a torch.Tensor, but ScalingTensor in MS-AMP is not a torch.Tensor. A temporary solution is to comment out Line 256...
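For context, the assertion is of the following form (paraphrased here, not the exact PyTorch source); ScalingTensor fails it because it does not subclass torch.Tensor:

```python
import torch

def _check_param(param):
    # A ScalingTensor is not a torch.Tensor subclass, so this raises for it.
    if not isinstance(param, torch.Tensor):
        raise TypeError(
            f"optimizer can only optimize Tensors, but got {type(param).__name__}"
        )
```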
Thanks for your attention to our work! You can replace FP8Linear with torch.nn.Linear; the ScalingTensor weight and bias can be converted to torch.float32 with `weight = weight.float()`, bias...
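As a rough illustration, the conversion could look like the sketch below; the helper name fp8_linear_to_nn_linear is ours, and it assumes the FP8Linear module exposes weight/bias ScalingTensors that support .float(), as described above:

```python
import torch

def fp8_linear_to_nn_linear(fp8_linear):
    """Build a torch.nn.Linear from an FP8Linear-like module.

    Assumes fp8_linear.weight / fp8_linear.bias are ScalingTensors that can be
    converted to torch.float32 via .float().
    """
    weight = fp8_linear.weight.float()
    bias = fp8_linear.bias.float() if fp8_linear.bias is not None else None

    out_features, in_features = weight.shape
    linear = torch.nn.Linear(in_features, out_features, bias=bias is not None)
    with torch.no_grad():
        linear.weight.copy_(weight)
        if bias is not None:
            linear.bias.copy_(bias)
    return linear
```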
Hi @jon-chuang, I am sorry for the late reply. Thanks for your attention to our work!

### 1. Performance

> The repo only mention training accuracy and memory savings. However,...
Thanks for your attention to our work!

1. The datatype of the first moment is fp8-e4m3, and that of the second moment is fp16. They are both scaling tensors with...
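For reference, these optimizer states come from wrapping the model and optimizer with MS-AMP. A minimal sketch follows; the opt_level value is an assumption here, so check the MS-AMP docs for the level that matches your setup, and a CUDA device is assumed:

```python
import torch
import msamp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# msamp.initialize swaps in FP8-aware modules and a low-bit optimizer whose
# Adam moments are kept as scaling tensors (fp8-e4m3 / fp16, as stated above).
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
```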
> Thank you for your answer. So, why is the reason you define the first moment as uint8 datatype:
>
> https://github.com/Azure/MS-AMP/blob/0a2cd721fa68ee725e3b2fb132df02ddb8069d62/msamp/__init__.py#L81C9-L81C23

There is no native FP8 datatype in...
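To illustrate the point (this is not MS-AMP's actual code): an FP8 value is only 8 bits, so its bit pattern can be carried in a uint8 tensor, with the scaling factor kept separately. Recent PyTorch releases do expose a native FP8 dtype, which makes the equivalence easy to see:

```python
import torch

x = torch.randn(4)

# Native FP8 (available only in recent PyTorch releases).
fp8 = x.to(torch.float8_e4m3fn)

# The same 8-bit payload reinterpreted as uint8, which is how FP8 data can be
# stored when no native FP8 dtype is available.
raw = fp8.view(torch.uint8)
print(raw.dtype, raw.shape)  # torch.uint8 torch.Size([4])
```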
Hi @afcruzs, thanks for your attention to our work! We need to organize the data.
@LSC527 Thank you for pointing out the bug! A temporary solution is to increase the reduce_bucket_size of zero_optimization in the DeepSpeed config, which avoids reducing very large tensors.

```json
"zero_optimization": {...
```
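For reference, here is a minimal sketch of where that setting lives, written as a Python config dict (the stage and the bucket value are placeholders, not recommendations; how you feed the config to DeepSpeed depends on your launcher):

```python
# Illustrative DeepSpeed config dict; reduce_bucket_size is the relevant knob here.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,      # placeholder, adjust to your setup
    "zero_optimization": {
        "stage": 2,                           # placeholder stage
        "reduce_bucket_size": 1_000_000_000,  # increased per the workaround above
    },
}
```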
@LSC527 FP8 accelerates training significantly when the model is relatively large (> 6B parameters). MS-AMP can reduce memory usage, which enables a larger batch size. And it can...