JackieWu
Thanks for your attention to our work! We use the codebase https://github.com/NVIDIA/Megatron-LM with the patch https://github.com/Azure/MS-AMP-Examples/blob/main/gpt3/Megatron-LM.patch. Sequence parallelism is enabled, but activation checkpointing is not used.
Hi @rationalism, thanks for your attention to our work! We have not implemented MS-AMP support for DeepSpeed ZeRO 3; MS-AMP support for ZeRO 1 and ZeRO 2 is available.
Hi @yatorho, PyTorch added a new assertion that checks whether each param is a torch.Tensor, but ScalingTensor in MS-AMP is not a torch.Tensor. A temporary solution is to comment out Line 256...
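For context, the assertion is of the following form (paraphrased here, not the exact PyTorch source); ScalingTensor fails it because it does not subclass torch.Tensor:

```python
import torch

def _check_param(param):
    # A ScalingTensor is not a torch.Tensor subclass, so this raises for it.
    if not isinstance(param, torch.Tensor):
        raise TypeError(
            f"optimizer can only optimize Tensors, but got {type(param).__name__}"
        )
```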
Thanks for your attention to our work! You can replace FP8Linear with torch.nn.Linear; the ScalingTensor weight and bias can be converted to torch.float32 with `weight = weight.float()`, bias...
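As a rough illustration, the conversion could look like the sketch below; the helper name fp8_linear_to_nn_linear is ours, and it assumes the FP8Linear module exposes weight/bias ScalingTensors that support .float(), as described above:

```python
import torch

def fp8_linear_to_nn_linear(fp8_linear):
    """Build a torch.nn.Linear from an FP8Linear-like module.

    Assumes fp8_linear.weight / fp8_linear.bias are ScalingTensors that can be
    converted to torch.float32 via .float().
    """
    weight = fp8_linear.weight.float()
    bias = fp8_linear.bias.float() if fp8_linear.bias is not None else None

    out_features, in_features = weight.shape
    linear = torch.nn.Linear(in_features, out_features, bias=bias is not None)
    with torch.no_grad():
        linear.weight.copy_(weight)
        if bias is not None:
            linear.bias.copy_(bias)
    return linear
```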
Hi @jon-chuang, I am sorry for the late reply. Thanks for your attention to our work!

### 1. Performance

> The repo only mention training accuracy and memory savings. However,...
Thanks for your attention to our work!

1. The datatype of the first moment is fp8-e4m3, and that of the second moment is fp16. They are both scaling tensors with...
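For reference, these optimizer states come from wrapping the model and optimizer with MS-AMP. A minimal sketch follows; the opt_level value is an assumption here, so check the MS-AMP docs for the level that matches your setup, and a CUDA device is assumed:

```python
import torch
import msamp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# msamp.initialize swaps in FP8-aware modules and a low-bit optimizer whose
# Adam moments are kept as scaling tensors (fp8-e4m3 / fp16, as stated above).
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
```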
> Thank you for your answer. So, why is the reason you define the first moment as uint8 datatype:
>
> https://github.com/Azure/MS-AMP/blob/0a2cd721fa68ee725e3b2fb132df02ddb8069d62/msamp/__init__.py#L81C9-L81C23

There is no native FP8 datatype in...
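To illustrate the point (this is not MS-AMP's actual code): an FP8 value is only 8 bits, so its bit pattern can be carried in a uint8 tensor, with the scaling factor kept separately. Recent PyTorch releases do expose a native FP8 dtype, which makes the equivalence easy to see:

```python
import torch

x = torch.randn(4)

# Native FP8 (available only in recent PyTorch releases).
fp8 = x.to(torch.float8_e4m3fn)

# The same 8-bit payload reinterpreted as uint8, which is how FP8 data can be
# stored when no native FP8 dtype is available.
raw = fp8.view(torch.uint8)
print(raw.dtype, raw.shape)  # torch.uint8 torch.Size([4])
```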
Hi @afcruzs, thanks for your attention to our work! We need to organize the data.
@LSC527 Thank you for pointing out the bug! A temporary solution is to increase the reduce_bucket_size of zero_optimization in the DeepSpeed config, which avoids reducing very large tensors.

```json
"zero_optimization": {...
```
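For reference, here is a minimal sketch of where that setting lives, written as a Python config dict (the stage and the bucket value are placeholders, not recommendations; how you feed the config to DeepSpeed depends on your launcher):

```python
# Illustrative DeepSpeed config dict; reduce_bucket_size is the relevant knob here.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,      # placeholder, adjust to your setup
    "zero_optimization": {
        "stage": 2,                           # placeholder stage
        "reduce_bucket_size": 1_000_000_000,  # increased per the workaround above
    },
}
```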
@LSC527 FP8 accelerates training significantly when the model is relatively large (> 6B parameters). MS-AMP can reduce memory usage, which enables a larger batch size. And it can...