JackieWu

Results 88 comments of JackieWu

It also contains the master weights and the optimizer states. You can keep only the value corresponding to the key `state_dict`.

```python
import torch

# Keep only the model weights; drop the master weights and optimizer states.
ckpt = torch.load(checkpoint_fname)
new_ckpt = dict(state_dict=ckpt['state_dict'])
torch.save(new_ckpt, saved_fname)
```
...

Hi @gudrb, thanks for your attention to our work! It seems that the GPU operator is not built. Could you please try to rebuild the RPE operators? The environment...

> I solved this problem by installing a slightly different version of PyTorch.
>
> For CUDA 12.2, I used the following command:
>
> conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0...

@muellerzr Thanks for your contribution! The PR looks good to me. Sorry that I am not at Microsoft and do not have the authorization to review and merge the pull...

Hi @zigzagcai, thanks for your attention to our work! The FP8 tensor with a scaling factor is stored in a uint8 tensor and an FP32 scalar. Therefore, the FP8...
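
As a rough illustration of the storage layout described in that comment, the sketch below counts bytes for a uint8 payload (one byte per element) plus a single FP32 scaling factor, and compares it with plain FP32 storage. The function names are hypothetical and not part of MS-AMP; the count ignores any metadata a real checkpoint may carry.

```python
UINT8_BYTES = 1  # one byte per FP8 element, held in a uint8 tensor
FP32_BYTES = 4   # one 32-bit float


def fp8_storage_bytes(num_elements):
    """Bytes for the uint8 payload plus one FP32 scaling factor (assumed layout)."""
    return num_elements * UINT8_BYTES + FP32_BYTES


def fp32_storage_bytes(num_elements):
    """Bytes for the same number of elements stored as plain FP32."""
    return num_elements * FP32_BYTES


if __name__ == "__main__":
    n = 1024 * 1024
    print(fp8_storage_bytes(n), fp32_storage_bytes(n))
```

For large tensors the single scaling factor is negligible, so the payload is roughly a quarter the size of the FP32 version.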

Hi @Zhong1015, thank you for your continued support! :) The definition you provided is correct. The concept of `bucket` comes from the hash algorithm. A relative position `(x1-x2,...
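
To make the hashing analogy concrete, here is a minimal sketch of mapping a relative position to a bucket index. It assumes 2D offsets `(dx, dy)` and a simple clip-and-flatten rule; the function `bucket_index` and the parameter `max_dist` are hypothetical, and this is not necessarily the mapping the project uses.

```python
def bucket_index(dx, dy, max_dist=2):
    """Map a 2D relative position to a bucket id (illustrative rule only).

    Each offset is clipped to [-max_dist, max_dist], shifted to be
    non-negative, and the pair is flattened into a single integer.
    """
    side = 2 * max_dist + 1
    cx = max(-max_dist, min(max_dist, dx)) + max_dist
    cy = max(-max_dist, min(max_dist, dy)) + max_dist
    return cx * side + cy


if __name__ == "__main__":
    # Distant offsets collide with the clipped boundary, like keys sharing a hash bucket.
    print(bucket_index(0, 0), bucket_index(2, 2), bucket_index(5, 5))
```

Many relative positions share one bucket, much as many keys share one hash slot, so the number of learnable entries stays bounded by the number of buckets rather than by the number of distinct offsets.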

Hi @leedrake5, thanks for your attention to our work! I could not reproduce the issue. It seems that the packages `msamp_arithmetic` and `msamp_adamw` are not copied into the `site-packages`...

@leedrake5 The custom NCCL library in MS-AMP is used to support all-reduce operations for FP8 weight gradients. If the custom NCCL is not installed, the FP8 all-reduce in Megatron Optimizer...