Shubhanjan Shekhar

Results: 3 comments by Shubhanjan Shekhar

> not sure if this was mentioned anywhere, but this PR **breaks training checkpoint saving** because
>
> 1. the grad norm is added to `TrainerState.log_history` **as a `tensor`**
> ...
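For context, here is a minimal sketch of why that breaks saving, assuming the usual path where the trainer state (including `log_history`) is dumped to JSON as `trainer_state.json`; the variable names below are illustrative, not the actual Trainer code:

```python
import json
import torch

log_history = []

# The grad norm gets logged as a torch tensor rather than a plain float.
grad_norm = torch.tensor(1.2345)
log_history.append({"grad_norm": grad_norm})

try:
    # Serializing the state, as checkpoint saving would, fails on the tensor.
    json.dumps({"log_history": log_history})
except TypeError as err:
    print(f"checkpoint save would fail: {err}")

# Converting to a plain Python float first (e.g. grad_norm.item())
# keeps the logged state JSON-serializable.
log_history[-1]["grad_norm"] = grad_norm.item()
print(json.dumps({"log_history": log_history}))
```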

> Can you all try installing with `pip install git+https://github.com/huggingface/transformers@muellerzr-deepspeed-item`? That fixed it for me! Thanks a lot

@ichsan2895 your solution above worked. Do you know how we can extend it to a multi-node, multi-GPU setup?