Results: 3 comments of Shubhanjan Shekhar
> not sure if this was mentioned anywhere, but this PR **breaks training checkpoint saving** because
>
> 1. the grad norm is added to `TrainerState.log_history` **as a `tensor`**
> ...
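For context, the failure mode described above can be reproduced outside the Trainer: a `torch.Tensor` stored in `log_history` cannot be written out as JSON when the checkpoint state is saved. The sketch below is illustrative only; it assumes the state is serialized with `json` and uses made-up field names, and the `.item()` conversion mirrors what the linked branch name (`muellerzr-deepspeed-item`) suggests the fix does.

```python
import json

import torch

# Minimal sketch (not the actual Trainer code, field names are illustrative):
# a raw tensor in log_history cannot be JSON-serialized, so writing the
# checkpoint state fails with a TypeError.
log_history = [{"loss": 1.23, "grad_norm": torch.tensor(0.5)}]

try:
    json.dumps(log_history)
except TypeError as err:
    print(f"checkpoint save would fail: {err}")

# Converting the tensor to a plain Python float (e.g. with .item()) avoids the error.
log_history[0]["grad_norm"] = log_history[0]["grad_norm"].item()
print(json.dumps(log_history))  # now serializes cleanly
```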
> Can you all try installing with `pip install git+https://github.com/huggingface/transformers@muellerzr-deepspeed-item`? That fixed it for me!

Thanks a lot
@ichsan2895 your solution above worked. Do you know how we can extend it to a multi-node, multi-GPU setup?