Results: 3 comments of Shubhanjan Shekhar
> not sure if this was mentioned anywhere, but this PR **breaks training checkpoint saving** because
>
> 1. the grad norm is added to `TrainerState.log_history` **as a `tensor`**
> ...
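For context, the failure mode described above can be reproduced outside the Trainer: a `torch.Tensor` stored in `log_history` cannot be written out as JSON when the checkpoint state is saved. The sketch below is illustrative only; it assumes the state is serialized with `json` and uses made-up field names, and the `.item()` conversion mirrors what the linked branch name (`muellerzr-deepspeed-item`) suggests the fix does.

```python
import json

import torch

# Minimal sketch (not the actual Trainer code, field names are illustrative):
# a raw tensor in log_history cannot be JSON-serialized, so writing the
# checkpoint state fails with a TypeError.
log_history = [{"loss": 1.23, "grad_norm": torch.tensor(0.5)}]

try:
    json.dumps(log_history)
except TypeError as err:
    print(f"checkpoint save would fail: {err}")

# Converting the tensor to a plain Python float (e.g. with .item()) avoids the error.
log_history[0]["grad_norm"] = log_history[0]["grad_norm"].item()
print(json.dumps(log_history))  # now serializes cleanly
```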
> Can you all try installing with `pip install git+https://github.com/huggingface/transformers@muellerzr-deepspeed-item`? That fixed it for me!

Thanks a lot
@ichsan2895 your solution above worked. Do you know how we can extend it to a multi-node, multi-GPU setup?