
Compute grad norm

Open tcapelle opened this issue 9 months ago • 10 comments

Adds the grad norm function that was originally introduced in the improved logging experience.

  • It may be moved somewhere else, but I think it's a really relevant metric to have when training.

tcapelle avatar Apr 29 '24 10:04 tcapelle
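
For context (the PR diff itself isn't shown in this thread), here is a minimal sketch of how a total gradient norm is commonly computed and logged in a PyTorch training loop; the helper name and the `metrics` dict are illustrative, not necessarily what the PR uses:

```python
import torch

def compute_grad_norm(model: torch.nn.Module, norm_type: float = 2.0) -> float:
    """Return the total norm of all parameter gradients (illustrative helper)."""
    grads = [p.grad.detach() for p in model.parameters() if p.grad is not None]
    if not grads:
        return 0.0
    total = torch.norm(
        torch.stack([torch.norm(g, norm_type) for g in grads]), norm_type
    )
    return total.item()

# In a training loop, after loss.backward() and before optimizer.step():
#     metrics["grad_norm"] = compute_grad_norm(model)
```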

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/897

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 29 '24 10:04 pytorch-bot[bot]

We can also use torch.nn.utils.clip_grad_norm_ instead of manually calculating the norms: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html

musabgultekin avatar Apr 29 '24 18:04 musabgultekin
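
For reference, `torch.nn.utils.clip_grad_norm_` clips gradients in place and returns the total (pre-clipping) gradient norm, so the same value can be logged without a separate computation. A small, self-contained example:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

# clip_grad_norm_ clips in place and returns the total (pre-clipping) grad norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"grad_norm: {total_norm.item():.4f}")
```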

[screenshot attached in the original comment; not reproduced here]

tcapelle avatar Apr 29 '24 18:04 tcapelle

We should also add this parameter to the recipe YAMLs; the default should be 1.0, as in HF and axolotl.

tcapelle avatar Apr 29 '24 18:04 tcapelle

I need to add the:

max_norm: 1.0

to the recipes. Do you have any trick to do this automatically?

tcapelle avatar Apr 29 '24 18:04 tcapelle

> I need to add the:
>
> max_norm: 1.0
>
> to the recipes. Do you have any trick to do this automatically?

@tcapelle see my other comment. Unless I'm misunderstanding, I don't think we want to use clip_grad_norm here after all. Regardless, you raise a good point: I've also found it a bit annoying to manually modify/add a field in all our configs. I think we should try to add a tool for this at some point (but to actually answer your question, no such tool exists as of yet).

ebsmothers avatar Apr 29 '24 20:04 ebsmothers
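
Since no such tool exists, a throwaway script along these lines could handle the bulk edit. This is a hypothetical sketch: the `recipes/configs` path is an assumption about the repo layout, and it simply appends the key rather than placing it in a particular section of each config:

```python
from pathlib import Path

# Hypothetical one-off script: append `max_norm: 1.0` to every recipe config
# that does not already define it. Assumes configs live under recipes/configs/.
CONFIG_DIR = Path("recipes/configs")

for cfg in sorted(CONFIG_DIR.rglob("*.yaml")):
    text = cfg.read_text()
    if "max_norm:" in text:
        continue  # already configured, skip
    with cfg.open("a") as f:
        f.write("\nmax_norm: 1.0\n")
    print(f"updated {cfg}")
```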

Could we use float('inf') instead of 1.0, so it doesn't clip?

musabgultekin avatar Apr 30 '24 06:04 musabgultekin
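
With `max_norm=float('inf')` the clip coefficient is clamped to 1, so gradients are left unchanged while the call still returns the total norm for logging. A quick illustration:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

before = [p.grad.clone() for p in model.parameters()]

# With an infinite max_norm, no scaling is applied, but the total norm is
# still computed and returned, which is all the logging change needs.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))

assert all(torch.allclose(b, p.grad) for b, p in zip(before, model.parameters()))
print(f"grad_norm: {total_norm.item():.4f}")
```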

I feel that we should have grad clipping enabled by default. The idea is to provide a good finetuning recipe out of the box, and the grad norm is also a good debugging tool. This is a good example of the grad norm's utility: it lets you debug and even analyze gradients before the optimizer step, so loss spikes can be avoided.

tcapelle avatar Apr 30 '24 10:04 tcapelle
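
One way to act on the grad norm before stepping, as described above, is to skip the update when the norm is non-finite or abnormally large. A hedged sketch of that pattern; the threshold and the skip policy are illustrative choices, not something torchtune does:

```python
import math

def step_if_healthy(optimizer, total_norm: float, threshold: float = 10.0) -> bool:
    """Run optimizer.step() only when the grad norm looks sane (illustrative guard)."""
    if not math.isfinite(total_norm) or total_norm > threshold:
        optimizer.zero_grad(set_to_none=True)  # drop the suspect gradients
        return False  # skipping the step avoids the loss spike
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```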

While adding support for gradient clipping as a feature is nice to have, I don't think we should conflate it with what's being proposed here, which is a logging change. I definitely do not think we should enable gradient clipping by default without testing the implications of such a change on our various recipes.

As I mentioned above, I do see the value in logging the grad norm. And evidently clip_grad_norm is a reasonable way to do it (provided that we pass inf, as suggested by @musabgultekin). However, there is a cost to this change: we are calling a method clearly intended to clip gradients and using it in a non-obvious way for logging in the recipe. In my opinion, one of the top considerations for our recipes is that they are easily understandable, and I think such a change harms that a bit. So if the efficiency of the implementations is roughly equivalent, I'd actually prefer a separate utility (assuming we are not adding proper gradient clipping support, which, again, should be addressed separately and a bit more carefully imo).

ebsmothers avatar Apr 30 '24 13:04 ebsmothers

Agree with everything above. I think we should wait and test whether max_grad_norm should be used in the recipes by default or not. I can change it to float('inf') for now so it does not impact the tests. What I can say is that I have always seen this parameter being used for LLM finetuning. Some examples:

  • the HF Trainer defaults to 1.0
  • axolotl defaults to 1.0
  • the Mistral reference finetuning script defaults to 1.0
  • lit-gpt defaults to 1.0
  • nanoGPT defaults to 1.0

I can pull data from W&B runs and check when it is set to something other than 1.0 in the integrations we already have.

tcapelle avatar Apr 30 '24 13:04 tcapelle
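
For reference, that kind of audit can be done with the W&B public API; a rough sketch, where the entity/project path and the config key name are placeholders:

```python
import wandb

api = wandb.Api()
# Placeholder project path; the config key name also varies across trainers.
runs = api.runs("my-entity/my-finetuning-project")

non_default = {
    run.name: run.config.get("max_grad_norm")
    for run in runs
    if run.config.get("max_grad_norm") not in (None, 1.0)
}
print(non_default)
```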

covered by #1451

RdoubleA avatar Sep 06 '24 19:09 RdoubleA