
very wip metric logger improvements

Open ebsmothers opened this issue 1 year ago • 3 comments

Main changes: log on every step, accumulate metrics correctly across gradient accumulation iterations, and scrap log_memory_stats_every_n_steps, consolidating it with the existing log_every_n_steps.

Still need to test that I didn't break anything. If we like this approach, I can integrate it into the other recipes as well.

PS: our wandb logger test was never running and is broken
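For reference, a minimal, self-contained sketch of the accumulate-then-log pattern described above: sum the (scaled) loss across gradient accumulation iterations, and emit all metrics, including peak memory stats, on the single log_every_n_steps cadence. The variable names (running_loss, global_step, log_peak_memory_stats) and the commented-out metric_logger.log_dict call are illustrative assumptions, not the actual diff in this PR:

```python
# Hedged sketch (not the PR's diff): per-optimizer-step logging with gradient accumulation.
import torch
from torch import nn

# Toy stand-ins for the recipe's model/data; the real recipes build these from config.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(16)]

gradient_accumulation_steps = 4
log_every_n_steps = 1          # log on every optimizer step
log_peak_memory_stats = torch.cuda.is_available()

running_loss = 0.0
global_step = 0

for idx, (x, y) in enumerate(data):
    # Scale the loss so the accumulated gradient matches a full-batch step.
    loss = nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
    running_loss += loss.item()
    loss.backward()

    # An optimizer step (and therefore a logged "step") happens only once per
    # gradient_accumulation_steps iterations.
    if (idx + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        global_step += 1

        if global_step % log_every_n_steps == 0:
            log_dict = {"loss": running_loss, "lr": optimizer.param_groups[0]["lr"]}
            if log_peak_memory_stats:
                # Memory stats share the same cadence instead of a separate
                # log_memory_stats_every_n_steps knob.
                log_dict["peak_memory_alloc_gb"] = torch.cuda.max_memory_allocated() / 1e9
            # metric_logger.log_dict(log_dict, step=global_step)  # recipe's logger (assumed interface)
            print(global_step, log_dict)
        running_loss = 0.0
```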

ebsmothers avatar Apr 22 '24 00:04 ebsmothers

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/831

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit d66919e3f46d15dada45cbf4548ce12c9bf07cb5 with merge base a46560ea428939a8f5d91c7b49f189ff1787da28: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 22 '24 00:04 pytorch-bot[bot]

Hey, can I help you here? Looks similar to what I was working on: https://github.com/pytorch/torchtune/pull/730

tcapelle avatar Apr 22 '24 09:04 tcapelle

Hey, can I help you here? Looks similar to what I was working on: #730

@tcapelle thanks, yeah actually this started from trying to get the gradient accumulation test on #730 to pass and kind of expanded from there. If it's easiest for you, I'm happy to just let you commandeer this PR so you don't have to go adding the changes to all the other recipes. Let me know what you'd prefer.

ebsmothers avatar Apr 22 '24 14:04 ebsmothers

great work!

tcapelle avatar Apr 29 '24 10:04 tcapelle