
very wip metric logger improvements

Open ebsmothers opened this issue 1 year ago • 3 comments

Main changes: log on every step, accumulate metrics correctly across gradient accumulation iterations, and scrap log_memory_stats_every_n_steps, consolidating it with the existing log_every_n_steps.

Still need to test that I didn't break anything. If we like this approach, I can integrate it into the other recipes as well.

PS: our wandb logger test was never running and is broken
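For reference, a minimal, self-contained sketch of the accumulate-then-log pattern described above: sum the (scaled) loss across gradient accumulation iterations, and emit all metrics, including peak memory stats, on the single log_every_n_steps cadence. The variable names (running_loss, global_step, log_peak_memory_stats) and the commented-out metric_logger.log_dict call are illustrative assumptions, not the actual diff in this PR:

```python
# Hedged sketch (not the PR's diff): per-optimizer-step logging with gradient accumulation.
import torch
from torch import nn

# Toy stand-ins for the recipe's model/data; the real recipes build these from config.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(16)]

gradient_accumulation_steps = 4
log_every_n_steps = 1          # log on every optimizer step
log_peak_memory_stats = torch.cuda.is_available()

running_loss = 0.0
global_step = 0

for idx, (x, y) in enumerate(data):
    # Scale the loss so the accumulated gradient matches a full-batch step.
    loss = nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
    running_loss += loss.item()
    loss.backward()

    # An optimizer step (and therefore a logged "step") happens only once per
    # gradient_accumulation_steps iterations.
    if (idx + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        global_step += 1

        if global_step % log_every_n_steps == 0:
            log_dict = {"loss": running_loss, "lr": optimizer.param_groups[0]["lr"]}
            if log_peak_memory_stats:
                # Memory stats share the same cadence instead of a separate
                # log_memory_stats_every_n_steps knob.
                log_dict["peak_memory_alloc_gb"] = torch.cuda.max_memory_allocated() / 1e9
            # metric_logger.log_dict(log_dict, step=global_step)  # recipe's logger (assumed interface)
            print(global_step, log_dict)
        running_loss = 0.0
```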

ebsmothers avatar Apr 22 '24 00:04 ebsmothers

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/831

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit d66919e3f46d15dada45cbf4548ce12c9bf07cb5 with merge base a46560ea428939a8f5d91c7b49f189ff1787da28: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 22 '24 00:04 pytorch-bot[bot]

Hey, can I help you here? Looks similar to what I was working on: https://github.com/pytorch/torchtune/pull/730

tcapelle avatar Apr 22 '24 09:04 tcapelle

Hey, can I help you here? Looks similar to what I was working on: #730

@tcapelle thanks, yeah actually this started from trying to get the gradient accumulation test on #730 to pass and kind of expanded from there. If it's easiest for you, I'm happy to just let you commandeer this PR so you don't have to go adding the changes to all the other recipes. Let me know what you'd prefer.

ebsmothers avatar Apr 22 '24 14:04 ebsmothers

great work!

tcapelle avatar Apr 29 '24 10:04 tcapelle