torchtune
Improve training UX with TPS, GPU peak mem %, cudaMalloc retries, and do it in color for pizzazz
This PR addresses a tune pain point that AWS has raised for some time: they want to be able to quickly see a run's TPS (tokens per second), GPU peak memory %, and cudaMalloc retries directly on the console so they can quickly adjust their batch size and optimize throughput.
Some of this info exists already, but it's not shown on the console and is thus available only by pulling up TensorBoard or similar; seeing it directly on the console is much faster for optimizing a run.
In addition, the peak reserved memory is expressed here not as a raw number (e.g. 12 GiB) but as a percentage of total GPU memory, and it is measured during training (at the typical memory peak, i.e. the end of the forward pass) rather than before training starts, as is currently done.
Re: % of GPU peak mem - users want actionable information, not raw data. As a baseline, you want to shoot for 85%-90% peak memory with zero retries during training to optimize throughput, so seeing the percentage directly, with retries flagged in the console, makes it much easier to optimize/tune their tune training (get it... tune their tune training ;) ).
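The percentage and retry heuristic above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the helper names (`peak_mem_pct`, `in_target_band`, `read_cuda_stats`) are made up here, while the `torch.cuda` calls used to read the stats are the standard PyTorch APIs.

```python
def peak_mem_pct(peak_reserved_bytes: int, total_bytes: int) -> float:
    """Peak reserved memory as a percentage of total GPU memory."""
    return 100.0 * peak_reserved_bytes / total_bytes

def in_target_band(pct: float, retries: int,
                   lo: float = 85.0, hi: float = 90.0) -> bool:
    """Rule of thumb from the PR description: aim for ~85-90% peak
    memory with zero cudaMalloc retries."""
    return lo <= pct <= hi and retries == 0

def read_cuda_stats(device: int = 0):
    """Read peak reserved memory, total memory, and allocator retries.
    torch is imported lazily so the pure helpers above stay framework-free."""
    import torch
    total = torch.cuda.get_device_properties(device).total_memory
    peak = torch.cuda.max_memory_reserved(device)
    retries = torch.cuda.memory_stats(device).get("num_alloc_retries", 0)
    return peak, total, retries
```

For example, a 10 GiB peak on a 12 GiB card gives roughly 83.3%, just under the suggested band.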
Thus:
1 - adds a GPUMemoryMonitor class that monitors memory stats and cudaMalloc retries and reports them
2 - displays GPU total memory at start
3 - integrates these into the tqdm display, with slight rounding so everything fits nicely. Note the rounding changes only the display; it does not affect the underlying metrics sent to TensorBoard, etc.
4 - with those changes you get this:
and finally, to add some visual excitement...
5 - adds color class and integrates color into the console display resulting in this:
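The color class in item 5 could look something like the minimal sketch below. This is an assumption-laden illustration using plain ANSI escape codes, not the PR's actual implementation; the names `Color` and `colorize` are hypothetical.

```python
class Color:
    """ANSI escape codes for colorizing console/tqdm output."""
    red = "\033[31m"
    green = "\033[32m"
    yellow = "\033[33m"
    reset = "\033[0m"

def colorize(text: str, color: str) -> str:
    """Wrap text in a color code, resetting afterward so the
    rest of the console line is unaffected."""
    return f"{color}{text}{Color.reset}"
```

A metric in the target band might then be printed as `colorize("88.5%", Color.green)`, with retries flagged via `Color.red`.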
I'm posting this PR now so that AWS can use these changes. There's probably some discussion to be had, as I see there are a couple of simpler memory APIs already built, and maybe this should be consolidated, etc. Happy to discuss - in the interim I'll send this PR over for immediate use.
Context
What is the purpose of this PR? Is it to
- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Please link to any issues this PR addresses.
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)
- [ ] run pre-commit hooks and linters (make sure you've first installed via `pre-commit install`)
- [ ] add unit tests for any new functionality
- [ ] update docstrings for any new or updated methods or classes
- [ ] run unit tests via `pytest tests`
- [ ] run recipe tests via `pytest tests -m integration_test`
- [ ] manually run any new or modified recipes with sufficient proof of correctness
- [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Example of docstring: https://github.com/pytorch/torchtune/blob/6a7951f1cdd0b56a9746ef5935106989415f50e3/torchtune/modules/vision_transformer.py#L285 Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models
- [ ] I did not change any public API;
- [ ] I have added an example to docs or docstrings;
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1360
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:white_check_mark: No Failures
As of commit 4303704dfde066ac1e1ed8cd34eafa5f524b28c7 with merge base 367e9abda31b5a25805fdb9db40ed8952fd15103:
:green_heart: Looks good so far! There are no failures yet. :green_heart:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@lessw2020 thanks so much for making these changes! I think our logging to console can definitely be better and these metrics generally make a lot of sense to me. Thanks!
I'll let you answer some of @felipemello1's questions and also tag @ebsmothers, who's done a ton of work on logging in the past. Will do.
My primary suggestion would be around the design of this component and how we want to expose it in each recipe. The current method has a ton of functionality which would need to be copied over into each recipe, and this risks fragmenting the logic as more changes are made, along with bugs resulting from duplication.
100% agree - that's exactly what I meant about the need to discuss better integration overall. This initial PR was to show it working and get something to AWS ASAP for their immediate use case (single-GPU LoRA), and then work on a better integration assuming you all like the concept overall.
I'd suggest designing a stateful class that we can initialize during setup of the recipe, which tracks all of the information needed as part of its state and exposes some APIs which can return the relevant information to the `train` method. I think it's ok to expect the recipe to do the logging of the actual metrics. I think this class can also run some aggregates (e.g. max, avg) which we can log at the end of training to give some more insights. A few benefits of this approach: Agree - this is very similar to how we implemented things in titan, btw.
- The logic is consolidated in a single class, so all issues like distributed vs. single GPU and interaction with other tooling can be addressed in one place
- Any updates to the logic are automatically shared across recipes rather than copy-pasted again and again
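The stateful design suggested above could be sketched roughly as below. This is a hypothetical outline, not code from the PR or from torchtune: the class name `TrainingMetricsMonitor` and its method names are made up for illustration. The recipe would construct it during setup, feed it each step, log the per-step values itself, and ask for aggregates at the end.

```python
class TrainingMetricsMonitor:
    """Hypothetical stateful monitor: tracks per-step metrics and
    computes end-of-training aggregates (max, avg)."""

    def __init__(self) -> None:
        self._tps: list[float] = []
        self._peak_mem_pct: list[float] = []
        self._retries = 0

    def record_step(self, tps: float, peak_mem_pct: float, retries: int) -> None:
        """Called once per train step; the recipe does the actual logging."""
        self._tps.append(tps)
        self._peak_mem_pct.append(peak_mem_pct)
        self._retries = retries

    def step_summary(self) -> dict:
        """Latest values for the tqdm/console display."""
        return {
            "tps": self._tps[-1],
            "peak_mem_pct": self._peak_mem_pct[-1],
            "retries": self._retries,
        }

    def final_summary(self) -> dict:
        """Aggregates logged once at the end of training."""
        return {
            "tps_avg": sum(self._tps) / len(self._tps),
            "tps_max": max(self._tps),
            "peak_mem_pct_max": max(self._peak_mem_pct),
            "alloc_retries": self._retries,
        }
```

Because all the bookkeeping lives in this one class, a distributed-vs-single-GPU fix or a new aggregate lands in every recipe at once.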
Does this make sense?
Yes seems ideal from a design standpoint.