torchtune
[WIP] Add a Perf Monitor for metric tracking.
Context
- In this PR, we introduce TunePerfMonitor, a utility class for tracking metrics across training. The class is meant to be flexible in which metrics are tracked: metrics are defined and tracked by the user (see the example in the recipe, and the rough usage sketch below).
- Please see the "LIMITATIONS" section in the code for limitations of this tracker in its current state.
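Below is a rough usage sketch only; the method names (`log`, `averages`) and class body are illustrative assumptions, not the actual API introduced in this PR.

```python
# Hypothetical sketch -- the real TunePerfMonitor API in this PR may differ.
import time
from collections import defaultdict

import torch


class TunePerfMonitor:
    """Minimal illustration of a user-defined metric tracker (assumed API)."""

    def __init__(self) -> None:
        self._metrics = defaultdict(list)

    def log(self, name: str, value: float) -> None:
        # Users decide which metrics exist; nothing is hard-coded here.
        self._metrics[name].append(value)

    def averages(self) -> dict:
        return {k: sum(v) / len(v) for k, v in self._metrics.items() if v}


# Example recipe-style usage: seconds per iteration and post-backward peak memory.
monitor = TunePerfMonitor()
for step in range(3):
    start = time.perf_counter()
    # ... forward / backward / optimizer step would go here ...
    if torch.cuda.is_available():
        monitor.log("max_mem_post_backward_gb", torch.cuda.max_memory_allocated() / 1e9)
    monitor.log("seconds_per_iter", time.perf_counter() - start)

print(monitor.averages())  # e.g. pass these to a WandB logger each step
```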
Changelog
- ...
Test plan
- Unittests
- In the full finetune single device recipe, I've integrated a few basic metrics (average seconds per iteration and max memory allocated after the backward pass) and surfaced them to WandB. Charts look as follows:
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/608
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:x: 9 New Failures
As of commit 69d86fb8d3a3b9f2bb015fcef6c78fb59289f6da with merge base aacaadd38820f95be90339b92bbe14c66ea27e02:
NEW FAILURES - The following jobs have failed:
- Lint / lint (3.10) (gh)
  torchtune/utils/perf_utils.py:12:1: F401 'torch' imported but unused
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.10) (gh)
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.11) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.8) (gh)
  ##[error]The operation was canceled.
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.9) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
- Recipe Tests / recipe_test (3.10) (gh)
  ##[error]The operation was canceled.
- Recipe Tests / recipe_test (3.11) (gh)
  ##[error]The operation was canceled.
- Recipe Tests / recipe_test (3.8) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
- Recipe Tests / recipe_test (3.9) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for making this RFC PR @rohan-varma! Sharing my 2c:
- It seems there are many functions supporting different tracking use cases. Shall we begin with a less generic class design and focus on the priority tracking needs we have (such as QPS, memory stats, etc.)?
- Based on the discussion in https://github.com/pytorch/torchtune/pull/604, I'm considering 2 tracking cases here:
  - Metrics needed in almost every run with minimal performance overhead (such as training QPS, memory stats, etc.). For this case, we can use the perfMonitor class as you proposed in this PR.
  - Tracing that is not needed in every run, is only useful when debugging, and carries perf overhead, such as the torch profiler and memory snapshots (https://pytorch.org/blog/understanding-gpu-memory-1/). We can consider making these standalone components and bringing them into the recipe when necessary. I think a better end state is to enable them via a 'debug' flag or to invoke them on OOM (see the sketch after this comment).
cc: @kartikayk
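A minimal sketch (not part of this PR) of what a debug-flag-gated memory-snapshot component could look like, assuming the private `torch.cuda.memory._record_memory_history` / `_dump_snapshot` hooks from the linked blog post; these hook signatures vary across PyTorch versions:

```python
# Sketch only (not from this PR): a debug-flag-gated memory-snapshot component.
# Uses the private torch.cuda.memory hooks from the linked blog post; signatures
# vary across PyTorch versions.
import torch


class MemorySnapshot:
    """Records CUDA memory history and dumps it to a file when enabled."""

    def __init__(self, enabled: bool = False, out_file: str = "memory_snapshot.pickle"):
        self.enabled = enabled and torch.cuda.is_available()
        self.out_file = out_file

    def start(self) -> None:
        if self.enabled:
            torch.cuda.memory._record_memory_history(max_entries=100_000)

    def stop_and_dump(self) -> None:
        if self.enabled:
            torch.cuda.memory._dump_snapshot(self.out_file)
            torch.cuda.memory._record_memory_history(enabled=None)


# In a recipe this would only be constructed when a debug flag is set, e.g.
# (hypothetical config key):
#   snapshot = MemorySnapshot(enabled=cfg.get("debug", False))
#   snapshot.start(); ...train...; snapshot.stop_and_dump()
```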
@SLR722 It makes sense to me to make the profiler and memory snapshot individual standalone components, though I don't see why these should live together within the same component when they are quite different, have different APIs, and are meant to debug different issues. What issues come up if we build these as two separate components that don't interfere with each other and can be enabled individually or together?
Agree that we should build 2 separate components for profiler and memory snapshot.
Let's consolidate the discussion from https://github.com/pytorch/torchtune/pull/604. Shall we align on this design?
- Have 2 separate components for the PyTorch profiler and memory snapshot, which users can plug in for debugging.
- Have 1 perfTracker class for the general tracking needed in every run, such as QPS, memory stats, etc.
cc: @rohan-varma @kartikayk
@SLR722 Seems reasonable to me. I don't fully see the need to add all of these into a singular component, but I do see value in having a single entrypoint to manage all performance-related things. Would like thoughts from @kartikayk and @ebsmothers.
Sure, we don't need to add all of these into a single component if it doesn't make sense to. But I'd like to preserve the design principle around having self-contained components which can be pulled into any recipe. Make as many components as we need (within reason).
Thanks @skcoirz @RdoubleA for your comments on the PR; sorry that the integration into the recipe is currently in a messy state. I'm mostly looking for feedback on the API and class itself right now, but I realize this will be easier to contextualize with a clearer example of it in the recipe. Adding that now.
LGTM! Please fix the tests and clean up the code and then we are good to go!