torchtune
[WIP] Add a Perf Monitor for metric tracking.
Context
- In this PR, we introduce TunePerfMonitor, a utility class for tracking metrics across training. The class is meant to be flexible in which metrics are tracked: metrics are defined and tracked by the user (see the example in the recipe, and the rough usage sketch below).
- Please see the "LIMITATIONS" section in the code for limitations of this tracker in its current state.
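Below is a rough usage sketch only; the method names (`log`, `averages`) and class body are illustrative assumptions, not the actual API introduced in this PR.

```python
# Hypothetical sketch -- the real TunePerfMonitor API in this PR may differ.
import time
from collections import defaultdict

import torch


class TunePerfMonitor:
    """Minimal illustration of a user-defined metric tracker (assumed API)."""

    def __init__(self) -> None:
        self._metrics = defaultdict(list)

    def log(self, name: str, value: float) -> None:
        # Users decide which metrics exist; nothing is hard-coded here.
        self._metrics[name].append(value)

    def averages(self) -> dict:
        return {k: sum(v) / len(v) for k, v in self._metrics.items() if v}


# Example recipe-style usage: seconds per iteration and post-backward peak memory.
monitor = TunePerfMonitor()
for step in range(3):
    start = time.perf_counter()
    # ... forward / backward / optimizer step would go here ...
    if torch.cuda.is_available():
        monitor.log("max_mem_post_backward_gb", torch.cuda.max_memory_allocated() / 1e9)
    monitor.log("seconds_per_iter", time.perf_counter() - start)

print(monitor.averages())  # e.g. pass these to a WandB logger each step
```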
Changelog
- ...
Test plan
- Unittests
- In the full finetune single device recipe, I've integrated a few basic metrics (average seconds per iteration and max memory allocated after the backward pass) and surfaced them to WandB. Charts look as follows:
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/608
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:x: 9 New Failures
As of commit 69d86fb8d3a3b9f2bb015fcef6c78fb59289f6da with merge base aacaadd38820f95be90339b92bbe14c66ea27e02:
NEW FAILURES - The following jobs have failed:
- Lint / lint (3.10) (gh)
  torchtune/utils/perf_utils.py:12:1: F401 'torch' imported but unused
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.10) (gh)
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.11) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.8) (gh)
  ##[error]The operation was canceled.
- Multi-GPU Recipe Tests / recipe_test_multi_gpu (3.9) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
- Recipe Tests / recipe_test (3.10) (gh)
  ##[error]The operation was canceled.
- Recipe Tests / recipe_test (3.11) (gh)
  ##[error]The operation was canceled.
- Recipe Tests / recipe_test (3.8) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
- Recipe Tests / recipe_test (3.9) (gh)
  tests/recipes/test_full_finetune_single_device.py::TestFullFinetuneSingleDeviceGradientAccumulation::test_gradient_accumulation
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for making this RFC PR @rohan-varma! Sharing my 2c:
- It seems there are many functions supporting different tracking use cases. Shall we begin with a less generic class design and focus on the priority tracking needs we have (such as QPS, memory stats, etc.)?
- Based on the discussion in https://github.com/pytorch/torchtune/pull/604, I'm considering 2 tracking cases here:
  - Metrics needed in almost every run with minimal performance overhead (such as training QPS, memory stats, etc.). For this case, we can use the perfMonitor class as you proposed in this PR.
  - Tracing that is not needed in every run, is only useful when debugging, and carries perf overhead, such as the torch profiler and memory snapshots (https://pytorch.org/blog/understanding-gpu-memory-1/). We can consider making these standalone components and bringing them into the recipe when necessary. I think a better end state is to enable them via a 'debug' flag or to invoke them on OOM (see the sketch after this comment).
cc: @kartikayk
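A minimal sketch (not part of this PR) of what a debug-flag-gated memory-snapshot component could look like, assuming the private `torch.cuda.memory._record_memory_history` / `_dump_snapshot` hooks from the linked blog post; these hook signatures vary across PyTorch versions:

```python
# Sketch only (not from this PR): a debug-flag-gated memory-snapshot component.
# Uses the private torch.cuda.memory hooks from the linked blog post; signatures
# vary across PyTorch versions.
import torch


class MemorySnapshot:
    """Records CUDA memory history and dumps it to a file when enabled."""

    def __init__(self, enabled: bool = False, out_file: str = "memory_snapshot.pickle"):
        self.enabled = enabled and torch.cuda.is_available()
        self.out_file = out_file

    def start(self) -> None:
        if self.enabled:
            torch.cuda.memory._record_memory_history(max_entries=100_000)

    def stop_and_dump(self) -> None:
        if self.enabled:
            torch.cuda.memory._dump_snapshot(self.out_file)
            torch.cuda.memory._record_memory_history(enabled=None)


# In a recipe this would only be constructed when a debug flag is set, e.g.
# (hypothetical config key):
#   snapshot = MemorySnapshot(enabled=cfg.get("debug", False))
#   snapshot.start(); ...train...; snapshot.stop_and_dump()
```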
@SLR722 It makes sense to me to make the profiler and memory snapshot individual standalone components, though I don't see why these should live together within the same component when they are quite different, have different APIs, and are meant to debug different issues. What issues come up if we build these as two separate components that don't interfere with each other and can be enabled individually or together?
Agree that we should build 2 separate components for profiler and memory snapshot.
Let's consolidate the discussion from https://github.com/pytorch/torchtune/pull/604. Shall we align on this design?
- Have 2 separate components for the PyTorch profiler and memory snapshot, which users can plug in for debugging.
- Have 1 perfTracker class for the general tracking needed in every run, such as QPS, memory stats, etc.
cc: @rohan-varma @kartikayk
@SLR722 Seems reasonable to me. I don't fully see the need to add all of these into a singular component, but I do see value in having a single entrypoint to manage all performance-related things. Would like thoughts from @kartikayk and @ebsmothers.
Sure, we don't need to add all of these into a single component if it doesn't make sense to. But I'd like to preserve the design principle around having self-contained components which can be pulled into any recipe. Make as many components as we need (within reason).
Thanks @skcoirz @RdoubleA for your comments on the PR; sorry that the integration into the recipe is currently in a messy state. I'm mostly looking for feedback on the API and class itself right now, but I realize this will be easier to contextualize with a clearer example of it in the recipe. Adding that now.
LGTM! Please fix the tests and clean up the code and then we are good to go!