Adds validation loss to LoRA fine tune single device
Context
What is the purpose of this PR? Is it to
- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Please link to any issues this PR addresses. https://github.com/pytorch/torchtune/issues/1042
Changelog
What are the changes made in this PR? Adds support for a validation dataset and computes the loss on it after each epoch.
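As a rough illustration of what the recipe gains, here is a minimal sketch of a validation-loss pass. The function name, batch structure, and loss handling are assumptions for illustration, not the PR's exact code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


@torch.no_grad()
def compute_validation_loss(
    model: nn.Module,
    val_dataloader: DataLoader,
    loss_fn: nn.Module,
    device: torch.device,
) -> float:
    """Mean loss over the validation set, with gradients disabled.

    Illustrative sketch only; assumes each batch is a (tokens, labels)
    pair and that labels are already shifted by the dataset.
    """
    model.eval()
    total_loss, num_batches = 0.0, 0
    for tokens, labels in val_dataloader:
        tokens, labels = tokens.to(device), labels.to(device)
        logits = model(tokens)
        # Flatten (batch, seq_len, vocab) logits for token-level cross-entropy
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        total_loss += loss.item()
        num_batches += 1
    model.train()
    return total_loss / max(num_batches, 1)
```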
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- [x] run pre-commit hooks and linters (make sure you've first installed via `pre-commit install`)
- [ ] add unit tests for any new functionality
- [ ] update docstrings for any new or updated methods or classes
- [x] run unit tests via `pytest tests`
- [x] run recipe tests via `pytest tests -m integration_test`
- [x] manually run any new or modified recipes with sufficient proof of correctness
- [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

`tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device`
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example
- [ ] I did not change any public API
- [ ] I have added an example to docs or docstrings
Hi @MaxFrax!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@felipemello1 I've finally been able to work on this. I'll make my way through the testing plan, but feel free to share any comments you might already have.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Hey @MaxFrax, thank you! I am on PTO this week; I will get to it next week if someone doesn't do it before me.
Hi @ebsmothers! I have updated the PR with the following edits, as per your recommendation:
- Created a standalone method for the validation loop
- Added a `run_val_every_n_steps` parameter to invoke validation at specific points within the training epoch
- Also added `max_validation_batches` to cap the number of batches run in each validation step

A rough sketch of how these two knobs fit together follows this comment. If there's any other feedback or comments, just let me know!
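The sketch below shows how the two parameters could gate and cap validation inside the training loop; `should_run_val` and `capped_val_batches` are hypothetical helper names, not identifiers from the PR:

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def should_run_val(global_step: int, run_val_every_n_steps: Optional[int]) -> bool:
    """True on the steps where training should pause for a validation pass."""
    return (
        run_val_every_n_steps is not None
        and global_step % run_val_every_n_steps == 0
    )


def capped_val_batches(
    val_dataloader: Iterable, max_validation_batches: Optional[int]
) -> Iterator:
    """Yield at most max_validation_batches batches (all of them if None)."""
    return islice(val_dataloader, max_validation_batches)
```

Inside the step loop this would read roughly as `if should_run_val(step, n): validate(capped_val_batches(val_dataloader, m))`.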
Thanks for making the changes! I will take a look at this PR later today.
Maybe we could at least update the LoRA single-device configs to expose these fields, so users know they exist and we can check whether the cfg field is None directly?
Let's do this as a follow-up. I can use my script to bulk-update, but let's make sure we all agree on how it should look in the config.
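For illustration, assuming the fields land as optional top-level keys (the exact layout is what still needs agreement here), the recipe-side check could look like this sketch. torchtune configs are parsed with OmegaConf, and `cfg.get(..., None)` returns `None` for keys that are not exposed:

```python
from omegaconf import OmegaConf

# Hypothetical top-level keys in a single-device LoRA config; absent or
# null would mean "validation disabled".
cfg = OmegaConf.create(
    {
        "run_val_every_n_steps": 100,  # validate every 100 training steps
        "max_validation_batches": 10,  # cap each validation pass at 10 batches
    }
)

# Checking the field directly, defaulting to None when it is not exposed:
run_val_every_n_steps = cfg.get("run_val_every_n_steps", None)
if run_val_every_n_steps is not None:
    print(f"validation every {run_val_every_n_steps} steps")
```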
Codecov Report
Attention: Patch coverage is 0% with 28 lines in your changes missing coverage. Please review.
Project coverage is 23.93%. Comparing base (213f386) to head (df8cd1e). Report is 18 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| recipes/lora_finetune_single_device.py | 0.00% | 28 Missing :warning: |
:exclamation: There is a different number of reports uploaded between BASE (213f386) and HEAD (df8cd1e). HEAD has 6 fewer uploads than BASE.

| Flag | BASE (213f386) | HEAD (df8cd1e) |
|---|---|---|
|  | 9 | 3 |
Additional details and impacted files
```
@@ Coverage Diff @@
## main #2238 +/- ##
===========================================
- Coverage 65.41% 23.93% -41.49%
===========================================
Files 344 357 +13
Lines 20658 21153 +495
===========================================
- Hits 13514 5062 -8452
- Misses 7144 16091 +8947
```
Thanks @felipemello1! Some help on the testing side would be much appreciated. When you say:

> An example of how the config should look. The UI should play a big factor in this PR.

what exactly do you mean? Should I provide a recipe using the validation dataset? Are we talking about the docs? Let me know more precisely what I should do, and I'll be happy to look into it.
@MaxFrax

> Should I provide a recipe using the validation dataset? Are we talking about the docs? Let me know more precisely what I should do, and I'll be happy to look into it.

Your PR only contains changes to the recipe. I would encourage you to:
- make changes to one of the configs too, to illustrate how users would use it
- put the command to launch this config in the description of the PR, under the testing section
- share an image of the logs generated in Weights & Biases under the testing section
Thanks for the feedback!
I have edited the llama3_2/1B_lora_single_device config and uploaded the W&B screenshot (I capped the training steps and changed the batch sizes to produce the screenshot, rather than running exactly the committed recipe). I've resolved all the issues raised in the PR review except for the model evaluation mode and the validation memory logging.
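For context on the evaluation-mode follow-up: the usual PyTorch pattern is to switch the model to eval mode and disable autograd for the duration of validation, then restore the training state. This is a generic sketch of that pattern, not the PR's final code:

```python
import torch
from torch import nn


def run_in_eval_mode(model: nn.Module, fn):
    """Temporarily put the model in eval mode with autograd off, then restore.

    torch.inference_mode() also skips storing activations, which keeps
    validation memory low (relevant to the memory-logging follow-up).
    """
    was_training = model.training
    model.eval()
    try:
        with torch.inference_mode():
            return fn()
    finally:
        if was_training:
            model.train()
```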
I'd like to check on the status of this PR: our team has been testing #2464 and it has been very useful, but we make heavy use of the single-device recipes. Will this move forward, or would an implementation of #2464 for the single-device recipes be the goal?
Hey @niznik-dev, thanks for the comment. @MaxFrax did a fantastic job here, but a few follow-ups were missing. Then https://github.com/pytorch/torchtune/pull/2464 was put up, those last few gaps were addressed, and we landed it.
I believe @krammnic was going to work on bringing it to other recipes as well. For now, if you want, you can go ahead with how it was implemented in https://github.com/pytorch/torchtune/pull/2464.