
Adds validation loss to LoRA fine tune single device

Open · MaxFrax opened this pull request 11 months ago • 14 comments

Context

What is the purpose of this PR? Is it to

  • [x] add a new feature
  • [ ] fix a bug
  • [ ] update tests and/or documentation
  • [ ] other (please add here)

Please link to any issues this PR addresses. https://github.com/pytorch/torchtune/issues/1042

Changelog

What are the changes made in this PR? Adds support for a validation dataset and computes the loss on it after each epoch.
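
At its core, this amounts to a no-grad pass over the validation set whose average loss is logged at the end of each epoch. A minimal sketch of that computation (illustrative names, not the actual recipe code):

    import torch

    def compute_validation_loss(model, val_dataloader, loss_step):
        """Average loss over the validation set.

        `loss_step(batch)` is assumed to run the forward pass and return a scalar
        loss tensor, mirroring the recipe's training loss computation.
        """
        model.eval()
        total, n_batches = 0.0, 0
        with torch.no_grad():
            for batch in val_dataloader:
                total += loss_step(batch).item()
                n_batches += 1
        model.train()
        return total / max(n_batches, 1)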

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • [x] run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • [ ] add unit tests for any new functionality
  • [ ] update docstrings for any new or updated methods or classes
  • [x] run unit tests via pytest tests
  • [x] run recipe tests via pytest tests -m integration_test
  • [x] manually run any new or modified recipes with sufficient proof of correctness
  • [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device

[Screenshot: W&B logs from the test run]

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example

  • [ ] I did not change any public API
  • [ ] I have added an example to docs or docstrings

MaxFrax avatar Jan 08 '25 14:01 MaxFrax

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2238

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Jan 08 '25 14:01 pytorch-bot[bot]

Hi @MaxFrax!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Jan 08 '25 14:01 facebook-github-bot

@felipemello1 Finally I have been able to work on this. I'll make my way through the testing plan, but feel free to share any comments you might already have.

MaxFrax avatar Jan 08 '25 14:01 MaxFrax

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

facebook-github-bot avatar Jan 08 '25 15:01 facebook-github-bot

hey @MaxFrax, thank you! I am on PTO this week. I will get to it next week if someone doesn't get to it before me.

felipemello1 avatar Jan 09 '25 03:01 felipemello1

Hi @ebsmothers ! I have updated the PR with the following edits, as per your recommendations:

  • Created a standalone method for the validation loop
  • Added a run_val_every_n_steps parameter to invoke validation at specific points within the training epoch
  • Added max_validation_batches to cap the number of batches run in each validation pass (a rough sketch of how these pieces fit together is below)
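
Under assumed names (the actual method and attribute names are whatever the recipe ends up defining), the two parameters would wire in roughly like this, reusing a loss helper along the lines of the compute_validation_loss sketch under Changelog above:

    import itertools

    # Capping: feed only the first `max_validation_batches` batches to the
    # validation-loss helper sketched earlier (None means no cap).
    def validate(model, val_dataloader, loss_step, max_validation_batches=None):
        capped = itertools.islice(val_dataloader, max_validation_batches)
        return compute_validation_loss(model, capped, loss_step)

    # Frequency: inside the training loop, the call is gated on the step counter, e.g.
    # if run_val_every_n_steps is not None and global_step % run_val_every_n_steps == 0:
    #     val_loss = validate(model, val_dataloader, loss_step, max_validation_batches)
    #     metric_logger.log("val_loss", val_loss, step=global_step)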

If there's any other feedback or comment, just let me know!

MaxFrax avatar Jan 15 '25 13:01 MaxFrax

Thanks for making the changes! I will take a look at this PR later today.

felipemello1 avatar Jan 15 '25 17:01 felipemello1

maybe we could at least update the lora single device configs to expose these fields so users know that it exists and we can check if the cfg field is None directly?

Let's do this as a follow-up. I can use my script to bulk-update. But let's make sure that we all agree on how it should look in the config.
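
For reference, checking the cfg field directly would presumably amount to something like this during recipe setup (a sketch only: run_val_every_n_steps and max_validation_batches are the fields from this PR, while dataset_val and _build_dataloader are hypothetical names):

    from typing import Optional

    from omegaconf import DictConfig

    def _setup_validation(self, cfg: DictConfig) -> None:
        # Both fields default to None, so validation is skipped unless the config opts in.
        self.run_val_every_n_steps: Optional[int] = cfg.get("run_val_every_n_steps", None)
        self.max_validation_batches: Optional[int] = cfg.get("max_validation_batches", None)

        self._val_dataloader = None
        if cfg.get("dataset_val", None) is not None:
            # Build the validation dataloader, mirroring the training data setup
            # (hypothetical helper; the real recipe has its own data-setup method).
            self._val_dataloader = self._build_dataloader(cfg.dataset_val, shuffle=False)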

felipemello1 avatar Jan 15 '25 17:01 felipemello1

Codecov Report

Attention: Patch coverage is 0% with 28 lines in your changes missing coverage. Please review.

Project coverage is 23.93%. Comparing base (213f386) to head (df8cd1e). Report is 18 commits behind head on main.

Files with missing lines                  Patch %   Lines
recipes/lora_finetune_single_device.py    0.00%     28 Missing :warning:

:exclamation: There is a different number of reports uploaded between BASE (213f386) and HEAD (df8cd1e). Click for more details.

HEAD has 6 fewer uploads than BASE:

Flag    BASE (213f386)    HEAD (df8cd1e)
        9                 3
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2238       +/-   ##
===========================================
- Coverage   65.41%   23.93%   -41.49%     
===========================================
  Files         344      357       +13     
  Lines       20658    21153      +495     
===========================================
- Hits        13514     5062     -8452     
- Misses       7144    16091     +8947     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jan 15 '25 18:01 codecov-commenter

Thanks @felipemello1 ! Some help on the testing side would be much appreciated. When you say:

  1. An example of how the config should look like. The UI should play a big factor on this PR.

what exactly do you mean? Should I provide a recipe using the validation dataset? Are we talking about the docs? Let me know more precisely what I should do, and I'll be happy to look into it.

MaxFrax avatar Jan 16 '25 15:01 MaxFrax

@MaxFrax

Should I provide a recipe using the validation dataset? Are we talking about the docs? Let me know more precisely what I should do, and I'll be happy to look into it.

Your PR only contains changes to the recipe. I would encourage you to:

  • Make changes to one of the configs too, to illustrate how users would use it
  • Put the command to launch this config in the description of the PR, under the testing section
  • Share an image of the logs generated in Weights & Biases under the testing section

felipemello1 avatar Jan 16 '25 16:01 felipemello1

Thanks for the feedback!

I have edited the llama3_2/1B_lora_single_device config and uploaded the W&B screenshot (I capped the training steps and changed the batch sizes for the screenshot rather than running exactly the committed recipe). I've addressed all the issues raised in the PR review except for the model evaluation mode and the validation memory logging.

MaxFrax avatar Jan 26 '25 18:01 MaxFrax

I'd like to check on the status of this PR - our team has been testing #2464 and it has been very useful, but we make heavy use of single-device recipes. Will this move forward, or would an implementation of #2464 for the single-device recipes be the goal?

niznik-dev avatar Apr 21 '25 18:04 niznik-dev

hey @niznik-dev, thanks for the comment. @MaxFrax did a fantastic job here, but a few follow-ups were missing. PR https://github.com/pytorch/torchtune/pull/2464 picked up those last few gaps, so we landed it.

I believe @krammnic was going to work on bringing it to other recipes as well. For now, if you want, you can go ahead with how it was implemented in https://github.com/pytorch/torchtune/pull/2464

felipemello1 avatar Apr 21 '25 18:04 felipemello1