Switch remainder of recipe tests over to HF format checkpoints

Open adheep04 opened this issue 8 months ago • 3 comments

Context

  • [x] update tests by switching to hf format
  • addresses #2816
  • currently a draft (4 more recipe tests still need to be finished)

Changelog

  • Switched the following recipe tests to HF format checkpoints, similar to #2815:
  • test_knowledge_distillation_distributed.py
  • test_knowledge_distillation_single_device.py
  • test_lora_dpo_single_device.py
  • test_lora_finetune_distributed.py
  • test_lora_finetune_single_device.py
  • test_qat_distributed.py
  • test_qat_lora_finetune_distributed.py
  • test_qat_single_device.py
  • utils.py
  • Added support for a lora/dora/qlora config for "llama3_hf_138m" in utils.py

Remaining recipes:

  • test_eleuther_eval.py
  • test_full_dpo_distributed.py
  • test_ppo_full_finetune_single_device.py
  • test_dpo_distributed.py

TODO

  • [ ] complete remaining 4 recipe tests
  • [ ] run unit tests via pytest tests
  • [ ] run recipe tests via pytest tests -m integration_test

Questions/Notes

  • In the full fine-tune tests, the tolerance changed from rtol=1e-4 to rtol=1e-3. I have not changed the rtol values in any of the recipes I touched. Is this okay, or should the tolerance threshold be increased for the others too?
  • I ran a few tests on my machine and hit quite a few OOMs. I'm not sure I will be able to update the expected losses, so those are currently either commented out or marked with a "# TODO". I may need some help running the tests if I can't run them myself. Alternatively, I could work around the OOMs by lowering the vocab size, if that is okay?
  • I committed an unfinished version of test_dpo_distributed by accident.
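
For reference on the rtol question, here is a minimal sketch of what the two thresholds mean in practice. It assumes the recipe tests compare losses with a check of the form |actual - expected| <= atol + rtol * |expected|, which is the rule documented for torch.testing.assert_close. The helper name `is_close` and the loss values are hypothetical and only there for illustration.

```python
def is_close(actual: float, expected: float, rtol: float, atol: float = 0.0) -> bool:
    """Return True when |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical loss values, chosen only to illustrate the two thresholds.
expected_loss = 10.5
actual_loss = 10.505  # relative difference of roughly 4.8e-4

# The allowed absolute gap is rtol * |expected|: 0.00105 for 1e-4, 0.0105 for 1e-3.
passes_tight = is_close(actual_loss, expected_loss, rtol=1e-4)
passes_loose = is_close(actual_loss, expected_loss, rtol=1e-3)
print(passes_tight, passes_loose)
```

So a loss that drifts in the fourth significant digit would fail at rtol=1e-4 but pass at rtol=1e-3, which is the trade-off behind the question above.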

@krammnic Please let me know if you notice any bugs or errors, or if you have any pointers for me. I should have the remaining recipes done within the next couple of days. Sorry again for the long wait! Life's been pretty crazy haha.

adheep04 • Jul 08 '25 00:07

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2871

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] • Jul 08 '25 00:07

Completed

  • all but 1 recipe

New TODO

  • [ ] switch test_lora_dpo_distributed to hf format
  • [ ] run tests and update expected losses (may need some help to run some tests in case of OOMs!)

adheep04 • Jul 11 '25 20:07

Hey! Thanks for the PR. I will review it tomorrow.

krammnic • Jul 11 '25 20:07