torchtune
Implement step based checkpointing
Context
What is the purpose of this PR? Is it to
- [x] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Closes #2105. This is a widely requested feature that allows users to have greater control over checkpointing frequency in torchtune.
TODO: Add commentary on design decisions. Acknowledge spaghetti code. Beg forgiveness.
Changelog
- Update `FullModelHFCheckpointer` to accept a step parameter when saving a checkpoint. Use that step to designate the checkpoint folder name. Keep `epoch_{}` as a fall-back for BC.
- Modify the `full_finetune_single_device.py` recipe to utilize step-based checkpointing.
- Add tests for the `full_finetune_single_device.py` recipe w/ step-based checkpointing.
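A minimal sketch of how the new save path might be exercised from a recipe. The `step` kwarg and the `step_{N}` folder naming are assumptions based on the changelog above, not the merged API; the paths and surrounding loop variables are placeholders.

```python
# Sketch only (not the merged API): saving an intermediate checkpoint keyed by
# global step. Assumes FullModelHFCheckpointer.save_checkpoint gains an optional
# `step` argument that writes to ${output_dir}/step_{step}; with no step it
# falls back to the legacy ${output_dir}/epoch_{epoch} layout.
from torchtune.training import FullModelHFCheckpointer

checkpointer = FullModelHFCheckpointer(
    checkpoint_dir="/tmp/Llama-3.2-1B-Instruct",   # hypothetical paths
    checkpoint_files=["model.safetensors"],
    model_type="LLAMA3_2",
    output_dir="/tmp/torchtune/llama3_2_1B/full_single_device",
)

# Inside the training loop (model, curr_epoch, global_step come from the recipe),
# every N optimizer steps:
checkpointer.save_checkpoint(
    state_dict={"model": model.state_dict()},  # plus recipe state for mid-epoch resume
    epoch=curr_epoch,
    step=global_step,            # assumed kwarg name -> selects step_{global_step}/
    intermediate_checkpoint=True,
)
```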
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- [x] run pre-commit hooks and linters (make sure you've first installed via `pre-commit install`)
- [x] add unit tests for any new functionality
- [x] update docstrings for any new or updated methods or classes
- [x] run unit tests via `pytest tests`
- [x] run recipe tests via `pytest tests -m integration_test`
- [x] manually run any new or modified recipes with sufficient proof of correctness
- [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
Evidence of correct number of checkpoints being saved
```
(joe-torchtune) [[email protected] ~/projects/joe-torchtune (impl-step-based-ckpt)]$ ls /tmp/torchtune/llama3_2_1B/full_single_device/
step_100  step_125  step_150  step_175  step_200  step_25  step_50  step_75  torchtune_config.yaml
```
Evidence of correct resuming from ckpt mid-epoch
Evidence of correct resuming from ckpt at epoch boundary
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example
- [ ] I did not change any public API
- [x] I have added an example to docs or docstrings
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2384
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:heavy_exclamation_mark: 1 Active SEV
There is 1 currently active SEV. If your PR is affected, please view it below:
:x: 2 New Failures, 2 Unrelated Failures
As of commit 650d91d9208a9e9e2fb47ecd483d08ffdf7d9528 with merge base 3d735916bd9efca600f79fe8d77a757c1160279a:
NEW FAILURES - The following jobs have failed:
- GPU tests / gpu_test (3.11, stable) (gh)
  tests/recipes/test_qat_lora_finetune_distributed.py::TestQATLoRAFinetuneDistributedRecipe::test_training_state_on_resume_with_async_checkpointing[llama3/8B_qat_lora-llama3-tune-False]
- Lint / lint (3.10) (gh)
  Process completed with exit code 1.
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
- GPU tests / gpu_test (3.10, stable) (gh) (trunk failure)
  ##[error]The operation was canceled.
- GPU tests / gpu_test (3.9, stable) (gh) (trunk failure)
  ##[error]The operation was canceled.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
- [x] `recipe_state` is still saved to `${output_dir}`, not `${output_dir}/step_XXX`
- [x] `resume_from_checkpoint` logic should be updated
  - RN it looks for the `${output_dir}` checkpoint, not `step_XXX`
  - maybe replace top-level `cfg.resume_from_checkpoint` with `cfg.checkpointer.resume_from`, which is either "latest" (default) or the path to the checkpoint to resume from. Or separate, mutually exclusive `resume_from: /path/` and `resume_from_latest: True` (see the sketch after this list)
  - offtopic, but `cfg.resume_from_checkpoint` is mentioned in code as deprecated and replaced by `should_load_recipe_state`, but de facto `resume_from_checkpoint` is mandatory and `should_load_recipe_state` doesn't work
- [x] `recipe_state` has the proper step and epoch to continue from, but the train cycle still starts from 0 -> logs start from 0 & checkpointing starts from 0
- [x] lr schedulers aren't synced with the resume step
- maybe save the wandb run?..... 🥺
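A minimal sketch of how a `resume_from: "latest"` option could resolve the newest `step_XXX` directory under `output_dir`. The config key and helper name are hypothetical, taken from the suggestion in the checklist above, not from the PR itself.

```python
# Hypothetical helper: pick the step_{N} checkpoint directory with the largest N.
# Assumes the step-based layout from this PR (${output_dir}/step_{N}/...);
# the "latest" semantics are only a proposed design from the review checklist.
import re
from pathlib import Path
from typing import Optional


def find_latest_step_dir(output_dir: str) -> Optional[Path]:
    """Return the step_{N} subdirectory with the largest N, or None if absent."""
    pattern = re.compile(r"^step_(\d+)$")
    candidates = []
    for child in Path(output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


# e.g. resolving cfg.checkpointer.resume_from == "latest" (proposed, not actual API):
latest = find_latest_step_dir("/tmp/torchtune/llama3_2_1B/full_single_device")
# -> .../step_200 for the directory listing shown in the test plan above
```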
Codecov Report
Attention: Patch coverage is 26.06061% with 244 lines in your changes missing coverage. Please review.
Project coverage is 59.86%. Comparing base (`3134f90`) to head (`ce41c15`). Report is 1 commit behind head on main.
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2384      +/-   ##
==========================================
- Coverage   60.14%   59.86%    -0.29%
==========================================
  Files         437      437
  Lines       26912    27028      +116
==========================================
- Hits        16187    16181        -6
- Misses      10725    10847      +122
```
:umbrella: View full report in Codecov by Sentry.
rebased to #2869