torchtune
torchtune copied to clipboard
[WIP] HuggingFaceModelTokenizer
Context
What is the purpose of this PR? Is it to
- [X] add a new feature
- [ ] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)
Please link to any issues this PR addresses.
Changelog
What are the changes made in this PR?
- #2706
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- [X] run pre-commit hooks and linters (make sure you've first installed via
pre-commit install) - [ ] add unit tests for any new functionality
- [ ] update docstrings for any new or updated methods or classes
- [ ] run unit tests via
pytest tests - [ ] run recipe tests via
pytest tests -m integration_test - [ ] manually run any new or modified recipes with sufficient proof of correctness
- [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example
- [ ] I did not change any public API
- [ ] I have added an example to docs or docstrings
Basically, this is a first pass (I'm thinking about how to add masking), but jinja render works surprisingly well.
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2723
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:heavy_exclamation_mark: 1 Merge Blocking SEVs
There is 1 active merge blocking SEVs. Please view them below:
- (merge blocking) Long CI Queue Times: VolumeLimitExceeded
If you must merge, use @pytorchbot merge -f.
:white_check_mark: No Failures
As of commit 2740df788d9f32beb4e25bcf4c2d3fb7cf455aa6 with merge base 4ff30cae5041c381188091dda44e3dc9e0819ce0 ():
:green_heart: Looks good so far! There are no failures yet. :green_heart:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Let me push unit test and we will iterate on this one more time
@ebsmothers Let's iterate
@ebsmothers @joecummings fixed CI
"Extra <|begin_of_text|> at the beginning"
Not fully understand, it seems to me that we've discussed that this behavior is kind of expected for the gemma3 tokenizer and it is equivalent to the transformers version
"Extra <|begin_of_text|> at the beginning"
Not fully understand, it seems to me that we've discussed that this behavior is kind of expected for the gemma3 tokenizer and it is equivalent to the transformers version
Oh my script was for Llama3, I see it there as well. Let me dig in a bit further though
In addition to @felipemello1's comments, let's sanity check that we get the same results with this as we get with HF's apply_chat_template. Similar to what I shared yesterday, if we can show a script like this gives the same results for a couple of models this should be good to go. (At least for this script, I get duplicate BOS at the beginning, which should be resolvable via my comment. I also get a newline in the HF one which ours does not have, not sure what that's about..)
Separately, please make sure to use a smaller version of tokenizer.json in your unit test when you get a chance.
@ebsmothers I don't like the fact that CI failed only for 3.9 and with:
ValueError: Unable to load checkpoint from /tmp/pytest-of-ec2-user/pytest-0/test_training_state_on_resume_14/recipe_state/recipe_state.pt.
It might be some separate issue....
Codecov Report
Attention: Patch coverage is 96.87500% with 3 lines in your changes missing coverage. Please review.
Project coverage is 60.22%. Comparing base (
9cb77af) to head (c4b9c35). Report is 6 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...une/modules/transforms/tokenizers/_hf_tokenizer.py | 96.00% | 3 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2723 +/- ##
==========================================
+ Coverage 60.08% 60.22% +0.14%
==========================================
Files 435 435
Lines 26742 26828 +86
==========================================
+ Hits 16067 16157 +90
+ Misses 10675 10671 -4
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.
@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.
Have done it briefly, but reduced on 2 times. I think it is sufficient (under 5 mb)
I don't think that we've minified any other assets actually