torchtune icon indicating copy to clipboard operation
torchtune copied to clipboard

[WIP] HuggingFaceModelTokenizer

Open krammnic opened this issue 6 months ago • 1 comments

Context

What is the purpose of this PR? Is it to

  • [X] add a new feature
  • [ ] fix a bug
  • [ ] update tests and/or documentation
  • [ ] other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

  • #2706

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • [X] run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • [ ] add unit tests for any new functionality
  • [ ] update docstrings for any new or updated methods or classes
  • [ ] run unit tests via pytest tests
  • [ ] run recipe tests via pytest tests -m integration_test
  • [ ] manually run any new or modified recipes with sufficient proof of correctness
  • [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it. Here is a docstring example and a tutorial example

  • [ ] I did not change any public API
  • [ ] I have added an example to docs or docstrings

Basically, this is a first pass (I'm thinking about how to add masking), but jinja render works surprisingly well.

krammnic avatar May 12 '25 17:05 krammnic

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2723

Note: Links to docs will display an error until the docs builds have been completed.

:heavy_exclamation_mark: 1 Merge Blocking SEVs

There is 1 active merge blocking SEVs. Please view them below:

If you must merge, use @pytorchbot merge -f.

:white_check_mark: No Failures

As of commit 2740df788d9f32beb4e25bcf4c2d3fb7cf455aa6 with merge base 4ff30cae5041c381188091dda44e3dc9e0819ce0 (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar May 12 '25 17:05 pytorch-bot[bot]

Let me push unit test and we will iterate on this one more time

krammnic avatar May 26 '25 18:05 krammnic

@ebsmothers Let's iterate

krammnic avatar May 28 '25 12:05 krammnic

@ebsmothers @joecummings fixed CI

krammnic avatar Jun 03 '25 19:06 krammnic

"Extra <|begin_of_text|> at the beginning"

Not fully understand, it seems to me that we've discussed that this behavior is kind of expected for the gemma3 tokenizer and it is equivalent to the transformers version

krammnic avatar Jun 04 '25 08:06 krammnic

"Extra <|begin_of_text|> at the beginning"

Not fully understand, it seems to me that we've discussed that this behavior is kind of expected for the gemma3 tokenizer and it is equivalent to the transformers version

Oh my script was for Llama3, I see it there as well. Let me dig in a bit further though

ebsmothers avatar Jun 04 '25 14:06 ebsmothers

In addition to @felipemello1's comments, let's sanity check that we get the same results with this as we get with HF's apply_chat_template. Similar to what I shared yesterday, if we can show a script like this gives the same results for a couple of models this should be good to go. (At least for this script, I get duplicate BOS at the beginning, which should be resolvable via my comment. I also get a newline in the HF one which ours does not have, not sure what that's about..)

Separately, please make sure to use a smaller version of tokenizer.json in your unit test when you get a chance.

ebsmothers avatar Jun 05 '25 00:06 ebsmothers

@ebsmothers I don't like the fact that CI failed only for 3.9 and with:

ValueError: Unable to load checkpoint from /tmp/pytest-of-ec2-user/pytest-0/test_training_state_on_resume_14/recipe_state/recipe_state.pt.

It might be some separate issue....

krammnic avatar Jun 05 '25 08:06 krammnic

Codecov Report

Attention: Patch coverage is 96.87500% with 3 lines in your changes missing coverage. Please review.

Project coverage is 60.22%. Comparing base (9cb77af) to head (c4b9c35). Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...une/modules/transforms/tokenizers/_hf_tokenizer.py 96.00% 3 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2723      +/-   ##
==========================================
+ Coverage   60.08%   60.22%   +0.14%     
==========================================
  Files         435      435              
  Lines       26742    26828      +86     
==========================================
+ Hits        16067    16157      +90     
+ Misses      10675    10671       -4     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Jun 05 '25 22:06 codecov-commenter

@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.

SalmanMohammadi avatar Jun 05 '25 22:06 SalmanMohammadi

@krammnic Did you get a chance to reduce the file size and also minify the tokenizer test assets? The diff should just be one line for those files.

Have done it briefly, but reduced on 2 times. I think it is sufficient (under 5 mb)

krammnic avatar Jun 05 '25 22:06 krammnic

I don't think that we've minified any other assets actually

krammnic avatar Jun 05 '25 22:06 krammnic