torchtune
torchtune copied to clipboard
Adding EOS Tokens to Qwen Models
Fix #2481
Current status 👇
from torchtune.models.qwen2 import Qwen2Tokenizer
from torchtune.data import Message
messages = [
Message(role="user", content="Hello world!", masked=True),
Message(role="assistant", content="How are you?", masked=False),
]
tokenizer = Qwen2Tokenizer(
path="Qwen2-0.5B-Instruct/vocab.json",
merges_file="Qwen2-0.5B-Instruct/merges.txt",
)
tokenized_text = tokenizer.tokenize_messages(messages)
print(tokenized_text)
# (
# [None, 151644, 872, 198, 9707, 1879, 0, 151645, 198, 151644, 77091, 198, 4340, 525, 498, 30, 151645, 198, 151645],
# [True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, True]
# )
Not very sure why None comes as a token. Am I missing something obvious here? Any help would be great.
After the initial review, I will add tests.
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2512
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:x: 3 New Failures, 3 Cancelled Jobs, 1 Unrelated Failure
As of commit c82effe5564c0f5f61f1caa206c4994befa0d7cb with merge base 20bdf1086e947fd010dd79faf1c78600062cb162 ():
NEW FAILURES - The following jobs have failed:
- Build Docs / build_docs (3.11) (gh)
Process completed with exit code 2. - GPU tests / gpu_test (3.11, stable) (gh)
Process completed with exit code 2. - Unit Test / unit_tests (3.11) (gh)
Process completed with exit code 2.
CANCELLED JOBS - The following jobs were cancelled. Please retry:
- GPU tests / gpu_test (3.9, stable) (gh)
Process completed with exit code 2. - Unit Test / unit_tests (3.10) (gh)
##[error]The operation was canceled. - Unit Test / unit_tests (3.9) (gh)
##[error]The operation was canceled.
BROKEN TRUNK - The following job failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
A gentle ping on this.
Thanks for the ping, sorry this fell through the cracks!
The none is coming from the mis-use of the bos_id here. bos_id is not actually defined for Qwen2 so it's showing up as none.
Interesting.
Can I get a quick review on the changes I have made so far, just to be sure about my implementation.
Interesting.
Can I get a quick review on the changes I have made so far, just to be sure about my implementation.
Code looks nice and clean and I don't see any obvious errors (with the exception of the bos_id thing I already pointed out to you). Of course, the real test will be the unit tests :)
Hey @ariG23498 , @joecummings , I am new to torchtune and OSS, I was thinking of taking Qwen2_5. This PR will solve the issue for both Qwen2 and qwen2_5 ?
I think qwen2_5's tokenizer is inheriting from qwen2, so no need to implement for qwen2_5?
@joecummings I think the PR is in a good state to review. The tests pass on my end.
@ysurs I checked the Qwen 2.5 tokenizer, it does inherit Qwen 2's tokenizer, but the edits still need to be made I guess, as the current tokenizers API would have changed.
Codecov Report
Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.
Project coverage is 60.70%. Comparing base (
fb6d4cd) to head (3d979b7).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| torchtune/models/qwen2/_tokenizer.py | 93.33% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2512 +/- ##
==========================================
- Coverage 64.26% 60.70% -3.56%
==========================================
Files 427 427
Lines 25988 25997 +9
==========================================
- Hits 16700 15781 -919
- Misses 9288 10216 +928
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Hi @joecummings,
The PR was missing wrong docstrings, which was fixed. Can you trigger the tests, so that I can be certain about the changes?
Hi @joecummings,
The PR was missing wrong docstrings, which was fixed. Can you trigger the tests, so that I can be certain about the changes?
If you look at the contributing guide, you should be able to run these tests very quickly on your local machine to ensure that you're passing the linter.
Hi folks!
Let me know if I need to work on something to get this to the finishing line 🤗