torchtune Adding EOS Tokens to Qwen Models

Fix #2481

Current status 👇

from torchtune.models.qwen2 import Qwen2Tokenizer
from torchtune.data import Message

messages = [
    Message(role="user", content="Hello world!", masked=True),
    Message(role="assistant", content="How are you?", masked=False),
]

tokenizer = Qwen2Tokenizer(
    path="Qwen2-0.5B-Instruct/vocab.json",
    merges_file="Qwen2-0.5B-Instruct/merges.txt",
)
tokenized_text = tokenizer.tokenize_messages(messages)
print(tokenized_text)


# (
# [None, 151644, 872, 198, 9707, 1879, 0, 151645, 198, 151644, 77091, 198, 4340, 525, 498, 30, 151645, 198, 151645],
# [True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, True]
# )

Not very sure why None comes as a token. Am I missing something obvious here? Any help would be great.

After the initial review, I will add tests.

Mar 18 '25 15:03 ariG23498

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2512

:page_facing_up: Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

:x: 3 New Failures, 3 Cancelled Jobs, 1 Unrelated Failure

As of commit c82effe5564c0f5f61f1caa206c4994befa0d7cb with merge base 20bdf1086e947fd010dd79faf1c78600062cb162 ():

NEW FAILURES - The following jobs have failed:

Build Docs / build_docs (3.11) (gh) Process completed with exit code 2.
GPU tests / gpu_test (3.11, stable) (gh) Process completed with exit code 2.
Unit Test / unit_tests (3.11) (gh) Process completed with exit code 2.

CANCELLED JOBS - The following jobs were cancelled. Please retry:

GPU tests / gpu_test (3.9, stable) (gh) Process completed with exit code 2.
Unit Test / unit_tests (3.10) (gh) ##[error]The operation was canceled.
Unit Test / unit_tests (3.9) (gh) ##[error]The operation was canceled.

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

GPU tests / gpu_test (3.10, stable) (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Mar 18 '25 15:03 pytorch-bot[bot]

A gentle ping on this.

Apr 28 '25 06:04 ariG23498

Thanks for the ping, sorry this fell through the cracks!

The none is coming from the mis-use of the bos_id here. bos_id is not actually defined for Qwen2 so it's showing up as none.

Apr 28 '25 16:04 joecummings

Interesting.

Can I get a quick review on the changes I have made so far, just to be sure about my implementation.

Apr 29 '25 02:04 ariG23498

Interesting.

Can I get a quick review on the changes I have made so far, just to be sure about my implementation.

Code looks nice and clean and I don't see any obvious errors (with the exception of the bos_id thing I already pointed out to you). Of course, the real test will be the unit tests :)

Apr 30 '25 20:04 joecummings

Hey @ariG23498 , @joecummings , I am new to torchtune and OSS, I was thinking of taking Qwen2_5. This PR will solve the issue for both Qwen2 and qwen2_5 ?

I think qwen2_5's tokenizer is inheriting from qwen2, so no need to implement for qwen2_5?

May 04 '25 12:05 ysurs

@joecummings I think the PR is in a good state to review. The tests pass on my end.

@ysurs I checked the Qwen 2.5 tokenizer, it does inherit Qwen 2's tokenizer, but the edits still need to be made I guess, as the current tokenizers API would have changed.

May 06 '25 10:05 ariG23498

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.70%. Comparing base (fb6d4cd) to head (3d979b7).

Files with missing lines	Patch %	Lines
torchtune/models/qwen2/_tokenizer.py	93.33%	2 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2512      +/-   ##
==========================================
- Coverage   64.26%   60.70%   -3.56%     
==========================================
  Files         427      427              
  Lines       25988    25997       +9     
==========================================
- Hits        16700    15781     -919     
- Misses       9288    10216     +928

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

May 07 '25 23:05 codecov-commenter

Hi @joecummings,

The PR was missing wrong docstrings, which was fixed. Can you trigger the tests, so that I can be certain about the changes?

May 08 '25 16:05 ariG23498

Hi @joecummings,

The PR was missing wrong docstrings, which was fixed. Can you trigger the tests, so that I can be certain about the changes?

If you look at the contributing guide, you should be able to run these tests very quickly on your local machine to ensure that you're passing the linter.

May 08 '25 16:05 joecummings

Hi folks!

Let me know if I need to work on something to get this to the finishing line 🤗

May 27 '25 02:05 ariG23498

torchtune torchtune copied to clipboard

Adding EOS Tokens to Qwen Models

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2512

:x: 3 New Failures, 3 Cancelled Jobs, 1 Unrelated Failure

Codecov Report

torchtune
torchtune copied to clipboard