torchtune icon indicating copy to clipboard operation
torchtune copied to clipboard

Adding EOS Tokens to Qwen Models

Open ariG23498 opened this issue 8 months ago • 10 comments

Fix #2481

Current status 👇

from torchtune.models.qwen2 import Qwen2Tokenizer
from torchtune.data import Message

messages = [
    Message(role="user", content="Hello world!", masked=True),
    Message(role="assistant", content="How are you?", masked=False),
]

tokenizer = Qwen2Tokenizer(
    path="Qwen2-0.5B-Instruct/vocab.json",
    merges_file="Qwen2-0.5B-Instruct/merges.txt",
)
tokenized_text = tokenizer.tokenize_messages(messages)
print(tokenized_text)


# (
# [None, 151644, 872, 198, 9707, 1879, 0, 151645, 198, 151644, 77091, 198, 4340, 525, 498, 30, 151645, 198, 151645],
# [True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, True]
# )

Not very sure why None comes as a token. Am I missing something obvious here? Any help would be great.

After the initial review, I will add tests.

ariG23498 avatar Mar 18 '25 15:03 ariG23498

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2512

Note: Links to docs will display an error until the docs builds have been completed.

:x: 3 New Failures, 3 Cancelled Jobs, 1 Unrelated Failure

As of commit c82effe5564c0f5f61f1caa206c4994befa0d7cb with merge base 20bdf1086e947fd010dd79faf1c78600062cb162 (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Mar 18 '25 15:03 pytorch-bot[bot]

A gentle ping on this.

ariG23498 avatar Apr 28 '25 06:04 ariG23498

Thanks for the ping, sorry this fell through the cracks!

The none is coming from the mis-use of the bos_id here. bos_id is not actually defined for Qwen2 so it's showing up as none.

joecummings avatar Apr 28 '25 16:04 joecummings

Interesting.

Can I get a quick review on the changes I have made so far, just to be sure about my implementation.

ariG23498 avatar Apr 29 '25 02:04 ariG23498

Interesting.

Can I get a quick review on the changes I have made so far, just to be sure about my implementation.

Code looks nice and clean and I don't see any obvious errors (with the exception of the bos_id thing I already pointed out to you). Of course, the real test will be the unit tests :)

joecummings avatar Apr 30 '25 20:04 joecummings

Hey @ariG23498 , @joecummings , I am new to torchtune and OSS, I was thinking of taking Qwen2_5. This PR will solve the issue for both Qwen2 and qwen2_5 ?

I think qwen2_5's tokenizer is inheriting from qwen2, so no need to implement for qwen2_5?

ysurs avatar May 04 '25 12:05 ysurs

@joecummings I think the PR is in a good state to review. The tests pass on my end.

@ysurs I checked the Qwen 2.5 tokenizer, it does inherit Qwen 2's tokenizer, but the edits still need to be made I guess, as the current tokenizers API would have changed.

ariG23498 avatar May 06 '25 10:05 ariG23498

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.70%. Comparing base (fb6d4cd) to head (3d979b7).

Files with missing lines Patch % Lines
torchtune/models/qwen2/_tokenizer.py 93.33% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2512      +/-   ##
==========================================
- Coverage   64.26%   60.70%   -3.56%     
==========================================
  Files         427      427              
  Lines       25988    25997       +9     
==========================================
- Hits        16700    15781     -919     
- Misses       9288    10216     +928     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar May 07 '25 23:05 codecov-commenter

Hi @joecummings,

The PR was missing wrong docstrings, which was fixed. Can you trigger the tests, so that I can be certain about the changes?

ariG23498 avatar May 08 '25 16:05 ariG23498

Hi @joecummings,

The PR was missing wrong docstrings, which was fixed. Can you trigger the tests, so that I can be certain about the changes?

If you look at the contributing guide, you should be able to run these tests very quickly on your local machine to ensure that you're passing the linter.

joecummings avatar May 08 '25 16:05 joecummings

Hi folks!

Let me know if I need to work on something to get this to the finishing line 🤗

ariG23498 avatar May 27 '25 02:05 ariG23498