AutoTokenizer: Phi-3 drops spaces when decoding one token at a time
System Info
- `transformers` version: 4.41.2
- Platform: macOS-14.5-x86_64-i386-64bit
- Python version: 3.11.6
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.2 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

phi_2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

for name, tokenizer in (("phi-2", phi_2_tokenizer), ("phi-3", phi_3_tokenizer)):
    print(f"Tokenizer: {name}")
    tokens = tokenizer.encode("This is a test string")
    print(f"{tokens=}")
    print(tokenizer.decode(tokens))
    print("".join([tokenizer.decode(token) for token in tokens]))
    print("-" * 50)
```
```
Tokenizer: phi-2
tokens=[1212, 318, 257, 1332, 4731]
This is a test string
This is a test string
--------------------------------------------------
Tokenizer: phi-3
tokens=[1, 910, 338, 263, 1243, 1347]
<s> This is a test string
<s>Thisisateststring
--------------------------------------------------
```
Expected behavior
I expect that, even if I decode one token at a time, the resulting string should contain spaces between tokens. As one can see, there is no problem with the Phi-2 model, but for some reason Phi-3 produces such a concatenated string.
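A minimal demonstration of the difference (the ids below are taken from the printed output above):

```python
# Decoding a single mid-sentence token: Phi-2 keeps the leading space,
# Phi-3 drops it.
print(repr(phi_2_tokenizer.decode([318])))  # ' is'
print(repr(phi_3_tokenizer.decode([338])))  # 'is' -- the leading space is gone
```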
cc @itazap
Hey @Andrei-Aksionov, thanks for the reproducer! It has to do with Phi-3 being based on LlamaTokenizerFast and Phi-2 on CodeGen. LlamaTokenizerFast strips the leading whitespace so that it can manually re-add a prefix space when `add_prefix_space` is set. I'm looking into a fix that handles this better!
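For illustration, the word-boundary information is still present in the raw token strings; it is the per-token `decode` that strips it. A quick check, using the ids from the reproducer above:

```python
# SentencePiece marks word boundaries with "▁" (U+2581); decode() turns it
# into a space, but for a single token the stripping described above
# removes that space again.
print(phi_3_tokenizer.convert_ids_to_tokens([1, 910, 338, 263, 1243, 1347]))
# ['<s>', '▁This', '▁is', '▁a', '▁test', '▁string']
```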
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is there currently a workaround for this behavior?
Hey @mattf1n
Yes, I've implemented a workaround here. I don't know what the edge cases are, but it seems to work.
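For anyone else hitting this, one common pattern is to decode the growing prefix of ids and emit only the newly added text. This is a sketch of that idea, not necessarily the same as the linked workaround:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
token_ids = tokenizer.encode("This is a test string")

# Decode the growing prefix of ids and print only the delta, so the space
# carried by each "▁"-prefixed token survives the per-token boundary.
previous_text = ""
for i in range(1, len(token_ids) + 1):
    text = tokenizer.decode(token_ids[:i])
    print(text[len(previous_text):], end="")
    previous_text = text
print()  # -> <s> This is a test string
```

Re-decoding the prefix is quadratic in sequence length and can briefly yield replacement characters when a multi-byte character is split across tokens, so treat it as a sketch rather than a drop-in streaming decoder.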