AutoTokenizer: Phi-3 drops spaces when decoding one token at a time
System Info
- `transformers` version: 4.41.2
- Platform: macOS-14.5-x86_64-i386-64bit
- Python version: 3.11.6
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.2 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

phi_2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

for name, tokenizer in (("phi-2", phi_2_tokenizer), ("phi-3", phi_3_tokenizer)):
    print(f"Tokenizer: {name}")
    tokens = tokenizer.encode("This is a test string")
    print(f"{tokens=}")
    print(tokenizer.decode(tokens))
    print("".join([tokenizer.decode(token) for token in tokens]))
    print("-" * 50)
```
```
Tokenizer: phi-2
tokens=[1212, 318, 257, 1332, 4731]
This is a test string
This is a test string
--------------------------------------------------
Tokenizer: phi-3
tokens=[1, 910, 338, 263, 1243, 1347]
<s> This is a test string
<s>Thisisateststring
--------------------------------------------------
```
Expected behavior
I expect that, even if I decode one token at a time, the resulting string should contain spaces between tokens. As one can see, there is no problem with the Phi-2 model, but for some reason Phi-3 produces such a concatenated string.
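A minimal demonstration of the difference (the ids below are taken from the printed output above):

```python
# Decoding a single mid-sentence token: Phi-2 keeps the leading space,
# Phi-3 drops it.
print(repr(phi_2_tokenizer.decode([318])))  # ' is'
print(repr(phi_3_tokenizer.decode([338])))  # 'is' -- the leading space is gone
```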
cc @itazap
Hey @Andrei-Aksionov, thanks for the reproducer! It has to do with Phi-3 being based on LlamaTokenizerFast and Phi-2 on CodeGen. LlamaTokenizerFast strips the leading whitespace so that it can manually re-add a prefix space when `add_prefix_space` is set. I'm looking into a fix that handles this better!
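For illustration, the word-boundary information is still present in the raw token strings; it is the per-token `decode` that strips it. A quick check, using the ids from the reproducer above:

```python
# SentencePiece marks word boundaries with "▁" (U+2581); decode() turns it
# into a space, but for a single token the stripping described above
# removes that space again.
print(phi_3_tokenizer.convert_ids_to_tokens([1, 910, 338, 263, 1243, 1347]))
# ['<s>', '▁This', '▁is', '▁a', '▁test', '▁string']
```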
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is there currently a workaround for this behavior?
Hey @mattf1n
Yes, I've implemented a workaround here. I don't know what the edge cases are, but it seems to work.
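For anyone else hitting this, one common pattern is to decode the growing prefix of ids and emit only the newly added text. This is a sketch of that idea, not necessarily the same as the linked workaround:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
token_ids = tokenizer.encode("This is a test string")

# Decode the growing prefix of ids and print only the delta, so the space
# carried by each "▁"-prefixed token survives the per-token boundary.
previous_text = ""
for i in range(1, len(token_ids) + 1):
    text = tokenizer.decode(token_ids[:i])
    print(text[len(previous_text):], end="")
    previous_text = text
print()  # -> <s> This is a test string
```

Re-decoding the prefix is quadratic in sequence length and can briefly yield replacement characters when a multi-byte character is split across tokens, so treat it as a sketch rather than a drop-in streaming decoder.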