
AutoTokenizer: Phi-3 drops spaces when decoding one token at a time

Open Andrei-Aksionov opened this issue 1 year ago • 1 comment

System Info

  • transformers version: 4.41.2
  • Platform: macOS-14.5-x86_64-i386-64bit
  • Python version: 3.11.6
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

phi_2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

for name, tokenizer in (("phi-2", phi_2_tokenizer), ("phi-3", phi_3_tokenizer)):
    print(f"Tokenizer: {name}")
    tokens = tokenizer.encode("This is a test string")
    print(f"{tokens=}")
    print(tokenizer.decode(tokens))
    print("".join([tokenizer.decode(token) for token in tokens]))
    print("-" * 50)
Tokenizer: phi-2
tokens=[1212, 318, 257, 1332, 4731]
This is a test string
This is a test string
--------------------------------------------------
Tokenizer: phi-3
tokens=[1, 910, 338, 263, 1243, 1347]
<s> This is a test string
<s>Thisisateststring
--------------------------------------------------

Expected behavior

I expect that, even if I decode a single token at a time, the resulting string should contain spaces between tokens. As one can see, the Phi-2 tokenizer has no such problem, but for some reason Phi-3 produces a concatenated string with the spaces dropped.

Andrei-Aksionov avatar Jun 26 '24 15:06 Andrei-Aksionov

cc @itazap

ArthurZucker avatar Jun 28 '24 16:06 ArthurZucker

Hey @Andrei-Aksionov , thanks for the reproducer! It has to do with Phi-3 being based on LlamaTokenizerFast while Phi-2 is based on CodeGen. LlamaTokenizerFast strips the leading whitespace so that it can manually re-add a prefix space when add_prefix_space is set. I'm looking into a fix now that handles this better!
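
For illustration, a minimal sketch (using the same Phi-3 checkpoint as the reproducer) that makes the stripping visible: the raw SentencePiece token strings keep their "▁" whitespace markers, but decoding one id at a time drops the leading space from each piece.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
ids = tok.encode("This is a test string")

# The raw token strings still carry the SentencePiece "▁" whitespace markers ...
print(tok.convert_ids_to_tokens(ids))
# e.g. ['<s>', '▁This', '▁is', '▁a', '▁test', '▁string']

# ... but per-id decoding strips the leading space from each piece.
print([tok.decode(i) for i in ids])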

itazap avatar Jul 01 '24 13:07 itazap

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 27 '24 08:07 github-actions[bot]

Is there currently a workaround for this behavior?

mattf1n avatar Apr 28 '25 16:04 mattf1n

Hey @mattf1n

Yes, I've implemented a workaround here. I don't know all the edge cases, but it seems to work.
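
One common approach, as a minimal sketch (not necessarily identical to the linked code, and with edge cases left out): decode the growing prefix of ids and emit only the newly added text, so whitespace handling matches a full decode.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
ids = tokenizer.encode("This is a test string")

pieces, previous = [], ""
for i in range(1, len(ids) + 1):
    # Decode everything seen so far, then keep only the text the latest token added.
    full = tokenizer.decode(ids[:i], skip_special_tokens=True)
    pieces.append(full[len(previous):])
    previous = full

print("".join(pieces))  # "This is a test string"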

Andrei-Aksionov avatar Apr 28 '25 16:04 Andrei-Aksionov