
[BUG] GPT-2 tokenizer is NOT invertible

Open jdeschena opened this issue 7 months ago • 7 comments

System Info

Hello,

It is my understanding that the GPT-2 tokenizer, obtained with AutoTokenizer.from_pretrained("gpt2"), should be invertible. That is, given a sentence text, we should have:

text == tokenizer.decode(tokenizer(text, add_special_tokens=False)["input_ids"])

However, this is not the case, unlike the tiktoken reference implementation, which is correctly invertible.

For example, given the sentence Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence ., encoding followed by decoding removes the spaces before punctuation, yielding a different sentence.

I have tried instantiating the tokenizer using GPT2Tokenizer.from_pretrained("openai-community/gpt2"), and using the options add_prefix_space=True or is_split_into_words=True, but the problem persists.

Hence, it looks like a bug to me, since BPE tokenizers should be invertible, as far as I understand.
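
For context, the mismatch seems to come from the decode step rather than from BPE itself: by default, transformers' decode() runs a token-cleanup pass (controlled by the clean_up_tokenization_spaces argument) that strips spaces before punctuation. The sketch below is a simplified approximation of that pass, not the library's exact code, but it reproduces the behavior seen in this report:

```python
def clean_up_tokenization(text: str) -> str:
    # Approximation of the cleanup that transformers applies after decoding
    # when clean_up_tokenization_spaces is enabled: collapse the space
    # before common punctuation and contractions.
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

orig = "Is this restaurant family-friendly ? Yes No Unsure ?"
print(clean_up_tokenization(orig))
# Is this restaurant family-friendly? Yes No Unsure?
```

If this is indeed the cause, decoding with clean_up_tokenization_spaces=False should skip the pass and restore the original string; tiktoken has no such post-processing, which would explain why its round trip is exact.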

Who can help?

@ArthurZucker

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Run this code, and you should see the bug. I am using transformers==4.38.2

import tiktoken
from transformers import GPT2Tokenizer

# gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
oai_tokenizer = tiktoken.get_encoding("gpt2")

orig = "Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence ."

# Round trip through the Hugging Face tokenizer
hf_enc = gpt2_tokenizer(orig)["input_ids"]
hf_dec = gpt2_tokenizer.decode(hf_enc)

# Round trip through the tiktoken reference implementation
oai_enc = oai_tokenizer.encode(orig)
oai_dec = oai_tokenizer.decode(oai_enc)

print(hf_dec)  # spaces before punctuation are removed
print(oai_dec)  # matches orig exactly

Expected behavior

The two decoded sentences should be equal, yet they are not.

jdeschena avatar Jul 10 '24 08:07 jdeschena