[BUG] GPT-2 tokenizer is NOT invertible
System Info
Hello,
It is my understanding that the GPT-2 tokenizer, obtained with AutoTokenizer.from_pretrained("gpt2"), should be invertible. That is, given a sentence text, we should have:
text == tokenizer.decode(tokenizer(text, add_special_tokens=False)["input_ids"])
However, this is not the case, unlike the tiktoken reference implementation, which round-trips correctly.
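For reference, a quick sanity check of tiktoken's round-trip, using only its public encode/decode API (the sample string here is mine):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Is this restaurant family-friendly ? Yes No Unsure ?"
# Byte-level BPE decodes back to the exact input string.
assert enc.decode(enc.encode(text)) == text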
For example, given the sentence Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence . (the one used in the reproduction below), encoding and then decoding removes the space before each punctuation mark, yielding a different sentence.
I have tried instantiating the tokenizer with GPT2Tokenizer.from_pretrained("openai-community/gpt2"), and using the options add_prefix_space=True or is_split_into_words=True, but the problem persists.
Hence, this looks like a bug to me, since byte-level BPE tokenizers should be invertible, as far as I understand.
Who can help?
@ArthurZucker
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Run the code below and you should see the bug. I am using transformers==4.38.2 and the tiktoken package.
import tiktoken
from transformers import GPT2Tokenizer

# gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)  # the fast tokenizer behaves the same
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
oai_tokenizer = tiktoken.get_encoding("gpt2")

orig = "Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence ."

# Round-trip the same sentence with both tokenizers.
hf_enc = gpt2_tokenizer(orig)["input_ids"]
hf_dec = gpt2_tokenizer.decode(hf_enc)
oai_enc = oai_tokenizer.encode(orig)
oai_dec = oai_tokenizer.decode(oai_enc)

print(hf_dec)  # spaces before '?' and '.' are gone
print(oai_dec)  # identical to orig
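Incidentally, the mismatch appears to come from decode-time cleanup rather than from the BPE step itself. A minimal follow-up sketch, continuing from the variables above and assuming the clean_up_tokenization_spaces kwarg of decode is what merges spaces into punctuation (my reading, not confirmed by a maintainer):

# Hypothesis: disabling the post-decode cleanup restores the original string.
hf_dec_raw = gpt2_tokenizer.decode(hf_enc, clean_up_tokenization_spaces=False)
print(hf_dec_raw == orig)  # expected True if the cleanup is the only culprit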
Expected behavior
The two decoded sentences should be equal to each other and to the original input, yet they are not.