[`Blenderbot`] Discrepancy between `BlenderbotTokenizer` and `BlenderbotTokenizerFast`
System Info
main branch
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The initial issue is that I don't get the same generated output when using `BlenderbotTokenizer` vs `BlenderbotTokenizerFast`. The initial script to reproduce is the following:
```python
import torch
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast, BlenderbotForConditionalGeneration

mname = "facebook/blenderbot-400M-distill"
model = BlenderbotForConditionalGeneration.from_pretrained(mname)
tokenizer = BlenderbotTokenizer.from_pretrained(mname, add_prefix_space=False)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname, add_prefix_space=False)

NEXT_UTTERANCE = (
    "My friends are cool but they eat too many carbs.</s> <s> That's unfortunate. "
    "Are they trying to lose weight or are they just trying to be healthier?</s> "
    "<s> I'm not sure."
)
inputs = tokenizer([NEXT_UTTERANCE], return_tensors="pt")
inputs_fast = tokenizer_fast([NEXT_UTTERANCE], return_tensors="pt")

# check that the fast tokenizer produces the same input_ids as the slow one
assert torch.all(inputs.input_ids == inputs_fast.input_ids)
```
```python
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast, BlenderbotForConditionalGeneration

mname = "facebook/blenderbot-400M-distill"
model = BlenderbotForConditionalGeneration.from_pretrained(mname)
tokenizer = BlenderbotTokenizer.from_pretrained(mname)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname)

def generate(tokenizer):
    UTTERANCE = "My friends are cool but they eat too many carbs."
    inputs = tokenizer([UTTERANCE], return_tensors="pt")
    NEXT_UTTERANCE = (
        "My friends are cool but they eat too many carbs.</s> <s>That's unfortunate. "
        "Are they trying to lose weight or are they just trying to be healthier?</s> "
        "<s> I'm not sure."
    )
    inputs = tokenizer([NEXT_UTTERANCE], return_tensors="pt")
    next_reply_ids = model.generate(**inputs)
    # print("decoded input : ", tokenizer.batch_decode(inputs.input_ids, skip_special_tokens=False)[0])
    print("Bot: ", tokenizer.batch_decode(next_reply_ids, skip_special_tokens=False)[0])

generate(tokenizer)
generate(tokenizer_fast)
```
```
>>> That's too bad. Have you tried encouraging them to change their eating habits?
>>> I see. Well, it's good that they're trying to change their eating habits.
```
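Not part of the original report, but a quick way to inspect where the two tokenizations actually diverge (as discussed further down) is to dump the token strings for the same prompt. A minimal sketch:

```python
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

mname = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(mname)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname)

# Same prompt as in generate() above ("<s>That's" with no space).
NEXT_UTTERANCE = (
    "My friends are cool but they eat too many carbs.</s> <s>That's unfortunate. "
    "Are they trying to lose weight or are they just trying to be healthier?</s> "
    "<s> I'm not sure."
)
# Token strings are easier to eyeball than raw ids when looking for the
# point of divergence.
print(tokenizer.convert_ids_to_tokens(tokenizer(NEXT_UTTERANCE).input_ids))
print(tokenizer_fast.convert_ids_to_tokens(tokenizer_fast(NEXT_UTTERANCE).input_ids))
```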
Interestingly, this always passes:
```python
import torch
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast, BlenderbotForConditionalGeneration

mname = "facebook/blenderbot-400M-distill"
model = BlenderbotForConditionalGeneration.from_pretrained(mname)
tokenizer = BlenderbotTokenizer.from_pretrained(mname)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname)

NEXT_UTTERANCE = (
    "My friends are cool but they eat too many carbs.</s> <s> That's unfortunate. "
    "Are they trying to lose weight or are they just trying to be healthier?</s> "
    "<s> I'm not sure."
)
UTTERANCE = "My friends are cool but they eat too many carbs."
_ = tokenizer([UTTERANCE], return_tensors="pt")
_ = tokenizer_fast([UTTERANCE], return_tensors="pt")

inputs = tokenizer([NEXT_UTTERANCE], return_tensors="pt")
inputs_fast = tokenizer_fast([NEXT_UTTERANCE], return_tensors="pt")

# check that the fast tokenizer produces the same input_ids as the slow one
assert torch.all(inputs.input_ids == inputs_fast.input_ids)

next_reply_ids = model.generate(**inputs)
next_reply_ids_fast = model.generate(**inputs_fast)

# check that both generations are the same
assert torch.all(next_reply_ids == next_reply_ids_fast)

print(tokenizer.batch_decode(next_reply_ids))
# >>> I see. Well, it's good that they're trying to change their eating habits.
print(tokenizer_fast.batch_decode(next_reply_ids_fast))
# >>> I see. Well, it's good that they're trying to change their eating habits.
```
Expected behavior
Ideally, both generations should be the same!
cc @ydshieh @ArthurZucker
Hi @younesbelkada
It would be nice if you also showed `inputs` and `inputs_fast` (we can definitely check ourselves), or mentioned whether they are the same or not :-)
Thanks a lot! I have updated the description with more details
I'll have a look, but the fact that the second script works well is already good. Will check that all the `input_ids` and generated ids are the same.
@ArthurZucker I wanted to work on this issue. I did a little more digging and found out that this issue (difference in the `input_ids` produced by the tokenizer) happens when `<s>` is not followed by a space. The 2nd script works because there is a space between `<s>` and the next character.
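A minimal sketch of that finding, reusing the checkpoint from the scripts above; the example strings here are made up, but pairing a no-space and a with-space variant of `<s>` should make the divergence reproducible:

```python
import torch
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

mname = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(mname)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname)

# "<s>" immediately followed by a character (no space): this is where the
# slow and fast tokenizers are expected to diverge.
no_space = "Hello</s> <s>That's unfortunate."
# "<s>" followed by a space: both tokenizers should agree here.
with_space = "Hello</s> <s> That's unfortunate."

for text in (no_space, with_space):
    ids_slow = tokenizer([text], return_tensors="pt").input_ids
    ids_fast = tokenizer_fast([text], return_tensors="pt").input_ids
    # Compare shapes first so the element-wise check is well defined.
    match = ids_slow.shape == ids_fast.shape and torch.all(ids_slow == ids_fast)
    print(f"{text!r}: match={bool(match)}")
```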
This could mean that the `clean_up_tokenization_spaces` or `spaces_between_special_tokens` args don't have the same values in the two tokenizers.
Okay, let me dig further in this direction.
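One way to dig in that direction: `clean_up_tokenization_spaces` is both a tokenizer attribute (on versions that include #22341) and a `decode`-time argument, so both can be compared directly. A sketch, assuming a recent enough transformers version:

```python
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

mname = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(mname)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname)

# A mismatch between these two values would support the hypothesis above.
print("slow:", tokenizer.clean_up_tokenization_spaces)
print("fast:", tokenizer_fast.clean_up_tokenization_spaces)

# The argument can also be overridden per call at decode time, to see how
# much of the difference it accounts for (example string is made up).
ids = tokenizer("Hello</s> <s>world").input_ids
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))
```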
You can now control the `clean_up_tokenization_spaces` parameter when initialising a tokenizer (merged in #22341), which should have fixed this issue (the param needs to be updated).
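For reference, a short sketch of what pinning the parameter at load time would look like; this assumes the kwarg is forwarded through `from_pretrained` like other tokenizer init kwargs:

```python
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

mname = "facebook/blenderbot-400M-distill"
# Post-#22341, clean_up_tokenization_spaces can be set at load time so that
# the slow and fast tokenizers decode with identical behavior.
tokenizer = BlenderbotTokenizer.from_pretrained(mname, clean_up_tokenization_spaces=False)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname, clean_up_tokenization_spaces=False)
```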