
[`Blenderbot`] Discrepancy between `BlenderbotTokenizer` and `BlenderbotTokenizerFast`

Open younesbelkada opened this issue 2 years ago • 6 comments

System Info

main branch

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

The initial issue is that I didn't get the same generated output when using BlenderbotTokenizer and BlenderbotTokenizerFast. The initial script to reproduce is the following:

import torch
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast, BlenderbotForConditionalGeneration, AutoTokenizer

mname = "facebook/blenderbot-400M-distill"
model = BlenderbotForConditionalGeneration.from_pretrained(mname)

tokenizer = BlenderbotTokenizer.from_pretrained(mname, add_prefix_space=False)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname, add_prefix_space=False)


NEXT_UTTERANCE = (
    "My friends are cool but they eat too many carbs.</s> <s> That's unfortunate. "
    "Are they trying to lose weight or are they just trying to be healthier?</s> "
    "<s> I'm not sure."
)
inputs = tokenizer([NEXT_UTTERANCE], return_tensors="pt")
inputs_fast = tokenizer_fast([NEXT_UTTERANCE], return_tensors="pt")

# check that the fast tokenizer produces the same ids as the slow one
# (this is the discrepancy: with this setup the two encodings differ and the assertion fails)
assert torch.all(inputs.input_ids == inputs_fast.input_ids)
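
To see where the two encodings diverge, you can map each id stream back to tokens. Below is a hedged sketch reusing the tokenizer, tokenizer_fast, inputs and inputs_fast objects from above (convert_ids_to_tokens is standard tokenizer API; the exact tokens depend on the checkpoint's vocab):

import itertools

slow_tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
fast_tokens = tokenizer_fast.convert_ids_to_tokens(inputs_fast.input_ids[0].tolist())
# zip_longest so that a length mismatch between the two encodings also shows up
for i, (s, f) in enumerate(itertools.zip_longest(slow_tokens, fast_tokens)):
    if s != f:
        print(f"position {i}: slow={s!r} fast={f!r}")

The generation-side symptom of the same discrepancy: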

from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast, BlenderbotForConditionalGeneration, AutoTokenizer

mname = "facebook/blenderbot-400M-distill"
model = BlenderbotForConditionalGeneration.from_pretrained(mname)

tokenizer = BlenderbotTokenizer.from_pretrained(mname)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname)

def generate(tokenizer):
    UTTERANCE = "My friends are cool but they eat too many carbs."

    inputs = tokenizer([UTTERANCE], return_tensors="pt")

    NEXT_UTTERANCE = (
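        # note: no space between "<s>" and "That's" in this variant (see raghavanone's comment below)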
        "My friends are cool but they eat too many carbs.</s> <s>That's unfortunate. "
        "Are they trying to lose weight or are they just trying to be healthier?</s> "
        "<s> I'm not sure."
    )
    inputs = tokenizer([NEXT_UTTERANCE], return_tensors="pt")
    next_reply_ids = model.generate(**inputs)
    # print("decoded input : ", tokenizer.batch_decode(inputs.input_ids, skip_special_tokens=False)[0])
    print("Bot: ", tokenizer.batch_decode(next_reply_ids, skip_special_tokens=False)[0])

generate(tokenizer)
generate(tokenizer_fast)
>>> That's too bad. Have you tried encouraging them to change their eating habits?
>>> I see. Well, it's good that they're trying to change their eating habits.

The first reply comes from the slow tokenizer and the second from the fast one, so the two disagree. Interestingly, the following always passes:

import torch
from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast, BlenderbotForConditionalGeneration, AutoTokenizer

mname = "facebook/blenderbot-400M-distill"
model = BlenderbotForConditionalGeneration.from_pretrained(mname)

tokenizer = BlenderbotTokenizer.from_pretrained(mname)
tokenizer_fast = BlenderbotTokenizerFast.from_pretrained(mname)


NEXT_UTTERANCE = (
    "My friends are cool but they eat too many carbs.</s> <s> That's unfortunate. "
    "Are they trying to lose weight or are they just trying to be healthier?</s> "
    "<s> I'm not sure."
)
UTTERANCE = "My friends are cool but they eat too many carbs."

_ = tokenizer([UTTERANCE], return_tensors="pt")
_ = tokenizer_fast([UTTERANCE], return_tensors="pt")

inputs = tokenizer([NEXT_UTTERANCE], return_tensors="pt")

inputs_fast = tokenizer_fast([NEXT_UTTERANCE], return_tensors="pt")

# check that the fast tokenizer is the same as the slow one
assert torch.all(inputs.input_ids == inputs_fast.input_ids)

next_reply_ids = model.generate(**inputs)
next_reply_ids_fast = model.generate(**inputs_fast)

assert torch.all(next_reply_ids == next_reply_ids_fast)  # the generated ids match as well

print(tokenizer.batch_decode(next_reply_ids))
>>> I see. Well, it's good that they're trying to change their eating habits.
print(tokenizer_fast.batch_decode(next_reply_ids_fast))
>>> I see. Well, it's good that they're trying to change their eating habits.

Expected behavior

Ideally, both generations should be the same!

cc @ydshieh @ArthurZucker

younesbelkada avatar Jan 25 '23 16:01 younesbelkada

Hi @younesbelkada

It would be nice if you could also show inputs and inputs_fast (we can definitely check ourselves), or mention whether they are the same or not :-)

ydshieh avatar Jan 25 '23 16:01 ydshieh

Thanks a lot! I have updated the description with more details

younesbelkada avatar Jan 25 '23 17:01 younesbelkada

I'll have a look, but the fact that the second script works well is already good. Will check that all the input_ids and generated_ids are the same.

ArthurZucker avatar Jan 25 '23 17:01 ArthurZucker

@ArthurZucker I wanted to work on this issue. I did a little more digging and found out that this issue (a difference in the input_ids produced by the tokenizers) happens when <s> is not followed by a space. The 2nd script works because there is a space between <s> and the next character.

raghavanone avatar Jan 26 '23 13:01 raghavanone
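
A minimal sketch of that observation (assuming the same checkpoint as above; if the hypothesis holds, only the variant without a space after "<s>" should diverge):

from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

mname = "facebook/blenderbot-400M-distill"
slow = BlenderbotTokenizer.from_pretrained(mname)
fast = BlenderbotTokenizerFast.from_pretrained(mname)

for text in ("</s> <s>That's unfortunate.", "</s> <s> That's unfortunate."):
    # __call__ without return_tensors yields plain Python lists, so == compares directly
    same = slow(text).input_ids == fast(text).input_ids
    print(f"{text!r}: slow == fast -> {same}")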

This could mean that the clean_up_tokenization_spaces or spaces_between_special_tokens args don't have the same values in the slow and fast tokenizers.

ArthurZucker avatar Jan 26 '23 14:01 ArthurZucker
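
One way to probe that hypothesis is to decode a fixed id sequence with the cleanup flag toggled and compare the resulting strings. A hedged sketch, reusing the slow tokenizer from the snippet above (clean_up_tokenization_spaces is a documented decode kwarg; whether it explains the encoding difference is exactly what's being investigated here):

ids = slow("</s> <s>Hello").input_ids
for cleanup in (True, False):
    # toggle the cleanup behavior and inspect the raw decoded string
    text = slow.decode(ids, clean_up_tokenization_spaces=cleanup)
    print(f"clean_up_tokenization_spaces={cleanup}: {text!r}")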

Okay, let me dig further in this direction.

raghavanone avatar Jan 26 '23 15:01 raghavanone

You can now control the clean_up_tokenization_spaces parameter when initializing a tokenizer (merged in #22341), which should have fixed this issue (you need to set the param accordingly).

ArthurZucker avatar May 26 '23 09:05 ArthurZucker
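
A hedged sketch of the workaround described above (the exact default and accepted values may vary across transformers versions):

from transformers import BlenderbotTokenizer, BlenderbotTokenizerFast

mname = "facebook/blenderbot-400M-distill"
# set the flag explicitly at load time so the slow and fast tokenizers agree
slow = BlenderbotTokenizer.from_pretrained(mname, clean_up_tokenization_spaces=True)
fast = BlenderbotTokenizerFast.from_pretrained(mname, clean_up_tokenization_spaces=True)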