Weird behavior for initial tokens in BERT Base Cased
System Info
transformers version: 4.27.4
python version: 3.8.8
Who can help?
@ArthurZucker @younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I'm running a simple MLM task using BERT Base Cased. I'm noticing weird behavior when decoding the first token (after the CLS token) in the output. Here's an example:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained('bert-base-cased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

inputs = tokenizer(['The laws have done [MASK] harm.'], return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

tokenizer.batch_decode(torch.argmax(outputs.logits, dim=-1))
```
This produces the output: `.. laws have done no harm..`. I know the first and last dots correspond to the predictions at the [CLS] and [SEP] positions, so they should be ignored, but the second dot is where `The` should be. This happens with a variety of words in many sentences, though not always for the same words. The model does seem to attend to the initial word even when it isn't reproduced, since the predictions at the other positions change depending on what that initial word is, even when the word itself doesn't appear in the decoded output. But it looks weird. Is this normal behavior?
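Here's a minimal sketch of the check I ran to convince myself the model still conditions on the initial word (it reuses `model` and `tokenizer` from the snippet above; the second sentence is just an arbitrary variant I picked for illustration):

```python
import torch

def mask_logits(text):
    # Encode, run the model, and return the logits at the [MASK] position.
    enc = tokenizer([text], return_tensors='pt')
    with torch.no_grad():
        logits = model(**enc).logits
    mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    return logits[0, mask_pos]

a = mask_logits('The laws have done [MASK] harm.')
b = mask_logits('Those laws have done [MASK] harm.')

# If the model ignored the first word, these two would be identical.
print((a - b).abs().max())
print(tokenizer.convert_ids_to_tokens(torch.topk(a, 5).indices.tolist()))
print(tokenizer.convert_ids_to_tokens(torch.topk(b, 5).indices.tolist()))
```

The difference comes out non-zero, which is what I mean by the results differing depending on the initial word.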
When I use the fill-mask pipeline, I get a different result, but I'm assuming that the pipeline just internally uses string replacement for the mask token rather than actually decoding the full output.
```python
from transformers import pipeline

pipe = pipeline('fill-mask', 'bert-base-cased')
pipe('The laws have done [MASK] harm.')[0]['sequence']
```
This produces `The laws have done no harm.`, as expected.
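For reference, here's a sketch of what I assume the pipeline is effectively doing: keep the original input ids and substitute the model's prediction only at the [MASK] position before decoding (again reusing `model` and `tokenizer` from above):

```python
import torch

enc = tokenizer(['The laws have done [MASK] harm.'], return_tensors='pt')
with torch.no_grad():
    logits = model(**enc).logits

ids = enc.input_ids.clone()
mask_pos = (ids == tokenizer.mask_token_id).nonzero()[0, 1]
# Only the [MASK] position gets the model's prediction; every other token is kept.
ids[0, mask_pos] = logits[0, mask_pos].argmax()
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

If my assumption is right, this should reproduce the pipeline's `The laws have done no harm.`, so the difference is only about which positions get decoded, not about the model itself.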
Expected behavior
I'd expect the given (unmasked) tokens to be reproduced as-is, for the most part. Sentence-initial `The` and `I` seem to trigger this a lot, which is odd, given that I'd expect those to be well attested in the training data.
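To quantify how far off the prediction is at that position, here's a sketch that checks the probability and rank BERT assigns to the original `The` at position 1 (reusing `model` and `tokenizer` from the reproduction above):

```python
import torch

enc = tokenizer(['The laws have done [MASK] harm.'], return_tensors='pt')
with torch.no_grad():
    logits = model(**enc).logits

probs = logits[0, 1].softmax(-1)      # position 1 is the token right after [CLS]
the_id = enc.input_ids[0, 1].item()   # vocabulary id of the original 'The'

top_token = tokenizer.convert_ids_to_tokens([probs.argmax().item()])[0]
rank_of_the = int((probs > probs[the_id]).sum()) + 1
print(top_token, float(probs[the_id]), rank_of_the)
```

Even when `The` isn't rank 1 there, I'd still expect it to be near the top, which is what I mean by tokens being retained "for the most part".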