Weird behavior for initial tokens in BERT Base Cased
System Info
transformers version: 4.27.4
python version: 3.8.8
Who can help?
@ArthurZucker @younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I'm running a simple MLM task using BERT Base Cased. I'm noticing weird behavior when decoding the first token (after the CLS token) in the output. Here's an example:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained('bert-base-cased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

inputs = tokenizer(['The laws have done [MASK] harm.'], return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

tokenizer.batch_decode(torch.argmax(outputs.logits, dim=-1))
```
This produces the output: `.. laws have done no harm..`. I know the first and last dots correspond to the predictions at the [CLS] and [SEP] positions, so they should be ignored, but the second dot is where `The` should be. This happens with a variety of words in many sentences, though not always for the same words. The model does seem to attend to the initial word even when it isn't reproduced, since the predictions at the other positions change depending on what that initial word is, even when the word itself doesn't appear in the decoded output. But it looks weird. Is this normal behavior?
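Here's a minimal sketch of the check I ran to convince myself the model still conditions on the initial word (it reuses `model` and `tokenizer` from the snippet above; the second sentence is just an arbitrary variant I picked for illustration):

```python
import torch

def mask_logits(text):
    # Encode, run the model, and return the logits at the [MASK] position.
    enc = tokenizer([text], return_tensors='pt')
    with torch.no_grad():
        logits = model(**enc).logits
    mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    return logits[0, mask_pos]

a = mask_logits('The laws have done [MASK] harm.')
b = mask_logits('Those laws have done [MASK] harm.')

# If the model ignored the first word, these two would be identical.
print((a - b).abs().max())
print(tokenizer.convert_ids_to_tokens(torch.topk(a, 5).indices.tolist()))
print(tokenizer.convert_ids_to_tokens(torch.topk(b, 5).indices.tolist()))
```

The difference comes out non-zero, which is what I mean by the results differing depending on the initial word.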
When I use the fill-mask pipeline, I get a different result, but I'm assuming that the pipeline just internally uses string replacement for the mask token rather than actually decoding the full output.
```python
from transformers import pipeline

pipe = pipeline('fill-mask', 'bert-base-cased')
pipe('The laws have done [MASK] harm.')[0]['sequence']
```
This produces `The laws have done no harm.`, as expected.
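For reference, here's a sketch of what I assume the pipeline is effectively doing: keep the original input ids and substitute the model's prediction only at the [MASK] position before decoding (again reusing `model` and `tokenizer` from above):

```python
import torch

enc = tokenizer(['The laws have done [MASK] harm.'], return_tensors='pt')
with torch.no_grad():
    logits = model(**enc).logits

ids = enc.input_ids.clone()
mask_pos = (ids == tokenizer.mask_token_id).nonzero()[0, 1]
# Only the [MASK] position gets the model's prediction; every other token is kept.
ids[0, mask_pos] = logits[0, mask_pos].argmax()
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

If my assumption is right, this should reproduce the pipeline's `The laws have done no harm.`, so the difference is only about which positions get decoded, not about the model itself.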
Expected behavior
I'd expect the given (unmasked) tokens to be reproduced as-is, for the most part. Sentence-initial `The` and `I` seem to trigger this a lot, which is odd, given that I'd expect those to be well attested in the training data.
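To quantify how far off the prediction is at that position, here's a sketch that checks the probability and rank BERT assigns to the original `The` at position 1 (reusing `model` and `tokenizer` from the reproduction above):

```python
import torch

enc = tokenizer(['The laws have done [MASK] harm.'], return_tensors='pt')
with torch.no_grad():
    logits = model(**enc).logits

probs = logits[0, 1].softmax(-1)      # position 1 is the token right after [CLS]
the_id = enc.input_ids[0, 1].item()   # vocabulary id of the original 'The'

top_token = tokenizer.convert_ids_to_tokens([probs.argmax().item()])[0]
rank_of_the = int((probs > probs[the_id]).sum()) + 1
print(top_token, float(probs[the_id]), rank_of_the)
```

Even when `The` isn't rank 1 there, I'd still expect it to be near the top, which is what I mean by tokens being retained "for the most part".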