Different behavior in DistilBERT when using "inputs_embeds"
System Info
- transformers version: 4.24.0
- Platform: Linux-3.10.0-1160.80.1.0.1.el7.x86_64-x86_64-with-glibc2.27
- Python version: 3.10.8
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.12.1 (True)
@ArthurZucker @younesbelkada
Reproduction
Do a forward pass with DistilBERT passing "inputs_embeds" instead of "input_ids", where "inputs_embeds" contains the output of a forward pass over the word embedding matrix, i.e. just the picked token embeddings. When doing this, one would expect the same behaviour as in other popular models like BERT or RoBERTa, but DistilBERT skips the "positional embeddings addition + LayerNorm + dropout" step because it skips the self.embeddings() call.
I would expect passing "inputs_embeds" instead of "input_ids" to behave coherently across different architectures, but that is not the case, at least for DistilBERT. I am not sure whether other models have this issue.
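A rough, untested sketch of the comparison described above (checkpoint names are only examples; any BERT / DistilBERT checkpoints should show the same thing):

```python
import torch
from transformers import AutoModel, AutoTokenizer

for name in ("bert-base-uncased", "distilbert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    input_ids = tokenizer("hello world", return_tensors="pt")["input_ids"]

    with torch.no_grad():
        out_ids = model(input_ids=input_ids).last_hidden_state
        # "inputs_embeds" is just the output of the word embedding matrix
        inputs_embeds = model.get_input_embeddings()(input_ids)
        out_embeds = model(inputs_embeds=inputs_embeds).last_hidden_state

    # Expected: True for both. Observed: True for BERT, False for DistilBERT,
    # because DistilBERT skips positional embeddings + LayerNorm + dropout.
    print(name, torch.allclose(out_ids, out_embeds, atol=1e-5))
```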
Expected behavior
Properly build the input embeddings by adding positional embeddings + LayerNorm + dropout (as happens in modeling_bert or modeling_roberta). This does not happen because the call to self.embeddings() is skipped.
Hey, can you provide a reproducing script?
@ArthurZucker I don't think I'll have time soon to prepare a reproducing script, but I'm happy to give more details on the issue found.
In DistilBERT, when inputs_embeds is not None, the self.embeddings layer is skipped completely: https://github.com/huggingface/transformers/blob/main/src/transformers/models/distilbert/modeling_distilbert.py
```python
if inputs_embeds is None:
    inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
return self.transformer(
    x=inputs_embeds,
    attn_mask=attention_mask,
    head_mask=head_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```
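In other words, with this code path DistilBERT effectively treats inputs_embeds as the full output of its Embeddings module (word + positional embeddings, LayerNorm and dropout already applied), not as raw word embeddings. A quick illustration of what I mean (distilbert-base-uncased is just an example checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()
input_ids = tokenizer("hello world", return_tensors="pt")["input_ids"]

with torch.no_grad():
    reference = model(input_ids=input_ids).last_hidden_state
    # Passing the full Embeddings output reproduces the input_ids result,
    # because self.embeddings() is skipped entirely.
    full = model(inputs_embeds=model.embeddings(input_ids)).last_hidden_state
    # Passing only the word embeddings (what BERT/RoBERTa expect) does not.
    words = model(inputs_embeds=model.embeddings.word_embeddings(input_ids)).last_hidden_state

print(torch.allclose(reference, full))   # True
print(torch.allclose(reference, words))  # False
```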
However, in BERT the self.embeddings() call always happens and does the following: https://github.com/huggingface/transformers/blob/7d2a5fa749d22f403fe6ceac7d62c003240aee45/src/transformers/models/bert/modeling_bert.py
```python
if inputs_embeds is None:
    inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)

embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == "absolute":
    position_embeddings = self.position_embeddings(position_ids)
    embeddings += position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
```
So, when passing inputs_embeds, the call to self.word_embeddings() does not take place (i.e. inputs_embeds is assumed to already contain the word embeddings), but the positional embedding addition, layer normalization and dropout still happen afterwards.
In summary, passing inputs_embeds to BERT (and other architectures) assumes that inputs_embeds are the word embeddings. However, in DistilBERT, if you pass the word embeddings as inputs_embeds, you end up skipping the positional embedding addition, layer norm and dropout. The docs do not say anything about this difference, and it does not seem reasonable to have to perform the skipped operations manually (sketched below) before passing inputs_embeds to DistilBERT.
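For reference, doing those skipped operations by hand would look roughly like this (a sketch only; the helper name is made up, and the attribute names are taken from modeling_distilbert.py):

```python
import torch

def add_position_ln_dropout(model, word_embeds):
    # Apply what DistilBERT skips when inputs_embeds is passed:
    # positional embeddings + LayerNorm + dropout.
    # word_embeds has shape (bs, seq_length, dim).
    seq_length = word_embeds.size(1)
    position_ids = torch.arange(seq_length, device=word_embeds.device).unsqueeze(0)
    position_embeds = model.embeddings.position_embeddings(position_ids)
    embeddings = model.embeddings.LayerNorm(word_embeds + position_embeds)
    return model.embeddings.dropout(embeddings)

# Usage (hypothetical):
# word_embeds = model.get_input_embeddings()(input_ids)
# outputs = model(inputs_embeds=add_position_ln_dropout(model, word_embeds))
```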
Okay, this does not really need a reproduction script, and I agree with you: the expected behaviour is that if you pre-compute the inputs_embeds, the output of the model should certainly be the same as if you were passing the input ids. This is true for most of our models. I have to check whether there is a particular reason for this in DistilBERT, otherwise I will push a fix!
Great news @ArthurZucker!