Different behavior in DistilBERT when using "inputs_embeds"
System Info
- transformers version: 4.24.0
- Platform: Linux-3.10.0-1160.80.1.0.1.el7.x86_64-x86_64-with-glibc2.27
- Python version: 3.10.8
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.12.1 (True)
@ArthurZucker @younesbelkada
Reproduction
Do a forward pass with DistilBERT passing "inputs_embeds" instead of "input_ids", where "inputs_embeds" contains the output of a forward pass over the word embedding matrix, i.e. just the picked token embeddings. When doing this, one would expect the same behaviour as in other popular models like BERT or RoBERTa, but DistilBERT skips the "positional embeddings addition + LayerNorm + dropout" step because it skips the self.embeddings() call.
I would expect passing "inputs_embeds" instead of "input_ids" to behave coherently across different architectures, but that is not the case, at least for DistilBERT. I am not sure whether other models have this issue.
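A rough, untested sketch of the comparison described above (checkpoint names are only examples; any BERT / DistilBERT checkpoints should show the same thing):

```python
import torch
from transformers import AutoModel, AutoTokenizer

for name in ("bert-base-uncased", "distilbert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    input_ids = tokenizer("hello world", return_tensors="pt")["input_ids"]

    with torch.no_grad():
        out_ids = model(input_ids=input_ids).last_hidden_state
        # "inputs_embeds" is just the output of the word embedding matrix
        inputs_embeds = model.get_input_embeddings()(input_ids)
        out_embeds = model(inputs_embeds=inputs_embeds).last_hidden_state

    # Expected: True for both. Observed: True for BERT, False for DistilBERT,
    # because DistilBERT skips positional embeddings + LayerNorm + dropout.
    print(name, torch.allclose(out_ids, out_embeds, atol=1e-5))
```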
Expected behavior
Properly build the input embeddings by adding positional embeddings + LayerNorm + dropout (as happens in modeling_bert or modeling_roberta). This does not happen because the call to self.embeddings() is skipped.
Hey, can you provide a reproducing script?
@ArthurZucker I don't think I'll have time soon to prepare a reproducing script, but I'm happy to give more details on the issue found.
In DistilBERT, when inputs_embeds is not None, the self.embeddings layer is skipped completely: https://github.com/huggingface/transformers/blob/main/src/transformers/models/distilbert/modeling_distilbert.py
```python
if inputs_embeds is None:
    inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
return self.transformer(
    x=inputs_embeds,
    attn_mask=attention_mask,
    head_mask=head_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```
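In other words, with this code path DistilBERT effectively treats inputs_embeds as the full output of its Embeddings module (word + positional embeddings, LayerNorm and dropout already applied), not as raw word embeddings. A quick illustration of what I mean (distilbert-base-uncased is just an example checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()
input_ids = tokenizer("hello world", return_tensors="pt")["input_ids"]

with torch.no_grad():
    reference = model(input_ids=input_ids).last_hidden_state
    # Passing the full Embeddings output reproduces the input_ids result,
    # because self.embeddings() is skipped entirely.
    full = model(inputs_embeds=model.embeddings(input_ids)).last_hidden_state
    # Passing only the word embeddings (what BERT/RoBERTa expect) does not.
    words = model(inputs_embeds=model.embeddings.word_embeddings(input_ids)).last_hidden_state

print(torch.allclose(reference, full))   # True
print(torch.allclose(reference, words))  # False
```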
However, in BERT the self.embeddings() call always happens and does the following: https://github.com/huggingface/transformers/blob/7d2a5fa749d22f403fe6ceac7d62c003240aee45/src/transformers/models/bert/modeling_bert.py
```python
if inputs_embeds is None:
    inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)

embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == "absolute":
    position_embeddings = self.position_embeddings(position_ids)
    embeddings += position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
```
So, when passing inputs_embeds, the call to self.word_embeddings() does not take place (i.e. inputs_embeds is assumed to already contain the word embeddings), but the positional embedding addition, layer normalization and dropout still happen afterwards.
In summary, passing inputs_embeds to BERT (and other architectures) assumes that inputs_embeds are the word embeddings. However, in DistilBERT, if you pass the word embeddings as inputs_embeds, you end up skipping the positional embedding addition, layer norm and dropout. The docs do not say anything about this difference, and it does not seem reasonable to have to perform the skipped operations manually (sketched below) before passing inputs_embeds to DistilBERT.
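For reference, doing those skipped operations by hand would look roughly like this (a sketch only; the helper name is made up, and the attribute names are taken from modeling_distilbert.py):

```python
import torch

def add_position_ln_dropout(model, word_embeds):
    # Apply what DistilBERT skips when inputs_embeds is passed:
    # positional embeddings + LayerNorm + dropout.
    # word_embeds has shape (bs, seq_length, dim).
    seq_length = word_embeds.size(1)
    position_ids = torch.arange(seq_length, device=word_embeds.device).unsqueeze(0)
    position_embeds = model.embeddings.position_embeddings(position_ids)
    embeddings = model.embeddings.LayerNorm(word_embeds + position_embeds)
    return model.embeddings.dropout(embeddings)

# Usage (hypothetical):
# word_embeds = model.get_input_embeddings()(input_ids)
# outputs = model(inputs_embeds=add_position_ln_dropout(model, word_embeds))
```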
Okay, this does not really need a reproduction script, and I agree with you: the expected behaviour is that if you pre-compute the inputs_embeds, the output of the model should certainly be the same as if you were passing the input ids. This is true for most of our models. I have to check whether there is a particular reason for this in DistilBERT, otherwise I will push a fix!
Great news @ArthurZucker!