
Difference between Gluon and Huggingface embeddings

Open evah88 opened this issue 4 years ago • 15 comments

We have a BERT model that we trained from scratch on a proprietary dataset using Huggingface. I'm trying to port it to the GluonNLP version of BERT and roughly followed the conversion script. Specifically, we found the matching parameter names and then copied the model weights over. The output of the converted Gluon model is different from that of our original Huggingface model, so I'm trying to debug.
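A minimal sketch of that name-matching copy step (not the exact script; name_map here is a hypothetical dict from Huggingface parameter names to the corresponding keys in the Gluon model's collect_params()):

import mxnet as mx

def copy_params(hf_model, gluon_model, name_map):
    # name_map: HF parameter name -> Gluon parameter name (hypothetical, built by hand)
    hf_state = hf_model.state_dict()              # torch tensors keyed by HF names
    gluon_params = gluon_model.collect_params()   # Gluon parameters keyed by Gluon names
    for hf_name, gluon_name in name_map.items():
        array = hf_state[hf_name].cpu().numpy()
        assert gluon_params[gluon_name].shape == array.shape, hf_name
        gluon_params[gluon_name].set_data(mx.nd.array(array))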

To simplify, I calculated the embeddings of the sentence "Hello, my dog is cute" using the pretrained BERT models from GluonNLP and Huggingface, and those encodings are different as well.

Code to calculate GluonNLP embeddings:

import mxnet as mx
import gluonnlp as nlp

# Pretrained BERT base without the MLM decoder and next-sentence classifier heads
model, vocab = nlp.model.get_model('bert_12_768_12',
                                   dataset_name='book_corpus_wiki_en_uncased',
                                   use_classifier=False, use_decoder=False)
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=512, pair=False, pad=False)

# Tokenize, add special tokens, and build the (ids, valid_length, segment_ids) batch
sample = transform(['Hello, my dog is cute'])
words, valid_len, segments = mx.nd.array([sample[0]]), mx.nd.array([sample[1]]), mx.nd.array([sample[2]])
seq_encoding, cls_encoding = model(words, segments, valid_len)

Code to calculate Huggingface embeddings:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Batch of size 1; add_special_tokens=True prepends [CLS] and appends [SEP]
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)
outputs = model(input_ids)

last_hidden_states = outputs[0]  # (1, seq_len, 768) sequence output

The result is that seq_encoding and last_hidden_states are very different. Any suggestions on what we're missing?
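One way to quantify "very different" (a sketch reusing seq_encoding from the GluonNLP snippet and last_hidden_states from the Huggingface snippet above):

import numpy as np

gluon_out = seq_encoding.asnumpy()            # (1, seq_len, 768) from GluonNLP
hf_out = last_hidden_states.detach().numpy()  # (1, seq_len, 768) from Huggingface
seq_len = min(gluon_out.shape[1], hf_out.shape[1])
diff = np.abs(gluon_out[:, :seq_len] - hf_out[:, :seq_len])
print('max abs diff:', diff.max(), ' mean abs diff:', diff.mean())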

evah88 avatar Apr 12 '20 11:04 evah88

I think gluonnlp's vocab has a different ordering from huggingface's, so simply copying the entire embedding matrix will be problematic. @eric-haibin-lin.

szhengac avatar Apr 13 '20 03:04 szhengac

The word embeddings of the inputs are identical, so the difference happens somewhere downstream.

evah88 avatar Apr 13 '20 05:04 evah88

For the bert-base-uncased model, the vocab mapping is different, so the embedding weights need to be shuffled accordingly. @evah88 did you print the word IDs in the batch and compare the two?
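A sketch of that shuffle, assuming the token strings themselves agree between the two vocabularies (hf_vocab would be the Huggingface token-to-id dict, e.g. from the tokenizer's get_vocab(), and gluon_vocab the GluonNLP Vocab):

import numpy as np

def reorder_embedding(hf_embedding, hf_vocab, gluon_vocab):
    # hf_embedding: (vocab_size, hidden) numpy array; hf_vocab: token -> id dict
    order = [hf_vocab[token] for token in gluon_vocab.idx_to_token]
    return hf_embedding[np.array(order)]  # rows now follow the Gluon vocab order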

eric-haibin-lin avatar Apr 13 '20 18:04 eric-haibin-lin

@eric-haibin-lin please take a look here: https://colab.research.google.com/drive/1eZAHtpAP5bzz4PA_gX5HeJyiB0GtaPWX#scrollTo=-5X58oHMFu9P

It looks like the token IDs are the same for Gluon and HuggingFace.

devsentient avatar Apr 14 '20 00:04 devsentient

The bos and eos token ids are different:
HF:    [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]
Gluon: [  2, 7592, 1010, 2026, 3899, 2003, 10140,   3]
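A quick way to reproduce this, as a sketch (the tokenizer and vocab objects from the earlier snippets renamed to hf_tokenizer, gluon_tokenizer and gluon_vocab to avoid the name clash):

print(hf_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))
print(gluon_vocab[['[CLS]'] + gluon_tokenizer('Hello, my dog is cute') + ['[SEP]']])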

eric-haibin-lin avatar Apr 14 '20 20:04 eric-haibin-lin

Yes, but those are the same tokens, and we pass the sentence through the Gluon and HF tokenizers respectively, so shouldn't the output be the same? Why would the same sentence encode differently?

devsentient avatar Apr 14 '20 20:04 devsentient

I think @eric-haibin-lin means that embedding_gluonnlp[2] != embedding_hf[101], since you simply copied the embedding matrix without reordering.
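That is easy to check directly against the two pretrained models (a sketch; gluon_model and hf_model stand for the models loaded in the first two snippets):

import numpy as np

gluon_emb = gluon_model.word_embed[0].weight.data().asnumpy()          # GluonNLP word embedding matrix
hf_emb = hf_model.embeddings.word_embeddings.weight.detach().numpy()   # Huggingface word embedding matrix
print(np.allclose(gluon_emb[2], hf_emb[101], atol=1e-6))  # [CLS] rows
print(np.allclose(gluon_emb[3], hf_emb[102], atol=1e-6))  # [SEP] rows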

szhengac avatar Apr 14 '20 20:04 szhengac

In this example we didn't copy the matrix. We loaded the pretrained models from each framework directly.

In our proprietary use case the token mapping is the same between HF and Gluon, and the parameter matrix is copied, but the output is still different.

devsentient avatar Apr 14 '20 20:04 devsentient

@devsentient @evah88 I'm also trying to transfer a model from Huggingface to MXNet by matching parameter names, but I am transferring GPT-2. Initially I also got different results from Huggingface and MXNet; it took me 2 days to figure out the reason... In the GPT-2 model, Huggingface uses a Conv1D layer to do the matrix projection (x*weight + bias), while GluonNLP uses a Dense layer. Thus, the weight matrix has to be transposed when transferring the parameters. For GPT-2, I did this with the following mapping:

'transformer.h.(\d+).attn.c_proj.weight': '_self_attention_layers.{}._out_proj.weight'.

There might be a similar trick in the BERT model.
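For what it's worth, the transpose described above can be sketched like this (illustrative names; Huggingface's Conv1D stores its weight as (in_features, out_features), while a Gluon Dense layer stores (out_features, in_features)):

import mxnet as mx

def copy_projection(hf_conv1d, gluon_dense):
    weight = hf_conv1d.weight.detach().numpy()           # (in, out) in Huggingface's Conv1D
    gluon_dense.weight.set_data(mx.nd.array(weight.T))   # (out, in) expected by Gluon Dense
    gluon_dense.bias.set_data(mx.nd.array(hf_conv1d.bias.detach().numpy()))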

carter54 avatar May 14 '20 12:05 carter54

Any updates on this? @carter54 @evah88 I'm also struggling with using the weights of BERT in Gluon. I found this approach for converting from DistilBERT, https://nlp.gluon.ai/v0.9.x/model_zoo/conversion_tools/index.html, and adjusted the mapping for BERT.

Does this mapping seem right? (Mapping proposal from Gluon to PyTorch BERT.) Which weight layers need to be transposed? Do some need to be reversed?

andreas-solti avatar Jan 28 '21 09:01 andreas-solti

@andreas-solti the mapping looks correct. I don't think there's a need to transpose the weights. The embedding weight indices need to be shuffled because:

The bos and eos token ids are different:
HF:    [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]
Gluon: [  2, 7592, 1010, 2026, 3899, 2003, 10140,   3]

szha avatar Jan 29 '21 03:01 szha

@szha Thanks a lot for your feedback! Could you please elaborate on the shuffling of the weights? I've tried swapping positions 2 and 101, and positions 3 and 102, respectively. The embedding result is almost accurate, but not quite. The masked language model part also produces a different ordering among the lower-probability results.

Would be really helpful!
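For reference, the swap described above written out as a sketch (hf_word_embedding is a hypothetical NumPy copy of the source word-embedding matrix):

emb = hf_word_embedding.copy()   # hypothetical NumPy array holding the source embedding matrix
emb[[2, 101]] = emb[[101, 2]]    # swap the [CLS] rows (Gluon index 2 <-> HF index 101)
emb[[3, 102]] = emb[[102, 3]]    # swap the [SEP] rows (Gluon index 3 <-> HF index 102)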

andreas-solti avatar Feb 01 '21 16:02 andreas-solti

The embedding result is almost accurate, but not quite.

What's the largest difference? Is it larger than 1E-3, 1E-4 or 1E-5?

leezu avatar Feb 01 '21 16:02 leezu

It is larger than 1E-3 (thanks @leezu for asking!): https://gist.github.com/andreas-solti/43db715d33cb0157b2c535b41dd4573c

And the classification layer on top amplifies these differences further. While mapping the parameter layers, I found that two layers in BERT share their weights: word_embed.0.weight == decoder.3.weight. Do both need to be "swapped"?

andreas-solti avatar Feb 01 '21 16:02 andreas-solti

Here is a reproducible example notebook that translates a German BERT model to MXNet:

https://gist.github.com/andreas-solti/4222c389b8be139e597eccc8350c034b

The output classes look fine in terms of ordering. Since the weights and inputs are exactly the same, I wonder where the smaller/larger differences come from.

andreas-solti avatar Feb 01 '21 20:02 andreas-solti