
demo: porting to transformers v4.2.2

Open kevinalexmathews opened this issue 4 years ago • 1 comment

demo_model.py works well with transformers v2.3.0. However, when I try a more recent version, I get the embedding error "IndexError: index out of range in self".

I localized the error to the value of tokenizer.additional_special_tokens_ids, which is [100] in v2.3.0 and [30523] in v4.2.2. If I replace all occurrences of 30523 with 100, I can reproduce the results, but this is just a dirty fix. Is there a more elegant solution? Am I missing anything with respect to the tokenizer?
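
A minimal way to see the mismatch (the model path is a placeholder; in demo_model.py it corresponds to args.model_dir, and the printed ids are the values observed under each version):

```python
from transformers import BertTokenizer

# Placeholder for the trained BERT-WSD model directory
# (the same path passed to demo_model.py as args.model_dir).
model_dir = "path/to/bert-wsd-model"
tokenizer = BertTokenizer.from_pretrained(model_dir)

print(tokenizer.additional_special_tokens)      # ['[TGT]']
print(tokenizer.additional_special_tokens_ids)  # [100] under v2.3.0, [30523] under v4.2.2
```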

kevinalexmathews (Mar 29 '21)

Hi Kevin, thanks for noticing and narrowing down the version incompatibility issue!

So after doing some detective work, here's what I think happened to the tokenizers in different versions of the transformers module:

  1. The older version (v2.3.0) did not include the uppercase '[TGT]' in its special token mappings (tokenizer.added_tokens_encoder) but did include it in the special token list (tokenizer.additional_special_tokens). This causes the uppercase '[TGT]' to be automatically mapped to id 100 (the [UNK] id) instead of 30523 when tokenizer.additional_special_tokens_ids is called; the embedding at position 100 is what gets tuned during training.
  2. The newer version (v4.2.2) includes both '[TGT]' and '[tgt]' in tokenizer.added_tokens_encoder. It assigns id 30523 to the '[TGT]' token, which causes the index-out-of-range error because there is no position 30523 in the learned embeddings (which were trained under v2.3.0). A sketch of the difference follows this list.
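
A rough sketch of how this shows up at the tokenizer level (the model path is a placeholder for the directory demo_model.py loads; the ids are those reported above):

```python
from transformers import BertTokenizer

# Placeholder for the trained BERT-WSD model directory (args.model_dir in demo_model.py).
model_dir = "path/to/bert-wsd-model"
tokenizer = BertTokenizer.from_pretrained(model_dir)

# v2.3.0: '[TGT]' is absent from added_tokens_encoder, so the id lookup falls
# back to the [UNK] id (100), the embedding position that was tuned during training.
# v4.2.2: '[TGT]' is registered in added_tokens_encoder with id 30523, a position
# that does not exist in the embeddings trained under v2.3.0.
print('[TGT]' in tokenizer.added_tokens_encoder)  # False under v2.3.0, True under v4.2.2
print(tokenizer.convert_tokens_to_ids("[TGT]"))   # 100 vs. 30523
```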

One way to "patch" this is to manually map the token back by adding the line tokenizer.added_tokens_encoder['[TGT]'] = 100 right after the tokenizer = BertTokenizer.from_pretrained(args.model_dir) line, which is essentially what you suggested. To me this is the most elegant solution right now, short of retraining the whole model under the newer version.
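
In context, the patched loading code in demo_model.py would look roughly like this (only the last line is new; args.model_dir comes from the script's existing argument parser):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(args.model_dir)
# Work around the version difference: pin '[TGT]' to id 100, the embedding
# position that was actually tuned when the model was trained under v2.3.0.
tokenizer.added_tokens_encoder['[TGT]'] = 100
```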

BPYap (Apr 01 '21)