gluon-nlp
[Numpy][Pretrained Model] Add functionality to compress the vocabulary or add new special tokens for fast knowledge transfer
When applying pretrained models to real datasets, we often need to adapt the tokenizer and ensure that we can appropriately transfer the knowledge:

- Case 1: Trim the vocabulary. For example, multilingual models such as XLMR have a tremendously large vocabulary. Sometimes you just want to train a model that works for English and Spanish, not for all the languages supported by XLMR. Here, we may trim the vocabulary to keep only the tokens related to English and Spanish (see the sketch after this list).
- Case 2: Add new tokens to the vocabulary. You may have special tokens other than [CLS] and [SEP] that carry special meaning in the downstream application, and you would like to add more reserved tokens to the existing tokenizer.
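To make Case 1 concrete, here is a minimal numpy sketch of the row selection involved. The names (`old_token_to_idx`, `old_embed_weight`, `kept_tokens`) are hypothetical stand-ins, not GluonNLP APIs: they represent the pretrained token-to-index mapping, the pretrained embedding matrix, and the tokens we decide to keep.

```python
import numpy as np

# Hypothetical stand-ins for the pretrained mapping and embedding matrix.
old_token_to_idx = {'<unk>': 0, '<s>': 1, '</s>': 2, 'hello': 3, 'hola': 4, 'bonjour': 5}
old_embed_weight = np.random.randn(len(old_token_to_idx), 8)   # (vocab_size, units)

# Keep the special tokens plus the tokens observed in the English/Spanish data.
kept_tokens = ['<unk>', '<s>', '</s>', 'hello', 'hola']
new_token_to_idx = {tok: i for i, tok in enumerate(kept_tokens)}

# Copy the corresponding rows of the pretrained embedding into the new weight.
old_ids = np.array([old_token_to_idx[tok] for tok in kept_tokens])
new_embed_weight = old_embed_weight[old_ids]

assert new_embed_weight.shape == (len(kept_tokens), old_embed_weight.shape[1])
```

The same row selection would also have to be applied to any decoder/output-projection weight tied to the vocabulary, and a new (immutable) vocabulary object would be constructed from `kept_tokens` rather than mutating the old one.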
Thus, we should consider how to support both use cases.
@garima3292 Please correct me if I'm wrong here in describing the use-cases.
I think we discussed some of these use cases before, and the conclusion was that they can be easily supported by constructing a new vocabulary object. I don't think we want to open the door to mutability for the vocab, because it can be error-prone once we get into distributed training.
@szha This is not about the vocabulary object itself; it is more about how you should revise the pretrained model, e.g., BERT or XLMR. In the implementation, we should still keep the vocabulary immutable.
In the previous TokenEmbedding implementation, there was logic for shuffling and pruning the weights for a new vocab. I think we just need to find a way to offer that in the new API.
@sxjscience Yes, both of these use cases are important and make sense. Currently, any modification of the vocabulary has to be accompanied by explicitly iterating through the model params (embedding weights or decoder weights) and modifying them for the new model (copying the old ones and, in the case of vocab addition, adding new randomly initialized params). We already have some working code for both vocab addition and vocab pruning that we can share and adapt, and we can then see how these could be offered through APIs.
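To illustrate the vocab-addition workflow described above, here is a minimal numpy sketch. All names here are hypothetical (including the new tokens and the 0.02 init scale); an API for this would wrap the same copy-and-extend logic around the actual model parameters (embedding weights and, where applicable, decoder weights).

```python
import numpy as np

# Hypothetical stand-ins for the pretrained mapping and embedding matrix.
old_token_to_idx = {'<unk>': 0, '[CLS]': 1, '[SEP]': 2, 'hello': 3}
old_embed_weight = np.random.randn(len(old_token_to_idx), 8)   # (vocab_size, units)

# Append the new reserved tokens after the existing ones so that the old
# indices (and hence the copied rows) remain valid.
new_special_tokens = ['[ENT]', '[REL]']
new_token_to_idx = dict(old_token_to_idx)
for tok in new_special_tokens:
    new_token_to_idx[tok] = len(new_token_to_idx)

# Copy the pretrained rows and randomly initialize the rows for the new tokens.
units = old_embed_weight.shape[1]
new_rows = np.random.normal(scale=0.02, size=(len(new_special_tokens), units))
new_embed_weight = np.concatenate([old_embed_weight, new_rows], axis=0)

assert new_embed_weight.shape[0] == len(new_token_to_idx)
```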