gluon-nlp
[Numpy][Pretrained Model] Add functionality to compress the vocabulary or add new special tokens for fast knowledge transfer
When applying pretrained models to real datasets, we often need to adapt the tokenizer and ensure that we can appropriately transfer the knowledge:

- Case 1: Trim the vocabulary. For example, multilingual models such as XLMR have a tremendously large vocabulary. Sometimes you just want to train a model that works for English and Spanish, not for all the languages supported by XLMR. Here, we may trim the vocabulary to keep only the tokens related to English and Spanish (see the sketch after this list).
- Case 2: Add new tokens to the vocabulary. You may have special tokens other than [CLS] and [SEP] that carry special meaning in the downstream application, and you would like to add more reserved tokens to the existing tokenizer.
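To make Case 1 concrete, here is a minimal numpy sketch of the row selection involved. The names (`old_token_to_idx`, `old_embed_weight`, `kept_tokens`) are hypothetical stand-ins, not GluonNLP APIs: they represent the pretrained token-to-index mapping, the pretrained embedding matrix, and the tokens we decide to keep.

```python
import numpy as np

# Hypothetical stand-ins for the pretrained mapping and embedding matrix.
old_token_to_idx = {'<unk>': 0, '<s>': 1, '</s>': 2, 'hello': 3, 'hola': 4, 'bonjour': 5}
old_embed_weight = np.random.randn(len(old_token_to_idx), 8)   # (vocab_size, units)

# Keep the special tokens plus the tokens observed in the English/Spanish data.
kept_tokens = ['<unk>', '<s>', '</s>', 'hello', 'hola']
new_token_to_idx = {tok: i for i, tok in enumerate(kept_tokens)}

# Copy the corresponding rows of the pretrained embedding into the new weight.
old_ids = np.array([old_token_to_idx[tok] for tok in kept_tokens])
new_embed_weight = old_embed_weight[old_ids]

assert new_embed_weight.shape == (len(kept_tokens), old_embed_weight.shape[1])
```

The same row selection would also have to be applied to any decoder/output-projection weight tied to the vocabulary, and a new (immutable) vocabulary object would be constructed from `kept_tokens` rather than mutating the old one.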
Thus, we should consider how to support both use cases.
@garima3292 Please correct me if I'm wrong here in describing the use-cases.
I think we discussed some of these use cases before, and the conclusion was that they can be easily supported by constructing a new vocabulary object. I don't think we want to open the door to mutability for the vocab, because it can be error-prone once we get into distributed training.
@szha This is not about the vocabulary object itself; it is more about how you should revise the pretrained model, e.g., BERT or XLMR. In the implementation, we should still keep the vocabulary immutable.
In the previous TokenEmbedding implementation, there was logic for shuffling and pruning the weights for a new vocab. I think we just need to find a way to offer that in the new API.
@sxjscience Yes, both of these use cases are important and make sense. Currently, any modification of the vocabulary has to be accompanied by explicitly iterating through the model params (embedding weights or decoder weights) and modifying them for the new model (copying the old ones and, in the case of vocab addition, adding new randomly initialized params). We already have some working code for both vocab addition and vocab pruning that we can share and adapt, and we can then see how these could be offered through APIs.
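To illustrate the vocab-addition workflow described above, here is a minimal numpy sketch. All names here are hypothetical (including the new tokens and the 0.02 init scale); an API for this would wrap the same copy-and-extend logic around the actual model parameters (embedding weights and, where applicable, decoder weights).

```python
import numpy as np

# Hypothetical stand-ins for the pretrained mapping and embedding matrix.
old_token_to_idx = {'<unk>': 0, '[CLS]': 1, '[SEP]': 2, 'hello': 3}
old_embed_weight = np.random.randn(len(old_token_to_idx), 8)   # (vocab_size, units)

# Append the new reserved tokens after the existing ones so that the old
# indices (and hence the copied rows) remain valid.
new_special_tokens = ['[ENT]', '[REL]']
new_token_to_idx = dict(old_token_to_idx)
for tok in new_special_tokens:
    new_token_to_idx[tok] = len(new_token_to_idx)

# Copy the pretrained rows and randomly initialize the rows for the new tokens.
units = old_embed_weight.shape[1]
new_rows = np.random.normal(scale=0.02, size=(len(new_special_tokens), units))
new_embed_weight = np.concatenate([old_embed_weight, new_rows], axis=0)

assert new_embed_weight.shape[0] == len(new_token_to_idx)
```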