Extend tokenizer vocabulary with new words
Suppose I have a pre-trained tokenizer, e.g. a BertWordPieceTokenizer, with its own vocabulary. My goal is to use it to tokenize some technical text which will likely contain unknown words (represented as "[UNK]" tokens).
Is there a way to fine-tune the tokenizer so that unknown words are automatically added to its vocabulary? I have found similar issues in the transformers repository (transformers/issues/2691 and transformers/issues/1413), but what they suggest is to manually add unknown tokens, whereas I would like them to be added automatically.
Here's a pseudo-code representation of what I would need:
pre_trained_tokenizer = ...
vocab = pre_trained_tokenizer.get_vocab()
technical_text = [
'some text with unknown words',
'some other text with unknown words',
...
]
updated_tokenizer = pre_trained_tokenizer.train(
technical_text,
initial_vocabulary=vocab
)
new_vocab = updated_tokenizer.get_vocab() # 'new_vocab' contains all words in 'vocab' plus some new words
Can I do that with huggingface/tokenizers and/or huggingface/transformers?
I thought it would be an easy thing to do, but I wasn't able to find anything useful.
No, that's not possible; you'll indeed have to add the tokens manually.
Thanks for the reply. Just to clarify, is it a missing feature of the library or is it a limitation of the tokenization algorithm?
It depends on the specific tokenization algorithm, but in general the tokenizer doesn't save all the training state that would be needed to pick the training back up where it left off.
Most off-the-shelf models have plenty of unused vocabulary entries (e.g. the [unusedN] placeholders in BERT vocabularies) that you could repurpose:
- Train a new vocabulary on the target domain corpus from scratch
- Find the new vocabulary entries that are not in the old vocabulary
- If the number of new entries is outside the desired range, adjust the settings and/or the corpus and repeat from step 1
- Replace the last unused entries with the new entries
- Fine-tune as usual
If your application needs some unused entries for itself, you must of course leave a sufficient number of them.
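As a rough illustration, here is a minimal sketch of that recipe in Python; it assumes a BERT-style vocab.txt downloaded locally with [unusedN] placeholder entries, the huggingface/tokenizers WordPiece trainer, and placeholder file names and vocab_size:

# Sketch only: train a domain vocabulary, diff it against the original one,
# and patch the new entries over the [unusedN] slots. File names and
# vocab_size are placeholders.
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

# 1. Train a new vocabulary on the target-domain corpus from scratch.
domain_tok = BertWordPieceTokenizer(lowercase=True)
domain_tok.train(files=["technical_corpus.txt"], vocab_size=2000)

# 2. Keep only the entries the original tokenizer doesn't already have.
original = BertTokenizerFast.from_pretrained("bert-base-uncased")
orig_vocab = original.get_vocab()
new_entries = [t for t in domain_tok.get_vocab() if t not in orig_vocab]

# 3. Check that they fit into the available [unusedN] slots; otherwise
#    adjust vocab_size and/or the corpus and retrain.
unused_slots = [t for t in orig_vocab if t.startswith("[unused")]
assert len(new_entries) <= len(unused_slots)

# 4. Replace unused entries with the new ones in a copy of vocab.txt
#    (line position == token id, so ids stay stable and no embedding
#    resize is needed before fine-tuning).
mapping = dict(zip(unused_slots, new_entries))
with open("vocab.txt") as src, open("patched_vocab.txt", "w") as dst:
    for line in src:
        token = line.rstrip("\n")
        dst.write(mapping.get(token, token) + "\n")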
Hi @anferico, I don't know if this is what you were looking for, but this could be a possible approach for your problem:
- First, you need to extract tokens out of your data while applying the same preprocessing steps used by the tokenizer. To do so you can just use the tokenizer itself:
new_tokens = tokenizer.basic_tokenizer.tokenize(' '.join(technical_text))
- Now you just add the new tokens to the tokenizer vocabulary:
tokenizer.add_tokens(new_tokens)
This method only adds new tokens, which means you don't have to worry about words already present in the tokenizer's vocab.
The result would be the tokenizer with your specific domain tokens along with the original tokenizer's vocabulary. Of course, you can just encapsulate this in a function and use it like you do in your pseudocode.
Remember that for your model to work, you will need to update the embedding layer with the new augmented vocabulary:
model.resize_token_embeddings(len(tokenizer))
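Putting the pieces together, a minimal self-contained sketch of this approach could look like the following (assuming the slow BertTokenizer, which exposes .basic_tokenizer, and using placeholder example data):

# Sketch only: extend the tokenizer with domain tokens and resize the
# model's embedding matrix accordingly.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

technical_text = [
    "some text with unknown words",
    "some other text with unknown words",
]

# Pre-tokenize with the tokenizer's own preprocessing so the new entries
# match what the tokenizer would actually see at encoding time.
new_tokens = tokenizer.basic_tokenizer.tokenize(" ".join(technical_text))

# add_tokens() skips tokens that are already in the vocabulary and
# returns the number of tokens actually added.
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding layer to cover the added tokens before fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"{num_added} tokens added, new vocab size: {len(tokenizer)}")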
Hope it helps!! :)
I don't think .add_tokens() is implemented. https://github.com/huggingface/transformers/blob/cad61b68396a1a387287a8e2e2fef78a25b79383/src/transformers/tokenization_utils_base.py#L952
You are pointing to a base class, so yes, it's not implemented there.
Real class: https://github.com/huggingface/transformers/blob/cad61b68396a1a387287a8e2e2fef78a25b79383/src/transformers/tokenization_utils_fast.py#L264
I think this official tutorial can help with your question: Training a new tokenizer from an old one
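For reference, here is a minimal sketch of what that tutorial does with train_new_from_iterator (available on fast tokenizers); note that it trains a fresh vocabulary on the new corpus using the old tokenizer's settings rather than extending the old vocabulary:

# Sketch only: retrain the tokenizer's vocabulary on an in-domain corpus,
# keeping the original tokenizer's normalization and pre-tokenization.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

technical_text = [
    "some text with unknown words",
    "some other text with unknown words",
]

# vocab_size is a placeholder; pick one appropriate for your corpus.
new_tokenizer = old_tokenizer.train_new_from_iterator(technical_text, vocab_size=5000)
new_tokenizer.save_pretrained("domain-tokenizer")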