
Extend tokenizer vocabulary with new words

Open anferico opened this issue 4 years ago • 12 comments

Suppose I have a pre-trained tokenizer, e.g. a BertWordPieceTokenizer, with its own vocabulary. My goal is to use it to tokenize some technical text which will likely contain unknown words (represented as "[UNK]" tokens).

Is there a way to fine-tune the tokenizer so that unknown words are automatically added to its vocabulary? I have found similar issues in the transformers repository (transformers/issues/2691 and transformers/issues/1413), but what they suggest is to manually add unknown tokens, whereas I would like them to be added automatically.

Here's a pseudo-code representation of what I would need:

pre_trained_tokenizer = ...
vocab = pre_trained_tokenizer.get_vocab()

technical_text = [
  'some text with unknown words',
  'some other text with unknown words',
  ...
]

updated_tokenizer = pre_trained_tokenizer.train(
  technical_text,
  initial_vocabulary=vocab
)

new_vocab = updated_tokenizer.get_vocab()  # 'new_vocab' contains all words in 'vocab' plus some new words

Can I do that with huggingface/tokenizers and/or huggingface/transformers? I thought it would be an easy thing to do, but I wasn't able to find anything useful.

anferico avatar Feb 11 '21 15:02 anferico

No, that's not possible; you'll have to add the tokens manually indeed.

n1t0 avatar Feb 12 '21 17:02 n1t0

Thanks for the reply. Just to clarify, is it a missing feature of the library or is it a limitation of the tokenization algorithm?

anferico avatar Feb 12 '21 19:02 anferico

It depends on the specific tokenization algorithm, but in general the tokenizer doesn't save all the training state that would be needed to resume training where it previously left off.

n1t0 avatar Feb 18 '21 22:02 n1t0

Most off-the-shelf models have plenty of unused vocabulary entries that you could repurpose:

  1. Train a new vocabulary on the target domain corpus from scratch
  2. Find the new vocabulary entries that are not in the old vocabulary
  3. If the number of new entries is outside the desired range, adjust the settings and/or corpus and repeat from step 1
  4. Replace the last unused entries with the new entries
  5. Fine-tune as usual

If your application needs some unused entries for itself, you must of course leave a sufficient number of them free.
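
A minimal sketch of this recipe (the checkpoint name, paths, and vocab_size below are illustrative assumptions, and rewriting vocab.txt is just one way to carry out step 4):

import os
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer

technical_text = [
  'some text with unknown words',
  'some other text with unknown words',
]

# Step 1: train a new WordPiece vocabulary on the domain corpus from scratch.
domain_tokenizer = BertWordPieceTokenizer()
domain_tokenizer.train_from_iterator(technical_text, vocab_size=5000)

# Step 2: keep only the entries the pre-trained vocabulary does not already have.
old_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
old_vocab = set(old_tokenizer.get_vocab())
new_entries = [tok for tok in domain_tokenizer.get_vocab() if tok not in old_vocab]

# Steps 3-4: overwrite the last [unusedN] lines in vocab.txt with the new entries,
# so existing token ids and the overall vocabulary size stay unchanged.
os.makedirs('domain-bert', exist_ok=True)
old_tokenizer.save_pretrained('domain-bert')
with open('domain-bert/vocab.txt', encoding='utf-8') as f:
    vocab_lines = f.read().splitlines()
unused_positions = [i for i, tok in enumerate(vocab_lines) if tok.startswith('[unused')]
assert len(new_entries) <= len(unused_positions), 'adjust settings/corpus and repeat step 1'
for pos, tok in zip(reversed(unused_positions), new_entries):
    vocab_lines[pos] = tok
with open('domain-bert/vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_lines) + '\n')

# Step 5: load the updated tokenizer and fine-tune as usual.
updated_tokenizer = BertTokenizer.from_pretrained('domain-bert')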

jowagner avatar Feb 23 '21 15:02 jowagner

Hi @anferico , I don't know if this is what you were looking for, but this could be a possible approach for your problem:

  1. First, you need to extract tokens from your data while applying the same preprocessing steps used by the tokenizer. To do so, you can just use the tokenizer itself: new_tokens = tokenizer.basic_tokenizer.tokenize(' '.join(technical_text))
  2. Now you just add the new tokens to the tokenizer's vocabulary: tokenizer.add_tokens(new_tokens). This method only adds tokens that aren't already present, so you don't have to worry about words already in the tokenizer's vocab.

The result would be the tokenizer with your specific domain tokens along with the original tokenizer's vocabulary. Of course, you can just encapsulate this in a function and use it like you do in your pseudocode.

Remember that for your model to work, you will need to update the embedding layer with the new augmented vocabulary: model.resize_token_embeddings(len(tokenizer))
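
A minimal sketch tying these steps together (the checkpoint name and the extend_tokenizer helper are illustrative, not an official API):

from transformers import BertModel, BertTokenizer

def extend_tokenizer(tokenizer, model, texts):
    # 1. Extract candidate tokens using the tokenizer's own preprocessing.
    new_tokens = tokenizer.basic_tokenizer.tokenize(' '.join(texts))
    # 2. Add them; tokens already in the vocabulary are skipped.
    num_added = tokenizer.add_tokens(sorted(set(new_tokens)))
    # 3. Grow the model's embedding matrix to match the enlarged vocabulary.
    model.resize_token_embeddings(len(tokenizer))
    return num_added

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
technical_text = [
  'some text with unknown words',
  'some other text with unknown words',
]
extend_tokenizer(tokenizer, model, technical_text)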

Hope it helps!! :)

juanjucm avatar Mar 01 '21 11:03 juanjucm

I don't think .add_tokens() is implemented. https://github.com/huggingface/transformers/blob/cad61b68396a1a387287a8e2e2fef78a25b79383/src/transformers/tokenization_utils_base.py#L952

dplaniel avatar May 06 '22 04:05 dplaniel

You are pointing to a base class, so yes, it's not implemented there.

Real class: https://github.com/huggingface/transformers/blob/cad61b68396a1a387287a8e2e2fef78a25b79383/src/transformers/tokenization_utils_fast.py#L264

Narsil avatar May 06 '22 08:05 Narsil

I think this official tutorial can help with your question: Training a new tokenizer from an old one
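
For reference, a minimal sketch of what that tutorial covers (checkpoint name and vocab_size are illustrative); note that it trains a brand-new vocabulary reusing the old tokenizer's pipeline rather than extending the existing one:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
technical_text = [
  'some text with unknown words',
  'some other text with unknown words',
]
# Retrain the tokenizer's model on the domain corpus, keeping the same
# normalization, pre-tokenization, and special tokens as the original.
new_tokenizer = old_tokenizer.train_new_from_iterator(technical_text, vocab_size=30000)
new_tokenizer.save_pretrained('domain-tokenizer')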

harveyaot avatar Jun 25 '23 04:06 harveyaot