
Fine-tune a BPE tokenizer by only adding merge rules

Open cccntu opened this issue 2 years ago • 6 comments

Fine-tune a BPE tokenizer by only adding merge rules

Only add new merge rules, never remove existing ones.

Motivation

I want to update a GPT2 tokenizer on my corpus, without manually adding special tokens.

The current code doesn't seem to support that.

https://huggingface.co/docs/tokenizers/python/latest/api/reference.html?highlight=initial_alphabet#tokenizers.trainers.BpeTrainer

initial_alphabet (List[str], optional) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.

possible implementation

  • use the existing interface
k = num_new_tokens = 100
new_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=len(old_vocab)+k, initial_alphabet=list(old_vocab.keys())
)

From my understanding, training from an existing vocab is just starting at this step, with 'ug' already in the vocabulary:

("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

https://huggingface.co/docs/transformers/tokenizer_summary#bytepair-encoding-bpe
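For illustration, here is a toy sketch (not the library's actual trainer; the names are made up) of resuming from an existing vocabulary: 'ug' is already merged, and we just keep counting pair frequencies and appending new merges in order.

from collections import Counter

# Word frequencies from the example above, pre-split with the existing merges
# ('ug' is already a single symbol).
corpus = [
    (("h", "ug"), 10),
    (("p", "ug"), 5),
    (("p", "u", "n"), 12),
    (("b", "u", "n"), 4),
    (("h", "ug", "s"), 5),
]
new_merges = []  # merges to append after the existing merge table

def best_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(corpus, pair):
    # Replace every occurrence of `pair` by the merged symbol.
    merged_corpus = []
    for symbols, freq in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_corpus.append((tuple(out), freq))
    return merged_corpus

k = 3  # number of new merges to add on top of the existing table
for _ in range(k):
    pair = best_pair(corpus)
    if pair is None:
        break
    new_merges.append(pair)
    corpus = apply_merge(corpus, pair)

print(new_merges)  # [('u', 'n'), ('h', 'ug'), ('p', 'un')]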

alternative solutions

  • Adding special tokens manually, but this requires manually deciding what to add.
  • Writing code to implement the BPE training algorithm myself, then adding the new tokens as special tokens.

cccntu avatar Dec 03 '21 15:12 cccntu

It seems like a sound idea: since the merge table is applied in sorted order, we should in theory be able to resume from partial results (re-encode the dataset with the current merge table, then continue appending to the merge table in order).

This seems like a non-negligible amount of work. Isn't retraining with a larger vocab size good enough?

Note: This definitely wouldn't be called initial_alphabet since it serves a different purpose, and you would also need to feed the merge table in some form.

Narsil avatar Dec 29 '21 09:12 Narsil

Note: This definitely wouldn't be called initial_alphabet since it serves a different purpose, and you would also need to feed the merge table in some form.

I see that now.

This seems like a non-negligible amount of work. Isn't retraining with a larger vocab size good enough?

I agree that it's a non-negligible amount of work. However, retraining with a larger vocab can lose some of the benefits of pre-training, assuming the token embeddings are re-initialized. I once tried re-training the tokenizer for a task and it was actually worse than not doing it.

On that note, we may want to initialize the embeddings of the new merged tokens using the mean of their components' embeddings.
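A rough sketch of that idea (not part of tokenizers; it assumes a transformers-style model, and the new_token_components mapping is made up for illustration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical: each new merged token mapped to the existing token ids it was built from.
new_token_components = {
    "hugs": tokenizer.encode("h") + tokenizer.encode("ug") + tokenizer.encode("s"),
}

orig_vocab_size = len(tokenizer)
tokenizer.add_tokens(list(new_token_components))
model.resize_token_embeddings(len(tokenizer))

embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for token, component_ids in new_token_components.items():
        new_id = tokenizer.convert_tokens_to_ids(token)
        if new_id < orig_vocab_size:
            continue  # token already existed; keep its trained embedding
        # Initialize the new embedding as the mean of its components' embeddings.
        embeddings[new_id] = embeddings[component_ids].mean(dim=0)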

cccntu avatar Dec 30 '21 13:12 cccntu

Hi @cccntu ,

The tokenizer never knows anything about the embeddings. It has no clue whatsoever.

AFAIK, when fine-tuning a model you never touch the tokenizer itself. If you do, and add words/tokens to the vocabulary, then those embeddings will necessarily start out empty, and it will take time for them to be trained to the level of the fine-tuned ones anyway.

Retraining the tokenizer implies retraining the whole model, for sure. But fine-tuning the tokenizer is unlikely to produce good results for the new tokens either (at least within the short training time that fine-tuning usually gets by with).

Narsil avatar Dec 30 '21 15:12 Narsil

Hi @Narsil ,

I agree that adding new tokens would require longer training to actually pay off, but that happens to be the scenario I have in mind 😃.

Actually, I would be interested in looking at the code, but I only found a Rust version and I don't speak Rust. Is there a compatible Python version?

cccntu avatar Jan 06 '22 15:01 cccntu

Unfortunately, no: this library is written in Rust, exclusively for performance.

There's also no clean way I can point you toward the code, since this would touch quite a few places: it breaks the assumption that once a model is trained, it is not modified afterwards.

Here is the core of the BPE training algorithm: https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/models/bpe/trainer.rs#L420

Another way you could take a stab at it is to write a Python version of the update and simply save the result back as the standard .json file, so that this library can import it again. Wdyt?
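A rough sketch of that route (the tokenizer.json layout used here is an assumption based on GPT-2's BPE model, and the new merges are made up):

import json
from tokenizers import Tokenizer

Tokenizer.from_pretrained("gpt2").save("tokenizer.json")

with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

# Hypothetical new merges learned on your corpus, in the order they were learned.
new_merges = [("h", "ug"), ("hug", "s")]

vocab = data["model"]["vocab"]    # token -> id
merges = data["model"]["merges"]  # existing merge rules, in priority order
next_id = max(vocab.values()) + 1

for left, right in new_merges:
    merges.append(f"{left} {right}")  # assuming merges are stored as "left right" strings
    token = left + right
    if token not in vocab:
        vocab[token] = next_id
        next_id += 1

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

updated = Tokenizer.from_file("tokenizer.json")  # the library can load the edited file again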

Narsil avatar Jan 06 '22 15:01 Narsil

Thanks a lot for the suggestion. That's a good idea. I'll give it a try some time.

cccntu avatar Jan 08 '22 05:01 cccntu

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 08 '24 01:03 github-actions[bot]