Convert saved pretrained tokenizers from transformers to tokenizers

Open NonaryR opened this issue 5 years ago • 6 comments

Hello! I'm using BertTokenizer from the transformers library and added some special tokens for my case. After that, I save it with save_pretrained, which produces added_tokens.json, special_tokens_map.json, tokenizer_config.json and vocab.txt in a directory that I create. It would be super nice if I could pass that directory with all those configs to BertWordPieceTokenizer from the tokenizers library. Can I do this somehow?

NonaryR avatar Apr 11 '20 22:04 NonaryR
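As an illustration of what the question is after, here is a minimal sketch, assuming the directory written by save_pretrained (the path below is hypothetical). BertWordPieceTokenizer can only be pointed at the vocab.txt file, so any tokens from added_tokens.json and special_tokens_map.json still have to be re-registered by hand.

```python
from tokenizers import BertWordPieceTokenizer

# Point the fast tokenizer at the vocab file written by save_pretrained.
# "./my-bert-tokenizer" is a placeholder for the directory mentioned above.
tokenizer = BertWordPieceTokenizer("./my-bert-tokenizer/vocab.txt", lowercase=True)

# The extra tokens from added_tokens.json / special_tokens_map.json are not
# picked up automatically; they need to be re-added manually.
tokenizer.add_special_tokens(["[NEW_SPECIAL]"])
tokenizer.add_tokens(["new_domain_token"])

print(tokenizer.encode("hello [NEW_SPECIAL] world").tokens)
```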

Hi, I have a similar issue. I trained a BPE tokenizer, and now when I load it using the transformers AutoTokenizer module, it gives me an error.

muhammadfahid51 avatar Apr 17 '20 08:04 muhammadfahid51

There's no easy way for now. This will be possible as soon as we have https://github.com/huggingface/tokenizers/issues/15

n1t0 avatar Apr 22 '20 17:04 n1t0

Use the classes matching your model: if you are pretraining RoBERTa, load your custom model with the RoBERTa model class's from_pretrained. Similarly, use RobertaTokenizer to load your custom tokenizer.

muhammadfahid51 avatar Apr 22 '20 17:04 muhammadfahid51
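For example, a minimal sketch of that approach, assuming a RoBERTa-style model and tokenizer were both saved with save_pretrained (the directory path is a placeholder):

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM

# "./my-roberta" is a hypothetical directory containing vocab.json, merges.txt,
# the tokenizer configs, and the model weights saved via save_pretrained.
tokenizer = RobertaTokenizer.from_pretrained("./my-roberta")
model = RobertaForMaskedLM.from_pretrained("./my-roberta")

input_ids = tokenizer.encode("Hello world", return_tensors="pt")
outputs = model(input_ids)
```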

@n1t0 With version 0.8, is there a way to perform the conversion from a pretrained/slow tokenizer to a fast tokenizer? Even just a manual procedure to convert a binary file like sentencepiece.bpe.model to the right format? (https://github.com/huggingface/tokenizers/issues/291? https://github.com/huggingface/tokenizers/blob/master/bindings/python/scripts/sentencepiece_extractor.py ?) The use case would be to accelerate tokenization during inference on a pretrained model.

pommedeterresautee avatar Jun 29 '20 11:06 pommedeterresautee

There is no easy way at the moment. For tokenizers that use BPE, you can probably do it manually in some cases, but you will need to dig into how the tokenizer works to do so. We don't have a guide to follow for this, unfortunately.

Whenever we can support a fast version of a tokenizer provided in transformers, you can be sure that we convert it and make it available through transformers. And now that #15 has been merged, we will soon be able to provide files that can be used both in transformers with the fast tokenizers and with the tokenizers library directly. This is still a work in progress, though.

If you are interested in converting a SentencePiece tokenizer, you should make sure it is a BPE model first; then you can maybe use the script you mentioned to convert it and use it with SentencePieceBPETokenizer. But I think most SentencePiece tokenizers use Unigram, which is not entirely supported for now (cf. #292 for work in progress).

n1t0 avatar Jun 29 '20 15:06 n1t0
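A rough sketch of that last path, assuming the extractor script has already produced a vocab.json and merges.txt from a BPE SentencePiece model (the file names here are placeholders):

```python
from tokenizers import SentencePieceBPETokenizer

# vocab.json / merges.txt are assumed to be the output of the
# sentencepiece_extractor.py script run on a BPE SentencePiece model.
tokenizer = SentencePieceBPETokenizer("vocab.json", "merges.txt")

encoding = tokenizer.encode("Accelerated tokenization at inference time")
print(encoding.tokens)
```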

Thanks a lot for the answer. I am targeting XLM-RoBERTa; according to its paper, it uses the Unigram algorithm. So I will wait for #292 :-)

pommedeterresautee avatar Jun 29 '20 19:06 pommedeterresautee

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 28 '24 01:05 github-actions[bot]