Convert saved pretrained tokenizers from transformers to tokenizers
Hello! I'm using BertTokenizer from the transformers library and added some special tokens for my case. After that, I save it with save_pretrained, which produces added_tokens.json, special_tokens_map.json, tokenizer_config.json and vocab.txt in a directory that I create.
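Roughly what I'm doing (the model name and the extra token strings here are just placeholders for my actual setup):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Register a few custom special tokens (placeholders for my real ones).
tokenizer.add_special_tokens({"additional_special_tokens": ["[E1]", "[E2]"]})

# Writes vocab.txt, special_tokens_map.json, tokenizer_config.json,
# and added_tokens.json into the directory.
tokenizer.save_pretrained("./my_bert_tokenizer")
```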
It would be super nice if I could pass that directory, with all those configs, to BertWordPieceTokenizer from the tokenizers library. Can I do this somehow?
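Something like this is what I have in mind (just a sketch of the manual route; I'm assuming BertWordPieceTokenizer only accepts the plain vocab file, so the tokens from added_tokens.json would have to be re-registered by hand):

```python
import json
from tokenizers import BertWordPieceTokenizer

# Load only the vocab file; the other config files are ignored here.
fast_tokenizer = BertWordPieceTokenizer("./my_bert_tokenizer/vocab.txt")

# Re-register the extra tokens recorded by transformers on save.
with open("./my_bert_tokenizer/added_tokens.json") as f:
    added = json.load(f)  # maps token string -> id
fast_tokenizer.add_special_tokens(list(added.keys()))
```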
Hi, I have a similar issue. I trained a BPE tokenizer, and now when I load it using the transformers AutoTokenizer module, it gives me an error.
There's no easy way for now. This will be possible as soon as we have https://github.com/huggingface/tokenizers/issues/15
Use the class that matches the model: if you are pretraining RoBERTa, use RobertaModel.from_pretrained for loading your custom model. Similarly, use RobertaTokenizer for loading your custom tokenizer.
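For example (a sketch, assuming the directory contains the vocab.json and merges.txt produced when the BPE tokenizer was trained; the path is a placeholder):

```python
from transformers import RobertaTokenizer

# The directory needs vocab.json and merges.txt from the trained BPE tokenizer.
tokenizer = RobertaTokenizer.from_pretrained("./my_bpe_tokenizer")
print(tokenizer.tokenize("Hello world"))
```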
@n1t0 With version 0.8, is there a way to perform the conversion from a pretrained/slow tokenizer to a fast tokenizer?
Even just a manual procedure to convert a binary file like sentencepiece.bpe.model to the right format? (https://github.com/huggingface/tokenizers/issues/291? https://github.com/huggingface/tokenizers/blob/master/bindings/python/scripts/sentencepiece_extractor.py ?)
The use case would be to accelerate tokenization during inference on a pretrained model.
There is no easy way at the moment. For tokenizers that use a BPE, you can probably do it manually in some cases, but you will need to dig into how the tokenizer works to do so. We don't have a guide for this, unfortunately.
Whenever we are able to support a fast version of a tokenizer provided in transformers, you can be sure that we convert it and make it available through transformers. And now that #15 has been merged, we will soon be able to provide files that can be used both in transformers with the fast tokenizers, and with the tokenizers library directly. This is still a work in progress, though.
If you are interested in converting a SentencePiece tokenizer, you should make sure it uses BPE first; then you can maybe use the script you mentioned to convert it and use the result with SentencePieceBPETokenizer, as sketched below. But I think most SentencePiece tokenizers use Unigram, which is not entirely supported for now (cf. #292 for a work in progress).
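A rough sketch of that manual path, assuming the extractor script has already produced a vocab.json and merges.txt from your sentencepiece.bpe.model (the file names are examples, check the script's --help for its exact arguments, and again this only applies to BPE-based SentencePiece models):

```python
from tokenizers import SentencePieceBPETokenizer

# vocab.json / merges.txt as extracted by sentencepiece_extractor.py.
# Constructor argument names vary a bit between versions, so they are
# passed positionally here.
fast_tokenizer = SentencePieceBPETokenizer("vocab.json", "merges.txt", unk_token="<unk>")

encoding = fast_tokenizer.encode("Hello world")
print(encoding.tokens)
```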
Thanks a lot for the answer. I am targeting XLM-RoBERTa, and according to its paper it is the Unigram algorithm that was used.
So I will wait for #292 :-)