transformers
convert fast tokenizers to slow
Feature request
Recently I noticed that models being uploaded now ship only their fast tokenizers, and the sentencepiece model (which is included in the slow version) is missing. I need the sentencepiece model of some tokenizers for a personal project and wanted to know the best way to go about that. After looking through the current code in the repository, I saw there are a lot of methods for handling conversion from slow to fast tokenization, so I think it should be possible the other way around too. After a bit of research, the only quick and dirty way I could think of was writing a utility script that converts the JSON files of the fast tokenizer into the spm model format of a slow tokenizer, because I think the information in both is the same, so the mechanics should be similar too.
Motivation
I looked through the tokenizers and saw that most of the ones being uploaded don't have slow tokenizers.
Your contribution
If there is any way I can help I would love to know, I just need some guidance on how to implement this!
I don't think it's possible to get the sentencepiece model from the tokenizer.json file, but maybe @Narsil knows a way.
Hey @Narsil, can you please give some insight on this?
You could try and create inverse scripts for the conversion you found. But it's not going to be trivial.
You need to create the protobuf sentencepiece expects.
Not sure I can provide much more guidance.
Why do you want slow tokenizers if I may ask?
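For reference, the protobuf mentioned here is sentencepiece's `ModelProto`; the Python bindings ship it as `sentencepiece_model_pb2` (this needs the `sentencepiece` and `protobuf` pip packages, and the file name below is a placeholder). A rough way to inspect the fields an inverse converter would have to fill:

```python
# Sketch only: inspect the ModelProto behind an existing spm model file.
from sentencepiece import sentencepiece_model_pb2 as sp_model

proto = sp_model.ModelProto()
with open("some_existing.model", "rb") as f:  # placeholder path
    proto.ParseFromString(f.read())

# Main fields a fast -> slow converter would need to populate:
print(len(proto.pieces))                                # vocab entries (piece, score, type)
print(proto.trainer_spec.model_type)                    # UNIGRAM / BPE / ...
print(len(proto.normalizer_spec.precompiled_charsmap))  # normalization table bytes
```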
Hey @Narsil, thanks for the reply, but I found a fix for my issue :)
Awesome. Do you mind explaining a little more or giving links for potential readers that would want to do the same?
For sure!
I noticed that you have code for converting an spm model (a slow tokenizer) to a tokenizer.json (a fast tokenizer). I also noticed that for some models the spm model was not uploaded even though the tokenizer was spm-based. To get the spm model back from the uploaded tokenizer.json, I had to figure out how to manually create an spm model containing the same information that is stored in the tokenizer.json.
For example, I had to copy the vocabulary, precompiled_charsmap, and other special tokens and manually edit a blank spm file (it already had the correct architecture and some dummy data that I removed while editing). Once all the information was copied over to the spm file, it worked as expected.
Here is a notebook demonstrating the process:
https://colab.research.google.com/drive/1kfC_iEuU0upVQ5Y3rnnl5VSngSPuiSQI?usp=sharing
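For readers who just want the shape of that "edit a donor spm proto" approach, here is a hedged sketch. It assumes a Unigram-style tokenizer.json (where `model.vocab` is a list of `[piece, score]` pairs); `donor.model`, `tokenizer.json` and the exact JSON layout are assumptions that depend on the tokenizer you start from, so verify them against your files.

```python
# Sketch only: rebuild an spm .model from a Unigram tokenizer.json by editing
# a donor ModelProto, as described above. Requires the `sentencepiece` and
# `protobuf` pip packages; file names and JSON keys are assumptions.
import base64
import json

from sentencepiece import sentencepiece_model_pb2 as sp_model

proto = sp_model.ModelProto()
with open("donor.model", "rb") as f:  # an spm model with the right architecture
    proto.ParseFromString(f.read())

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

# 1) Copy the vocabulary: Unigram tokenizer.json stores [piece, score] pairs.
del proto.pieces[:]
for piece, score in tok["model"]["vocab"]:
    entry = proto.pieces.add()
    entry.piece = piece
    entry.score = float(score)

# 2) Copy the precompiled_charsmap (base64 in tokenizer.json, raw bytes in the
#    proto). Depending on the tokenizer it may be nested under a "Sequence"
#    of normalizers.
normalizer = tok.get("normalizer") or {}
charsmap = normalizer.get("precompiled_charsmap")
if charsmap is None:
    for sub in normalizer.get("normalizers", []):
        charsmap = sub.get("precompiled_charsmap") or charsmap
if charsmap:
    proto.normalizer_spec.precompiled_charsmap = base64.b64decode(charsmap)

# 3) Special tokens (unk/bos/eos/pad, control pieces) still need to be checked
#    by hand against the donor proto, as described above.

with open("reconstructed.model", "wb") as f:
    f.write(proto.SerializeToString())
```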
@ahmedlone127 @Narsil Hey guys, I've been training my tokenizers using spm, but I'm stuck because I can't figure out how to convert my sentencepiece .model into a Hugging Face tokenizer (preferably a fast tokenizer).
Could you please link me to any resources on how I could do this?
Everything you need is here: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py
There is no simple tutorial; there are many configurations in tokenizers that could achieve what you want, with various tradeoffs.
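Not a tutorial, but as an illustration, one possible route through that module looks roughly like the sketch below. The names are placeholders: `T5Tokenizer` is just an example slow wrapper class, and each converter adds its own special tokens and post-processing, so pick the class whose conversion rules actually match your model.

```python
# Hedged sketch: wrap a trained spm model in an existing slow tokenizer class,
# then build the Rust-backed fast tokenizer from it.
from transformers import PreTrainedTokenizerFast, T5Tokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = T5Tokenizer("my_sentencepiece.model", extra_ids=0)  # slow, spm-backed
backend = convert_slow_tokenizer(slow)                     # a tokenizers.Tokenizer
fast = PreTrainedTokenizerFast(tokenizer_object=backend)
fast.save_pretrained("my-fast-tokenizer")                  # writes tokenizer.json
```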
What I recommend is running a diverse set of UTF-8 strings, plus all the special-token combinations that might be useful, through your test suite to verify that the IDs do match.
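A small sketch of that kind of check, with placeholder paths and sample strings:

```python
# Sketch only: encode the same strings with both tokenizers and compare IDs.
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("path/to/tokenizer", use_fast=False)
fast = AutoTokenizer.from_pretrained("path/to/tokenizer", use_fast=True)

samples = [
    "Hello, world!",
    "Ünïcödé, 日本語, emoji 🤗",
    "mixed   whitespace\tand\nnewlines",
    "<s> special tokens in context </s>",  # swap in your model's special tokens
]

for text in samples:
    assert slow.encode(text) == fast.encode(text), f"ID mismatch for {text!r}"
print("fast and slow tokenizers agree on all samples")
```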
Hello @ahmedlone127, I have exactly the same need to get the original SentencePiece tokenizer.model from tokenizer.json. Would you mind resharing your notebook? The file no longer exists at this link. Much appreciated, thanks!