
convert fast tokenizers to slow

ahmedlone127 opened this issue

Feature request

I recently noticed that many models are now being uploaded with only their fast tokenizers, and the SentencePiece model (which the slow version includes) is missing. I need the SentencePiece model of some tokenizers for a personal project and wanted to know the best way to go about that. Looking through the current code in the repository, I saw there are a lot of methods for handling conversion from slow to fast tokenizers, so I think the other direction should be possible too. After a bit of research, the only quick-and-dirty way I could think of was a utility script that converts a fast tokenizer's JSON files into the SPM model format of a slow tokenizer, since both should contain the same information, so the mechanics should be similar too.

Motivation

I looked through the tokenizers being uploaded and saw that most of them don't ship a slow version.

Your contribution

If there is any way I can help, I would love to know. I just need some guidance on how to implement this!

ahmedlone127 avatar Jan 24 '23 20:01 ahmedlone127

I don't think it's possible to get the sentencepiece model from the tokenizer.json file but maybe @Narsil knows a way.

sgugger avatar Jan 24 '23 21:01 sgugger

Hey @Narsil, can you please give some insight on this?

ahmedlone127 avatar Jan 30 '23 18:01 ahmedlone127

You could try and create inverse scripts for the conversion you found. But it's not going to be trivial.

You need to create the protobuf sentencepiece expects.
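
For illustration, a rough way to see which fields that protobuf contains is to parse an existing model. This is only a sketch; it assumes a recent `sentencepiece` pip release (which ships the generated `sentencepiece_model_pb2` module), and the file path is a placeholder:

```python
# Sketch: inspect an existing SentencePiece model to see which protobuf fields
# a reconstructed one would need. Assumes a recent `sentencepiece` release that
# ships the generated `sentencepiece_model_pb2` module; the file path is a placeholder.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

proto = sp_pb2.ModelProto()
with open("existing_spm.model", "rb") as f:
    proto.ParseFromString(f.read())

print(proto.trainer_spec)          # model_type, vocab_size, unk_id, ...
print(proto.normalizer_spec.name)  # normalization rule; precompiled_charsmap is binary
print(proto.pieces[0])             # each vocab entry: piece, score, type
print(len(proto.pieces))           # vocabulary size
```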

Not sure I can provide much more guidance.

Why do you want slow tokenizers if I may ask?

Narsil avatar Jan 30 '23 18:01 Narsil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 24 '23 15:02 github-actions[bot]

Hey @Narsil, thanks for the reply, but I found a fix for my issue :)

ahmedlone127 avatar Feb 24 '23 15:02 ahmedlone127

Awesome. Do you mind explaining a little more or giving links for potential readers that would want to do the same?

Narsil avatar Feb 24 '23 15:02 Narsil

For sure!

I noticed that you have code for converting an SPM model (a slow tokenizer) to a tokenizer.json (fast tokenizer). I also noticed that for some models the SPM model was not uploaded, even though the tokenizer is SPM-based. To get the SPM model back from the uploaded tokenizer.json, I had to figure out how to manually create an SPM model containing the same information that's stored in the tokenizer.json.

For example, I had to copy the vocabulary, precompiled_charsmap, and special tokens, and manually edit a blank SPM file (it already had the correct structure, plus some dummy data that I removed while editing). Once all the information was copied over to the SPM file, it worked as expected.
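
Roughly, the idea looks like the sketch below (not the exact notebook code; it assumes a Unigram-style tokenizer.json, a single `Precompiled` normalizer, and a recent `sentencepiece` release that ships the generated protobuf module; other `trainer_spec`/`normalizer_spec` fields and special-token types still have to be filled in by hand, e.g. from a donor SPM file):

```python
# Simplified sketch: rebuild a SentencePiece .model from a Unigram-style tokenizer.json.
# Field names follow sentencepiece_model.proto; many fields are left at their defaults.
import base64
import json

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

with open("tokenizer.json", encoding="utf-8") as f:
    fast_tok = json.load(f)

proto = sp_pb2.ModelProto()
proto.trainer_spec.model_type = sp_pb2.TrainerSpec.UNIGRAM
proto.trainer_spec.unk_id = fast_tok["model"].get("unk_id") or 0

# Copy the vocabulary: in a Unigram tokenizer.json each entry is [piece, score].
for piece_str, score in fast_tok["model"]["vocab"]:
    piece = proto.pieces.add()
    piece.piece = piece_str
    piece.score = float(score)

# Copy the precompiled charsmap (base64-encoded in tokenizer.json) for the simple
# case of a single Precompiled normalizer; Sequence normalizers need extra handling.
normalizer = fast_tok.get("normalizer") or {}
charsmap = normalizer.get("precompiled_charsmap")
if charsmap:
    proto.normalizer_spec.precompiled_charsmap = base64.b64decode(charsmap)

with open("reconstructed.model", "wb") as f:
    f.write(proto.SerializeToString())
```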

Here is a notebook demonstrating the process:

https://colab.research.google.com/drive/1kfC_iEuU0upVQ5Y3rnnl5VSngSPuiSQI?usp=sharing

ahmedlone127 avatar Feb 24 '23 21:02 ahmedlone127

@ahmedlone127 @Narsil Hey guys, I've been training my tokenizers using SPM. However, I'm stuck: I can't figure out how to convert my sentencepiece.model to a Hugging Face tokenizer (preferably a fast tokenizer).

Could you please link me to any resources on how I could do this?

StephennFernandes avatar Nov 23 '23 22:11 StephennFernandes

Everything you need is here: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py

There is no simple tutorial; there are many configurations in tokenizers that could achieve what you want, with various tradeoffs. What I recommend is running a diverse set of UTF-8 strings, plus all the special-token combinations that might be useful, through your test suite to verify the IDs do match.
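
For example, a rough check along those lines might look like this (the T5 classes and the "spm.model" path are placeholders; use whichever SentencePiece-based tokenizer class matches your model; the fast class converts through convert_slow_tokenizer.py internally when only a vocab_file is given):

```python
# Rough sanity check: compare slow (SentencePiece-backed) and fast tokenizer IDs
# on a mix of scripts, whitespace patterns, and special tokens.
# T5Tokenizer / T5TokenizerFast and "spm.model" are placeholders for your own setup.
from transformers import T5Tokenizer, T5TokenizerFast

slow = T5Tokenizer(vocab_file="spm.model")
fast = T5TokenizerFast(vocab_file="spm.model")  # converted from the slow tokenizer internally

samples = [
    "Hello world!",
    "  leading and   repeated   spaces ",
    "Héllo wörld, 你好世界, こんにちは",
    "emoji 🤗 and mixed 123 numbers",
    f"{slow.eos_token} text wrapped in special tokens {slow.eos_token}",
]

for text in samples:
    slow_ids = slow.encode(text)
    fast_ids = fast.encode(text)
    if slow_ids == fast_ids:
        print(f"OK       {text!r}")
    else:
        print(f"MISMATCH {text!r}")
        print("  slow:", slow_ids)
        print("  fast:", fast_ids)
```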

Narsil avatar Nov 27 '23 10:11 Narsil

Hello @ahmedlone127, I have the exact same need to get the original SentencePiece tokenizer.model from tokenizer.json. Would you mind resharing your notebook, please? The file no longer exists at that link. Much appreciated, thanks!

Derekglk avatar Mar 08 '24 07:03 Derekglk