
Exception upon attempting to load a Tokenizer from file

Open joepalermo opened this issue 3 years ago • 31 comments

Hi, I'm attempting to simply serialize and then deserialize a trained tokenizer. When I run the following code:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])
save_to_filepath = 'preprocessing/tokenizer.json'
tokenizer.save(save_to_filepath)
tokenizer = Tokenizer.from_file(save_to_filepath)

I get the following traceback:

Traceback (most recent call last):
...
    tokenizer = Tokenizer.from_file(save_to_filepath)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1 column 5408

joepalermo avatar Dec 16 '20 21:12 joepalermo

Hi @joepalermo, would you mind sharing the resulting tokenizer.json file? It would be very helpful for us to debug this.

n1t0 avatar Jan 06 '21 15:01 n1t0

@n1t0 Thanks for your help.

GitHub isn't letting me attach a .json file to a comment, so I'll just paste the contents of it here:

{"version":"1.0","truncation":null,"padding":null,"added_tokens":[],"normalizer":null,"pre_tokenizer":null,"post_processor":null,"decoder":null,"model":{"dropout":null,"unk_token":null,"continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"vocab":{"\n":0," ":1,"(":2,")":3,"":4,"+":5,",":6,"-":7,".":8,"/":9,"0":10,"1":11,"2":12,"3":13,"4":14,"5":15,"6":16,"7":17,"8":18,"9":19,";":20,"=":21,"?":22,"C":23,"D":24,"F":25,"G":26,"I":27,"L":28,"S":29,"W":30,"a":31,"b":32,"c":33,"d":34,"e":35,"f":36,"g":37,"h":38,"i":39,"j":40,"k":41,"l":42,"m":43,"n":44,"o":45,"p":46,"q":47,"r":48,"s":49,"t":50,"u":51,"v":52,"w":53,"x":54,"y":55,"z":56," -":57,"e ":58,"t ":59," +":60," =":61," + ":62," - ":63,". ":64,";\n":65,"**":66,"Le":67,"Let ":68," = ":69,".;\n":70,"s ":71,"th":72," = -":73,"iv":74,"the ":75,"2":76,"r ":77,"of":78,". Let ":79,"d ":80,"?;\n":81,"at":82,"2":83,"of ":84,"3":85,"de":86,"or ":87,"4":88,"os":89,"pos":90,"(-":91,"5*":92,"Su":93,"ppos":94,"Suppos":95,"is ":96,"n ":97,"be ":98,"nd ":99,"co":100," a":101,"at ":102,"Wh":103,"What ":104,"ul":105," be ":106," - 1":107," + 1":108,"e -":109,"com":110,"3":111,"st ":112,") = ":113,"What is ":114,"ac":115,"act":116," f":117,"So":118,"lv":119,"Solv":120,"al":121,"ive ":122,") = -":123,"ate ":124,"mo":125,"commo":126,"common ":127,"in":128,"0":129,"Suppose ":130,"Cal":131,"cul":132,"Calcul":133,"Calculate ":134,"div":135,"divi":136," for ":137,"What is the ":138,"riv":139,"ative ":140,"deriv":141,"derivative ":142," and ":143,")/":144,"re":145,"or of ":146,"Is ":147,"). ":148,", ":149,"he":150,"im":151,"pr":152,"prim":153,"2 + ":154,"st common ":155,"fact":156,").;\n":157,"Suppose -":158,"Calculate the ":159," - 2":160,"6":161,"prime ":162," = 0":163," + 2":164,"Solve ":165,"2 - ":166,"or":167,", -":168,"derivative of ":169,"4":170,"10":171,"7":172,"ir":173,"y ":174,"r w":175,"d b":176,"ain":177,"main":178,"the prime ":179,"der w":180,"ded b":181,"is divi":182,"remain":183,"factor":184,"the prime factor":185,"der whe":186,"is divided b":187,"remainder whe":188,"the prime factors ":189,"12":190,"remainder when ":191,"the prime factors of ":192,"is divided by ":193,"min":194,"ti":195,"er":196," is divided by ":197,"Solve -":198,") be ":199,") be the ":200," w":201,"). Let ":202,"le ":203,"mul":204,"ple ":205," - 3":206,"tiple ":207,"multiple ":208,"rt ":209,"multiple of ":210,"8":211," + 3":212,"of -":213,"est common ":214,"11":215," a ":216," wrt ":217," - 2":218,"/2":219,". Suppose ":220," + 2":221,"(-2":222,". Is ":223,"9":224,". What is the ":225,"Fi":226,"Find ":227,"(-1":228,")?;\n":229," - 4":230,"/3":231,"derivative of -":232," + 4":233," - 3":234,"5":235,"eco":236,"seco":237,"second ":238," + 3":239,"0 = ":240,"0 = -":241,"Find the ":242," - -":243,"thir":244,"third ":245,"15":246,". Calculate the ":247,"13":248," + 4":249,"sor of ":250,"divisor of ":251," + -":252,"14":253," - 4*":254,"ghe":255,"hi":256,"ghest common ":257,"highest common ":258,". D":259,"no":260,"deno":261,"common deno":262,"minat":263,"common denominat":264,". Suppose -":265,"1*":266,"ar":267,"What ar":268,"What are ":269,"e?;\n":270,"16":271,"ber":272,"mber":273,"nu":274,"What are the prime factors of ":275,"mber?;\n":276,"number?;\n":277,"Li":278,"List ":279},"merges":[" -","e ","t "," +"," ="," + "," - ",". ","; \n","* ","L e","Le t "," = ",". ;\n","s ","t h"," = -","i v","th e ","2 ","r ","o f",". Let ","d ","? 
;\n","a t"," 2","of ","3 ","d e","o r ","4 ","o s","p os","( -","5 ","S u","p pos","Su ppos","i s ","n ","b e ","n d ","c o"," a","a t ","W h","Wh at ","u l"," be "," - 1"," + 1","e -","co m"," 3","s t ",") = ","What is ","a c","ac t"," f","S o","l v","So lv","a l","iv e ",") = -","at e ","m o","com mo","commo n ","i n","0 ","Suppos e ","C al","c ul","Cal cul","Calcul ate ","d iv","div i"," f or ","What is the ","r iv","at ive ","de riv","deriv ative "," a nd ",") /","r e","or of ","I s ",") . ",", ","h e","i m","p r","pr im","2 + ","st common ","f act",") .;\n","Suppos e -","Calculate the "," - 2","6 ","prim e "," = 0"," + 2","Solv e ","2 - ","o r",", -","derivative of "," 4","1 0","7 ","i r","y ","r w","d b","a in","m ain","the prime ","de r w","de d b","is divi","re main","fact or","the prime factor","der w he","is divi ded b","remain der whe","the prime factor s ","1 2","remainder whe n ","the prime factors of ","is divided b y ","m in","t i","e r"," is divided by ","Solv e -",") be ",") be the "," w",") . Let ","l e ","m ul","p le "," - 3","ti ple ","mul tiple ","r t ","multiple of ","8 "," + 3","of -","e st common ","1 1"," a "," w rt "," - 2","/ 2",". Suppose "," + 2","(- 2",". Is ","9 ",". What is the ","F i","Fi nd ","(- 1",") ?;\n"," - 4","/ 3","derivative of -"," + 4"," - 3"," 5","e co","s eco","seco nd "," + 3","0 = ","0 = -","Find the "," - -","th ir","thir d ","1 5",". Calculate the ","1 3"," + 4","s or of ","divi sor of "," + -","1 4"," - 4","g he","h i","ghe st common ","hi ghest common ",". D","n o","de no","common deno","min at","common deno minat",". Suppose -","1 *","a r","What ar","What ar e ","e ?;\n","1 6","b er","m ber","n u","What are the prime factors of ","mber ?;\n","nu mber?;\n","L i","Li st "]}}

joepalermo avatar Jan 19 '21 21:01 joepalermo

This is really confusing because I don't think I'm doing anything unusual.

Also note that I tried unpickling the tokenizer object and it gives a similar error: Exception: Error while attempting to unpickle Tokenizer: data did not match any variant of untagged enum ModelWrapper at line 1 column 5304

joepalermo avatar Jan 19 '21 21:01 joepalermo

I've had the same issue. Try adding a pre_tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])
save_to_filepath = 'preprocessing/tokenizer.json'
tokenizer.save(save_to_filepath)
tokenizer = Tokenizer.from_file(save_to_filepath)

lukas-blecher avatar Jan 23 '21 17:01 lukas-blecher

Any update on this problem? I've had the same issue.

Hustcw avatar Apr 15 '21 07:04 Hustcw

Have you tried the solution proposed by @lukas-blecher to use a pre-tokenizer?

I believe this issue is related to this one: https://github.com/huggingface/tokenizers/issues/645

n1t0 avatar Apr 15 '21 19:04 n1t0

Have you tried the solution proposed by @lukas-blecher to use a pre-tokenizer?

I believe this issue is related to this one: #645

Yes, I've used a pre-tokenizer. I found that this problem is caused by merges containing more than one space, as mentioned in #645.

Hustcw avatar Apr 17 '21 11:04 Hustcw

Having the same problem. I already have a pre-tokenizer added.

ejohb avatar Nov 30 '21 20:11 ejohb

Having the same problem. I already have a pre-tokenizer added.

After some fiddling, I found the problem occurs only when I remove pre_tokenizers.Whitespace() and add pre_tokenizers.Split(pattern='\w+|[^\w\s]+', behavior='isolated') in its place.

ejohb avatar Dec 01 '21 14:12 ejohb

In case this might be of help to others: I was getting this error when using the SentenceTransformers library, and in my case upgrading tokenizers to version 0.10.3 fixed the issue:

pip install tokenizers==0.10.3

If anyone is getting this error, I recommend also taking a look at the dependency requirements (e.g., which version of the tokenizers library is required).
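
In the same spirit, a quick sanity check (not from the original comment) is to print which tokenizers version is actually installed in the environment before and after upgrading:

import tokenizers
print(tokenizers.__version__)  # e.g. '0.10.3' after the upgrade above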

ruitedk6 avatar Jan 18 '22 13:01 ruitedk6

Yes, @ejohb is right. The problem occurs when using pre_tokenizers.Split() :/

duskybomb avatar May 02 '22 06:05 duskybomb

@duskybomb Does the problem still exist on the latest 0.12.1? I can't seem to reproduce it.

Narsil avatar May 02 '22 08:05 Narsil

@Narsil yes, it is still there in 0.12.1. The error I got when trying to load: Exception: data did not match any variant of untagged enum ModelWrapper at line 59999 column 3. This is the pre-tokenizer I was using: tokenizer.pre_tokenizer = Split(pattern="<BREAK>", behavior="removed")

Also, I am not sure if this is desired or not, but the vocab had <BREAK> merged with tokens despite using the removed behavior, e.g. <BREAK>small<BREAK>, with small being the actual token.
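
As a side note (not from the original comment), a minimal sketch to inspect what that pre-tokenizer does to a string, using the same <BREAK> separator, would be:

from tokenizers.pre_tokenizers import Split

# "removed" should drop the matched separator entirely during pre-tokenization.
pre = Split(pattern="<BREAK>", behavior="removed")
print(pre.pre_tokenize_str("big<BREAK>small<BREAK>tiny"))
# Expected: only the pieces remain, with offsets into the original string,
# e.g. [('big', (0, 3)), ('small', (10, 15)), ('tiny', (22, 26))]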

duskybomb avatar May 02 '22 09:05 duskybomb

Do you have a simple reproducible script? Here is the script I tried to use to reproduce, but it seems to work properly:

from tokenizers import trainers, models, Tokenizer, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    special_tokens=["<unk>", "<pad>", "<sep>"],
    vocab_size=8000,
)
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=r"\w+|[^\w\s]+", behavior="isolated")
tokenizer.add_special_tokens(["<sep>"])
tokenizer.add_tokens(["<sep>"])


def iterator_over_seqs():
    with open("data/big.txt", "r") as f:
        for line in f:
            yield "ABCEFGH"


tokenizer.train_from_iterator(iterator=iterator_over_seqs(), trainer=trainer)
tokenizer.save("tok.json", pretty=True)
encoded = tokenizer.encode("ABCD<sep>EFGH")
tok = Tokenizer.from_file("tok.json")  # This is what is supposed to fail no ? It doesn't here.
print(encoded.ids)

Narsil avatar May 02 '22 09:05 Narsil

I also encountered the same problem. The JSON file is attached as follows; please rename the .txt extension back to .json: tokenizer-wiki.txt

yechong316 avatar Jul 29 '22 15:07 yechong316

hi @yechong316 ,

It seems your file contains merges that are not acceptable in the currently deployed version of tokenizers.

Those merges contain multiple spaces, "e s " for instance (line 9499). Creating such merges should not be doable within the library, hence the limitation. So it's expected if you created the merges yourself in some manner.

  • If it was done within the library, a reproducible script would be super helpful so we can reproduce and fix it.
  • In general this is not a limitation of the underlying BPE model but really a self-imposed limitation within the library. We can definitely lift this limitation (https://github.com/huggingface/tokenizers/pull/909 if you want to try it out, but it will require rewriting the merges in a different way). It's not currently merged, as changing anything regarding serialization requires a great deal of care to make sure we're not breaking anything in a backward-incompatible way. But if there's enough attention for this feature, it can definitely be added!
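
As a rough aid (not part of the original reply), a short script along these lines can flag such entries, assuming the file stores merges in the legacy string form where each merge is two tokens joined by a single space:

import json

with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

# A well-formed legacy merge splits into exactly two pieces on " ";
# anything else means one of the tokens itself contains a space.
for i, merge in enumerate(data["model"]["merges"]):
    if isinstance(merge, str) and len(merge.split(" ")) != 2:
        print(f"suspicious merge at index {i}: {merge!r}")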

Narsil avatar Aug 01 '22 09:08 Narsil

Just to complement Narsil's point: there are several "whitespace characters" usable in the tokenizer file, e.g. "Ġ" (Unicode: ord("Ġ") == 288), which in turn can be used in the merges.

Also, in case you removed some of your vocab entries, be sure all merges can still be resolved; if some can't be resolved after altering the vocab, the same error is thrown.
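
For illustration (not from the original comment), here is a small sketch of that convention via the ByteLevel pre-tokenizer, which maps spaces to "Ġ" so that merges never need to contain a literal space:

from tokenizers import pre_tokenizers

print(ord("Ġ"))  # 288 (U+0120), the byte-level stand-in for a leading space

pre = pre_tokenizers.ByteLevel(add_prefix_space=False)
print(pre.pre_tokenize_str("hello world"))
# e.g. [('hello', (0, 5)), ('Ġworld', (5, 11))]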

bashFish avatar Nov 09 '22 09:11 bashFish

Hi, I'm running into the same issue. However, I explicitly want to have multiple whitespaces in my merges. Could someone point me in the right direction on how I could do this?

nihirv avatar Nov 22 '22 20:11 nihirv

This is still an issue in 0.13.2.

To reproduce:

from tokenizers import Tokenizer, models, trainers

bpe_model = models.BPE(unk_token="[UNK]")
tokenizer = Tokenizer(model=bpe_model)
tokenizer.train_from_iterator(
    iterator=["test~ing lick~ing kick~ing"],
    trainer=trainers.BpeTrainer(),
)

path = "my_tokenizer.json"
tokenizer.save(path)

tok_loaded = Tokenizer.from_file(path)

In this particular case, tokenizer.pre_tokenizer = Whitespace() is a workaround.
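
For completeness, the workaround applied to the repro above would look roughly like this (assuming the same imports, plus Whitespace):

from tokenizers.pre_tokenizers import Whitespace

# Setting a Whitespace pre-tokenizer before training prevents any token
# (and therefore any merge) from containing a space.
tokenizer.pre_tokenizer = Whitespace()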

davidgilbertson avatar Mar 07 '23 00:03 davidgilbertson

Have you checked out the PR that fixes it? https://github.com/huggingface/tokenizers/pull/909

It's not going to be merged anytime soon, since it changes the on-disk format of the tokenizer, so we need a compelling reason to go through the pain of making this change.

If any model that requires it gets merged into transformers, for instance, that would be a very valid reason!

In the meantime, the PR should work.

Narsil avatar Mar 07 '23 08:03 Narsil

Hi @Narsil: I think I have a very weird issue, which seems similar to the error stack trace above in this issue. Here is how it goes:

  1. I trained a custom XLMRobertaTokenizerFast from scratch on my multilingual corpus. Note that I trained it with transformers 4.26.0 in a Python 3.7 conda environment on a different EC2 instance. After training, I loaded it in a separate script using XLMRobertaTokenizerFast.from_pretrained() and it worked fine without any errors.
  2. A few days later, due to certain reasons, I had to change my instance. The new instance doesn't have Python 3.7, only Python 3.6, and the latest transformers version supported on Python 3.6 is 4.18.0, which is what is installed there. Now the same saved tokenizer, which loaded perfectly with 4.26.0 as mentioned above, fails when loaded with the same function: XLMRobertaTokenizerFast.from_pretrained(). I also tried transformers==4.2.1 just to double-check whether it was a bug in 4.26.0. The error stack trace on both tried transformers versions on Python 3.6 is as below:
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 59 column 3

Is this expected? Are tokenizers supposed to be incompatible across different transformers library versions? Installing Python 3.7 from scratch isn't trivial on this instance, so I'd really appreciate any possible workaround. While training the tokenizer I didn't do anything extravagant: I initialised a SentencePieceBPETokenizer() and just trained it from scratch by invoking .train() on my corpus.

Strangely, the model trained on the Python 3.7 instance loads perfectly on the Python 3.6 instance, so the issue is only with the tokenizer.

@Narsil, requesting your help on the above. I can't post the tokenizer itself here due to confidentiality reasons, but if you need any other info from me to help with this, please feel free to ask.

ashutoshsaboo avatar Mar 07 '23 16:03 ashutoshsaboo

Can you check your tokenizers versions? I think they are probably not the same major version.

tokenizers is designed to be backwards compatible, but you're talking here about forward compatibility (an artefact created with a newer version working on an older version).

I can't tell you exactly what's going on, but the pre_tokenizer in the JSON file cannot be read by your older version. We did change the layout at some point, but again, in a backward-compatible fashion (older JSON files are still read, but newer ones are what get written to disk).

It's probably not too hard to modify the file from the 3.7 version so it can be loaded in your 3.6 environment. Just train a dummy model in the same fashion with the old version and look at how it's saved on disk. Can you make your file look exactly the same? I'm not sure; it depends on the options you chose and whether they were only implemented later.

Have you tried using pyenv? It's usually pretty good at installing different Python versions on most systems (not sure it works in your case).

Does that make sense?

If you happen to modify a JSON manually, please double-check the output of the tokenizer afterwards; it's easy to introduce subtle bugs without realizing.

Narsil avatar Mar 07 '23 16:03 Narsil

Woohoo, editing the JSON worked! :D Many thanks! @Narsil, as a suggestion: should these forward-compatibility changes across tokenizers versions be documented somewhere more specifically, so they are easily accessible?

FYI: I just had to add "str_rep": "▁" to both the decoder and pre_tokenizer keys of the tokenizer.json trained on Python 3.7 to get it working with the 3.6 version.
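
For anyone hitting the same thing, a minimal sketch of that manual patch is below. The exact shape of the pre_tokenizer/decoder sections depends on your tokenizer (these are assumed to be Metaspace-style components), so treat this as a starting point and verify the tokenizer output afterwards, as Narsil suggests:

import json

path = "tokenizer.json"
with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Assumption: the older version expects an explicit "str_rep" field on
# both the pre_tokenizer and decoder entries.
for key in ("pre_tokenizer", "decoder"):
    section = data.get(key)
    if isinstance(section, dict) and "str_rep" not in section:
        section["str_rep"] = "▁"

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)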

ashutoshsaboo avatar Mar 07 '23 17:03 ashutoshsaboo

should these forward-compatibility changes across tokenizers versions be documented somewhere more specifically, so they are easily accessible?

There's a changelog plus the releases page: https://github.com/huggingface/tokenizers/releases?page=2. That should be enough (though not necessarily easily discoverable).

Please triple check the output ids before claiming victory :)

Narsil avatar Mar 07 '23 17:03 Narsil

Sorry, what do you mean by output ids? That the output ids of a tokenised sentence on the Python 3.7 instance and on the Python 3.6 instance should be asserted equal; is that what you mean?

ashutoshsaboo avatar Mar 07 '23 18:03 ashutoshsaboo

I mean that the encodings should be exactly the same on a large enough subset of text (tokenizer.encode(mystring)).
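
One way to do that check across the two environments, sketched below under the assumption that the same tokenizer.json and sample texts are available in both: dump the ids to a file in each environment and diff the files (the sample strings and file name are placeholders):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

samples = [
    "Hello world!",
    "A larger, representative subset of the real corpus goes here.",
]

# Write one line of space-separated token ids per sample, then diff the
# resulting files from the two environments.
with open("encoded_ids.txt", "w", encoding="utf-8") as f:
    for text in samples:
        f.write(" ".join(map(str, tokenizer.encode(text).ids)) + "\n")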

Narsil avatar Mar 08 '23 10:03 Narsil

I am having this problem. Here is the reproducible script:

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split

# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
t = """First Citizen:
Before we proceed any further, hear me speak.

..."""

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000, min_frequency=2)
tokenizer.pre_tokenizer = Split(r"\w+|[^\w\s]+", behavior="isolated")

tokenizer.train_from_iterator(
    iterator=[t],
    trainer=trainer,
)

tokenizer.save("tokenizer.json")

It works fine if I use the trained tokenizer directly (not loading from the file):

print(tokenizer.encode("""especially       against Caius Marcius?

All:
Against""").tokens)

Output: ['es', 'p', 'ec', 'i', 'all', 'y ', ' ', ' ', ' ', ' ', ' ', ' a', 'gainst ', 'Caius Marc', 'i', 'us', '?\n\nAll:\n', 'A', 'gain', 'st']

But loading the tokenizer from the file fails.

tokenizer = Tokenizer.from_file("tokenizer.json")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[88], line 1
----> 1 tokenizer = Tokenizer.from_file("tokenizer.json")

Exception: data did not match any variant of untagged enum ModelWrapper at line 382 column 3

Version: tokenizers==0.13.3

delgermurun avatar Jul 16 '23 09:07 delgermurun

Can you open a new issue please?

It's not really good practice to resurrect old threads, as it pollutes searches with potentially irrelevant content and makes your issue, which is likely a new bug, less discoverable for others. (Of course it's good to search beforehand to prevent duplicates, but when the thread is super old or closed, you can most likely create a new thread and link the old one you found, in case we want to merge them.)

Narsil avatar Jul 17 '23 19:07 Narsil

OK, I looked at this issue (I will copy it into a new issue once there is one).

The error is caused by the current tokenizer format, which expects the tokens in the merges part of the file to not contain any spaces themselves. There's a very old draft PR, https://github.com/huggingface/tokenizers/pull/909, that I made which can unlock that use case.

This wasn't implemented at the time because changing the format is a pretty risky change for backward compatibility, and there didn't seem to be any real-world use case.
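
As a tiny illustration (not from the comment), using the 'Caius Marc' token from the repro above, here is why a token containing a space cannot round-trip through that format:

# Each merge is serialized as the two tokens joined by a single space.
merge = "Caius Marc ius"   # intended pair: ("Caius Marc", "ius")
print(merge.split(" "))    # ['Caius', 'Marc', 'ius'] -> the pair is ambiguous, so loading fails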

Narsil avatar Jul 17 '23 20:07 Narsil

I had the same error when loading Llama 2 models. Upgrading to transformers==4.33.2 and tokenizers==0.13.3 solved it for me.

mpjanus avatar Sep 21 '23 12:09 mpjanus