
Rust: How to handle models with `precompiled_charsmap = null`

Open kallebysantos opened this issue 1 year ago • 5 comments

Hi guys, I'm currently working on https://github.com/supabase/edge-runtime/pull/368, which aims to add a Rust implementation of pipeline().

While coding the translation task, I found that I can't load a Tokenizer instance for the Xenova/opus-mt-en-fr ONNX model and the other opus-mt-* variants.

I got the following panic when loading it:

let tokenizer_path = Path::new("opus-mt-en-fr/tokenizer.json");
let tokenizer = Tokenizer::from_file(tokenizer_path).unwrap();
thread 'main' panicked at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/std/src/panicking.rs:662:5
   1: core::panicking::panic_fmt
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/panicking.rs:74:14
   2: core::result::unwrap_failed
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::expect
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1059:23
   4: <tokenizers::normalizers::NormalizerWrapper as serde::de::Deserialize>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:139:25
   5: <serde::de::impls::OptionVisitor<T> as serde::de::Visitor>::visit_some
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:916:9
   6: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_option
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1672:18
   7: serde::de::impls::<impl serde::de::Deserialize for core::option::Option<T>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:935:9
   8: <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:801:9
   9: <serde_json::de::MapAccess<R> as serde::de::MapAccess>::next_value_seed
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2008:9
  10: serde::de::MapAccess::next_value
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:1874:9
  11: <tokenizers::tokenizer::serialization::TokenizerVisitor<M,N,PT,PP,D> as serde::de::Visitor>::visit_map
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:132:55
  12: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1840:31
  13: tokenizers::tokenizer::serialization::<impl serde::de::Deserialize for tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:62:9
  14: <tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize::__Visitor as serde::de::Visitor>::visit_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  15: <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_newtype_struct
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1723:9
  16: tokenizers::tokenizer::_::<impl serde::de::Deserialize for tokenizers::tokenizer::Tokenizer>::deserialize
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  17: serde_json::de::from_trait
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2478:22
  18: serde_json::de::from_str
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2679:5
  19: tokenizers::tokenizer::Tokenizer::from_file
             at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:439:25
  20: transformers_rs::pipeline::tasks::seq_to_seq::seq_to_seq
             at ./src/pipeline/tasks/seq_to_seq.rs:51:21
  21: app::main
             at ./examples/app/src/main.rs:78:5
  22: core::ops::function::FnOnce::call_once
             at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/ops/function.rs:250:5

I know that it occurs because their tokenizer.json file contains the following:

opus-mt-en-fr:

"normalizer": {
    "type": "Precompiled",
    "precompiled_charsmap": null
}

Whereas the expected format should be something like this:

nllb-200-distilled-600M:

"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Precompiled",
      "precompiled_charsmap": "ALQCAACEAAA..."
    }
  ]
}

Looking at the original Helsinki-NLP/opus-mt-en-fr repository, I noticed that there is no tokenizer.json file for it.

I would like to know: is precompiled_charsmap required to be non-null?

Maybe it could be handled as Option<_>?

Is there some workaround to run these models without changing the internal model files? How can I handle an exported ONNX model that doesn't have a tokenizer.json file?
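One workaround (a sketch, not part of the tokenizers API) is to patch the tokenizer.json before loading it, dropping any "Precompiled" normalizer whose charsmap is null. The helper name `strip_null_precompiled` is hypothetical; the config shapes it handles are the two shown above:

```python
import json

# Hypothetical pre-processing helper (not part of the tokenizers library):
# remove any "Precompiled" normalizer whose precompiled_charsmap is null,
# either at the top level or inside a "Sequence", so the remaining config
# deserializes without hitting the panic.
def strip_null_precompiled(config: dict) -> dict:
    def is_null_precompiled(n: dict) -> bool:
        return (n.get("type") == "Precompiled"
                and n.get("precompiled_charsmap") is None)

    norm = config.get("normalizer")
    if norm is None:
        return config
    if is_null_precompiled(norm):
        # The normalizer carries no data anyway, so drop it entirely.
        config["normalizer"] = None
    elif norm.get("type") == "Sequence":
        norm["normalizers"] = [
            n for n in norm.get("normalizers", [])
            if not is_null_precompiled(n)
        ]
    return config

# The failing opus-mt-en-fr shape from above:
broken = json.loads(
    '{"normalizer": {"type": "Precompiled", "precompiled_charsmap": null}}'
)
patched = strip_null_precompiled(broken)
print(patched)  # → {'normalizer': None}
```

The patched JSON can then be written back to disk (or kept in memory) and handed to the usual loader. Whether dropping the normalizer changes tokenization output for these models is untested here; since the charsmap is null, there is presumably no normalization data to lose.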

kallebysantos avatar Sep 04 '24 08:09 kallebysantos

I'm seeing the same error with Python when trying to read the tokenizer from Xenova/speecht5_tts.

wget https://huggingface.co/Xenova/speecht5_tts/resolve/main/tokenizer.json
from tokenizers import Tokenizer

Tokenizer.from_file("tokenizer.json")
thread '<unnamed>' panicked at /Users/runner/work/tokenizers/tokenizers/tokenizers/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
pyo3_runtime.PanicException: Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)

With Tokenizers 0.19.0, this raised an error which could be handled rather than a panic. It looks like this may be related to #1604.

ankane avatar Sep 18 '24 01:09 ankane

I'm also facing the same issue (#1645) with speecht5_tts.

vicantwin avatar Oct 05 '24 20:10 vicantwin

I think passing a "" might work. cc @xenova not sure why you end up with nulls there, but we can probably sync and I can add support for Option!

ArthurZucker avatar Oct 06 '24 08:10 ArthurZucker

> I think passing a "" might work. cc @xenova not sure why you end up with nulls there, but we can probably sync and I can add support for Option!

The Xenova (transformers.js) implementation doesn't read the value directly; it iterates over the configured normalizers, so I think it just ignores the null values.

I agree with you, adding support for Option<_> may solve it.

kallebysantos avatar Oct 06 '24 08:10 kallebysantos

I've implemented spm_precompiled with null support at vicantwin/spm_precompiled, including a test for the null case, and all tests pass.

But I need some help with changing this repository, as I'm not entirely familiar with this codebase and am unsure how to implement the necessary changes. Any help would be greatly appreciated.

vicantwin avatar Oct 06 '24 15:10 vicantwin