Tokenizer throwing PanicException
```
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Internal', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:203:53
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Preparing data...
Training tokenizer...
Traceback (most recent call last):
  File "tokenization_2.py", line 26, in <module>
    tokenizer.train([args.file], trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Internal
```
Hi @tanmaylaud,
Can you provide a script that triggered the error maybe? Or some more context? Without it it's a bit hard to help. cheers
@Narsil It is a pretty simple script. I am just passing a text file to the tokenizer.train function. The tokenizer I am using is Unigram. The text file has 25 million rows. Let me know if you need more information
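For reference, a minimal sketch of the setup being described, assuming default components everywhere else; the file name and trainer options are placeholders, and the poster's full script appears later in the thread:

```python
# A minimal sketch (not the poster's exact script) of the setup described
# above: a Unigram model trained straight from a large text file.
# "large_corpus.txt" and the trainer options are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(unk_token="[UNK]", vocab_size=50000)

# With ~25M lines in the file, this is the call where the PanicException
# surfaces (the panic itself comes from the Rust/C++ suffix-array code).
tokenizer.train(["large_corpus.txt"], trainer)
```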
The error seems to be located in the esaxx (suffix array) call, which is most likely a C++ error (the code was taken directly from sentencepiece for this one).
Could you add `RUST_BACKTRACE=1` and trigger the error again? I expect the error to be `SuffixError::Internal`.
If that's the case, I wouldn't have an easy solution.
It could be an int32 overflow (that would require a special build of the library to circumvent, but it's doable).
It could also be some other memory error in the C++ code, which I wouldn't be able to debug nicely. Using `esaxx_rs::suffix_rs` (the slower pure-Rust version) instead of `esaxx_rs::suffix` (C++) might yield better information about what's causing this bug.
Here is the trace with `RUST_BACKTRACE=1`:
```
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Internal', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:203:53
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
   2: core::option::expect_none_failed
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1329:5
   3: tokenizers::models::unigram::trainer::UnigramTrainer::do_train
   4: <tokenizers::models::TrainerWrapper as tokenizers::tokenizer::Trainer>::train
   5: <tokenizers::trainers::PyTrainer as tokenizers::tokenizer::Trainer>::train
   6: tokenizers::utils::iter::ResultShunt<I,E>::process
   7: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
   8: pyo3::python::Python::allow_threads
   9: tokenizers::tokenizer::PyTokenizer::train
  10: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap::{{closure}}
  11: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap
  12: <unknown>
  13: _PyEval_EvalFrameDefault
  14: _PyEval_EvalCodeWithName
  15: PyEval_EvalCode
  16: <unknown>
  17: <unknown>
  18: PyRun_FileExFlags
  19: PyRun_SimpleFileExFlags
  20: Py_RunMain
  21: Py_BytesMain
  22: __libc_start_main
  23: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Preparing data...
Training tokenizer...
Traceback (most recent call last):
  File "tokenization_2.py", line 26, in <module>
    tokenizer.train([args.paranmt_file], trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Internal
```
OK, it's what I said: the `Internal` error means the error occurs within the C++ code.
Can you build from source?
```
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install -e .
python tokenization_2.py
```
That should trigger the same error. Then you can open `tokenizers/src/models/unigram/trainer.rs` at line 203 and replace `suffix` with `suffix_rs`,
then rebuild and rerun:
```
pip install -e .
python tokenization_2.py
```
Other than that, you could also try to provide sufficient information so we could reproduce the bug.
Cheers.
I expect this error to be linked to an overflowing i32; unfortunately, we don't support u64 in tokenizers. If you are able to rebuild, you could look into making a branch where every i32 becomes i64 or u64. It's probably going to be a bit tedious, but it's doable.
@Narsil, so should I do the rebuild with the steps you mentioned previously, or the ones in the latest comment? If it's the latest one, where should I make the change for the overflow?
The first one is to make sure that's actually the case; the second would be to solve it.
My first priority would be to confirm the intuition is correct, and only then do the rebuild. I'm mentioning it ahead of time so we don't have to go back and forth.
If you can provide something reproducible I would happily take care of it too btw.
@Narsil, have you tested the tokenizer trainer on a really large dataset? Consider a dataset of roughly >25M training examples. I have done nothing special, just passed a large dataset to the trainer. For example, consider this dataset: https://drive.google.com/file/d/1rbF3daJjCsa1-fu2GANeJd2FBXos1ugD/view?usp=sharing
The error should be reproducible with this. I tried converting all u32 to u64, but I'm getting many errors.
Do you mind also sharing your tokenizer config? (pre_tokenizers, normalizers, etc.)
They have a big impact on the numbers, and therefore on the probability of overflow.
Here is the full script:
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram, WordPiece
from tokenizers.trainers import UnigramTrainer, WordPieceTrainer
from tokenizers.normalizers import NFKC
from sacremoses import MosesTokenizer
from transformers import PreTrainedTokenizerFast
import argparse
import fileinput
from tqdm import tqdm


def prepare_data(filename, pretokenizer):
    data = []
    with open(filename, 'r') as f:
        lines = f.readlines()
        for line in tqdm(lines):
            sent1, sent2 = line.strip().lower().split('\t')
            sent1 = pretokenizer.tokenize(sent1)
            sent2 = pretokenizer.tokenize(sent2)
            # print(sentences)
            data.append(" ".join(sent1))
            data.append(" ".join(sent2))
    return data


parser = argparse.ArgumentParser()
parser.add_argument('--paranmt-file')
args = parser.parse_args()

tokenizer = Tokenizer(Unigram())
# tokenizer = Tokenizer(WordPiece())
tokenizer.normalizer = NFKC()
trainer = UnigramTrainer(unk_token='[UNK]', special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=50000)
# trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=50000)

print('Preparing data...')
data = prepare_data(args.paranmt_file, pretokenizer=MosesTokenizer())
print('Training tokenizer...')
tokenizer.train_from_iterator(data, trainer)
print('Saving file...')
tokenizer.save("tokenizer_uni_25M_50k.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer.save_pretrained('./')
```
Is this problem solved? I ran into a similar problem.
It ran successfully on a 50M dataset, but when running the same code on a 5 GB dataset, a similar error is reported.
Do you have a reproducible script ? It sounds like a buffer overflow.
This library only uses u32 for most counting, meaning that large datasets (especially without careful pretokenization) are likely to trigger overflows, which can cause pretty much arbitrary damage.
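As an illustration of what careful pretokenization can look like here, a minimal sketch assuming the `Whitespace` pre-tokenizer and placeholder file/vocab values (not a prescribed fix):

```python
# Sketch: attach a pre-tokenizer so the Unigram trainer counts word-sized
# pieces instead of whole lines. Without one, each entire line is treated as
# a single "word", so the text handed to the suffix array grows with the
# corpus; splitting first keeps it roughly bounded by the set of unique words.
# Whitespace() and the file/vocab values below are illustrative choices.
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import Unigram
from tokenizers.normalizers import NFKC
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = UnigramTrainer(unk_token="[UNK]", vocab_size=50000)
tokenizer.train(["large_corpus.txt"], trainer)
```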
For such large datasets, sentencepiece supports using u64 instead, which should work better (it's then possible to convert the model after the fact).
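A minimal sketch of that sentencepiece route, assuming placeholder file names and vocab size; `train_extremely_large_corpus` is sentencepiece's documented option for increasing the internal bit depth during unigram training on huge corpora:

```python
# Sketch of the sentencepiece fallback for a very large corpus. File names and
# vocab size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="large_corpus.txt",       # one sentence per line
    model_prefix="unigram_large",   # writes unigram_large.model / .vocab
    model_type="unigram",
    vocab_size=50000,
    train_extremely_large_corpus=True,
)
```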
If someone is interested in adding u64 support, help is appreciated.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.