Tokenizer throwing PanicException
```
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Internal', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:203:53
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Preparing data...
Training tokenizer...
Traceback (most recent call last):
  File "tokenization_2.py", line 26, in <module>
    tokenizer.train([args.file], trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Internal
```
Hi @tanmaylaud,
Can you provide a script that triggered the error maybe? Or some more context? Without it it's a bit hard to help. cheers
@Narsil It is a pretty simple script. I am just passing a text file to the tokenizer.train function. The tokenizer I am using is Unigram. The text file has 25 million rows. Let me know if you need more information
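For reference, a minimal sketch of the setup being described, assuming default components everywhere else; the file name and trainer options are placeholders, and the poster's full script appears later in the thread:

```python
# A minimal sketch (not the poster's exact script) of the setup described
# above: a Unigram model trained straight from a large text file.
# "large_corpus.txt" and the trainer options are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(unk_token="[UNK]", vocab_size=50000)

# With ~25M lines in the file, this is the call where the PanicException
# surfaces (the panic itself comes from the Rust/C++ suffix-array code).
tokenizer.train(["large_corpus.txt"], trainer)
```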
The error seems to be located in the esaxx (suffix array) call, which is most likely a C++ error (the code was taken directly from sentencepiece for this one).
Could you add `RUST_BACKTRACE=1` and trigger the error again? I expect the error to be `SuffixError::Internal`.
If that's the case, I wouldn't have an easy solution.
It could be an int32 overflow (that would require a special build of the library to circumvent, but it's doable).
It could also be some other memory error in the C++ code, which I wouldn't be able to debug nicely. Using `esaxx_rs::suffix_rs` (the slower pure-Rust version) instead of `esaxx_rs::suffix` (C++) might yield better information about what's causing this bug.
Here is the trace with `RUST_BACKTRACE=1`:
```
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Internal', /__w/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:203:53
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
   2: core::option::expect_none_failed
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1329:5
   3: tokenizers::models::unigram::trainer::UnigramTrainer::do_train
   4: <tokenizers::models::TrainerWrapper as tokenizers::tokenizer::Trainer>::train
   5: <tokenizers::trainers::PyTrainer as tokenizers::tokenizer::Trainer>::train
   6: tokenizers::utils::iter::ResultShunt<I,E>::process
   7: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
   8: pyo3::python::Python::allow_threads
   9: tokenizers::tokenizer::PyTokenizer::train
  10: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap::{{closure}}
  11: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap
  12: <unknown>
  13: _PyEval_EvalFrameDefault
  14: _PyEval_EvalCodeWithName
  15: PyEval_EvalCode
  16: <unknown>
  17: <unknown>
  18: PyRun_FileExFlags
  19: PyRun_SimpleFileExFlags
  20: Py_RunMain
  21: Py_BytesMain
  22: __libc_start_main
  23: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Preparing data...
Training tokenizer...
Traceback (most recent call last):
  File "tokenization_2.py", line 26, in <module>
    tokenizer.train([args.paranmt_file], trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Internal
```
OK, it's what I said: the `Internal` error means the error occurs within the C++ code.
Can you build from source?
```
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install -e .
python tokenization_2.py
```
That should trigger the same error. Then you can open `tokenizers/src/models/unigram/trainer.rs` at line 203 and replace `suffix` with `suffix_rs`,
then rebuild and rerun:
```
pip install -e .
python tokenization_2.py
```
Other than that, you could also try to provide sufficient information so we could reproduce the bug.
Cheers.
I expect this error to be linked to an overflowing i32; unfortunately, we don't support u64 in tokenizers. If you are able to rebuild, you could look into making a branch where every i32 becomes i64 or u64. It's probably going to be a bit tedious, but it's doable.
@Narsil, so should I do the rebuild with the steps you mentioned previously, or the ones in the latest comment? If it's the latest one, where should I make the change for the overflow?
The first one is to make sure that's actually the case; the second would be to solve it.
My first priority would be to confirm the intuition is correct, and only then do the rebuild. I'm mentioning it ahead of time so we don't have to go back and forth.
If you can provide something reproducible I would happily take care of it too btw.
@Narsil, have you tested the tokenizer trainer on a really large dataset? Consider a dataset of roughly >25M training examples. I have done nothing special, just passed a large dataset to the trainer. For example, consider this dataset: https://drive.google.com/file/d/1rbF3daJjCsa1-fu2GANeJd2FBXos1ugD/view?usp=sharing
The error should be reproducible with this. I tried converting all u32 to u64, but I'm getting many errors.
Do you mind also sharing your tokenizer config? (pre_tokenizers, normalizers, etc.)
They have a big impact on the numbers, and therefore on the probability of overflow.
Here is the full script:
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram, WordPiece
from tokenizers.trainers import UnigramTrainer, WordPieceTrainer
from tokenizers.normalizers import NFKC
from sacremoses import MosesTokenizer
from transformers import PreTrainedTokenizerFast
import argparse
import fileinput
from tqdm import tqdm


def prepare_data(filename, pretokenizer):
    data = []
    with open(filename, 'r') as f:
        lines = f.readlines()
        for line in tqdm(lines):
            sent1, sent2 = line.strip().lower().split('\t')
            sent1 = pretokenizer.tokenize(sent1)
            sent2 = pretokenizer.tokenize(sent2)
            # print(sentences)
            data.append(" ".join(sent1))
            data.append(" ".join(sent2))
    return data


parser = argparse.ArgumentParser()
parser.add_argument('--paranmt-file')
args = parser.parse_args()

tokenizer = Tokenizer(Unigram())
# tokenizer = Tokenizer(WordPiece())
tokenizer.normalizer = NFKC()
trainer = UnigramTrainer(unk_token='[UNK]', special_tokens=["[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=50000)
# trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=50000)

print('Preparing data...')
data = prepare_data(args.paranmt_file, pretokenizer=MosesTokenizer())
print('Training tokenizer...')
tokenizer.train_from_iterator(data, trainer)
print('Saving file...')
tokenizer.save("tokenizer_uni_25M_50k.json")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer.save_pretrained('./')
```
Is this problem solved? I ran into a similar problem.
It ran successfully on a 50M dataset, but when running the same code on a 5 GB dataset, a similar error is reported.
Do you have a reproducible script ? It sounds like a buffer overflow.
This library only uses u32 for most counting, meaning that large datasets (especially without careful pretokenization) are likely to trigger overflows, which can cause pretty much arbitrary damage.
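As an illustration of what careful pretokenization can look like here, a minimal sketch assuming the `Whitespace` pre-tokenizer and placeholder file/vocab values (not a prescribed fix):

```python
# Sketch: attach a pre-tokenizer so the Unigram trainer counts word-sized
# pieces instead of whole lines. Without one, each entire line is treated as
# a single "word", so the text handed to the suffix array grows with the
# corpus; splitting first keeps it roughly bounded by the set of unique words.
# Whitespace() and the file/vocab values below are illustrative choices.
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import Unigram
from tokenizers.normalizers import NFKC
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = UnigramTrainer(unk_token="[UNK]", vocab_size=50000)
tokenizer.train(["large_corpus.txt"], trainer)
```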
For such large datasets, sentencepiece supports using u64 instead, which should work better (it's then possible to convert the model after the fact).
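A minimal sketch of that sentencepiece route, assuming placeholder file names and vocab size; `train_extremely_large_corpus` is sentencepiece's documented option for increasing the internal bit depth during unigram training on huge corpora:

```python
# Sketch of the sentencepiece fallback for a very large corpus. File names and
# vocab size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="large_corpus.txt",       # one sentence per line
    model_prefix="unigram_large",   # writes unigram_large.model / .vocab
    model_type="unigram",
    vocab_size=50000,
    train_extremely_large_corpus=True,
)
```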
If someone is interested in adding u64 support, help is appreciated.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.