PanicException for Result::unwrap()
I was trying to train a Unigram tokenizer with the T5 settings. The tokenizer I used is the one provided in the transformers examples.
The training script is almost the same as in that example. However, I got the error below:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: TryFromIntError(())', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/esaxx-rs-0.1.5/src/lib.rs:78:26
stack backtrace:
0: rust_begin_unwind
at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
1: core::panicking::panic_fmt
at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
2: core::option::expect_none_failed
at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1329:5
3: esaxx_rs::suffix
4: tokenizers::models::unigram::trainer::UnigramTrainer::do_train
5: <tokenizers::models::TrainerWrapper as tokenizers::tokenizer::Trainer>::train
6: <tokenizers::trainers::PyTrainer as tokenizers::tokenizer::Trainer>::train
7: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train
8: tokenizers::utils::iter::ResultShunt<I,E>::process
9: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
10: pyo3::python::Python::allow_threads
11: tokenizers::tokenizer::PyTokenizer::train_from_iterator
12: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap::{{closure}}
13: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap
14: cfunction_call_varargs
at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:743
15: _PyObject_MakeTpCall
at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:159
16: _PyObject_Vectorcall
at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:125
17: call_function
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4963
18: _PyEval_EvalFrameDefault
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:3515
19: PyEval_EvalFrameEx
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:741
20: _PyEval_EvalCodeWithName
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4298
21: _PyFunction_Vectorcall
at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:436
22: _PyObject_Vectorcall
at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:127
23: method_vectorcall
at /tmp/build/80754af9/python-split_1634043551344/work/Objects/classobject.c:60
24: _PyObject_Vectorcall
at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:127
25: call_function
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4963
26: _PyEval_EvalFrameDefault
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:3515
27: PyEval_EvalFrameEx
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:741
28: function_code_fastcall
at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:284
29: _PyFunction_Vectorcall
at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:411
30: _PyObject_Vectorcall
at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:127
31: call_function
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4963
32: _PyEval_EvalFrameDefault
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:3500
33: PyEval_EvalFrameEx
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:741
34: _PyEval_EvalCodeWithName
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4298
35: PyEval_EvalCodeEx
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4327
36: PyEval_EvalCode
at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:718
37: run_eval_code_obj
at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:1166
38: run_mod
at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:1188
39: pyrun_file
at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:1085
40: pyrun_simple_file
at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:439
41: PyRun_SimpleFileExFlags
at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:472
42: pymain_run_file
at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:391
43: pymain_run_python
at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:616
44: Py_RunMain
at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:695
45: Py_BytesMain
at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:1127
46: __libc_start_main
47: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
File "train_tokenizer.py", line 103, in <module>
main()
File "train_tokenizer.py", line 98, in main
tokenizer.train_from_iterator(batch_iterator(input_sentence_size = None, dataset = dataset),
File "/mnt/cache/t5-pretrain/t5_tokenizer_model.py", line 102, in train_from_iterator
self._tokenizer.train_from_iterator(iterator, trainer=trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(())
This error occurred after batch_iterator had retrieved all the elements in the dataset.
The dataset is a Chinese text dataset, 26 GB in size.
Hi @Namco0816, your dataset is probably large enough to overflow `i32` (maximum 2147483647).
This is unfortunately a known limitation of this library, which doesn't gracefully upgrade to `u64` when such large datasets are used.
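For a rough sense of scale, here is a minimal sketch of the arithmetic. It assumes the counter that overflows is the total number of characters handed to the suffix-array step, and that Chinese text takes about 3 bytes per character in UTF-8; both are assumptions for illustration, and the helper names are hypothetical.

```python
I32_MAX = 2**31 - 1  # 2147483647, the largest value an i32 can hold


def estimate_units(size_bytes: int, bytes_per_unit: int) -> int:
    """Rough count of units (e.g. characters) in a corpus of the given byte size."""
    return size_bytes // bytes_per_unit


def fits_in_i32(n: int) -> bool:
    """True if n can be converted to an i32 without a TryFromIntError-style failure."""
    return 0 <= n <= I32_MAX


# Assumption: 26 GB of mostly Chinese text at ~3 bytes per character.
approx_chars = estimate_units(26 * 10**9, 3)
print(approx_chars, fits_in_i32(approx_chars))
```

Under these assumptions the corpus holds several billion characters, well past the `i32` limit, which would be consistent with the `TryFromIntError` in the panic message.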
Options you have:
- Limit the size of your dataset (tokenizers have diminishing returns, so training on maybe 2 GB is not so bad, but I can't confirm that specifically for Chinese).
- Use `sentencepiece`, which supports a graceful upgrade to `u64`, then convert the result back to `tokenizers` later (there are example scripts in the `transformers` library, or I can help you).
- Start a PR to use `u64` when the values are too large. This is likely to be significant work, but it would be very welcome, and I can provide guidance if you want.
Cheers.
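For the first option, one possible way to cap the training data is to stop feeding the iterator once a character budget is reached. This is only a sketch: `capped_batch_iterator`, `max_chars`, and `batch_size` are hypothetical names, not part of the original training script, and the right budget for a given corpus is not something this thread establishes.

```python
from typing import Iterable, Iterator, List


def capped_batch_iterator(
    texts: Iterable[str], max_chars: int, batch_size: int = 1000
) -> Iterator[List[str]]:
    """Yield batches of texts, stopping before the total character count exceeds max_chars."""
    used = 0
    batch: List[str] = []
    for text in texts:
        if used + len(text) > max_chars:
            break  # budget exhausted: drop this text and everything after it
        batch.append(text)
        used += len(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

The result can be passed wherever the original `batch_iterator(...)` was passed to `train_from_iterator`. Because it takes a prefix of the data, shuffling the dataset first is probably a good idea so the sample is representative.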
Thank you for your response. I will train the tokenizer with SentencePiece and convert the model to a T5Tokenizer. Thank you!
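For anyone taking the same route, a minimal sketch of the SentencePiece side might look like the following. The file names, vocabulary size, and sample size are placeholders rather than values from this thread, and the helper functions are hypothetical; `sentencepiece` is imported lazily so the argument builder can be inspected without the package installed.

```python
def spm_unigram_args(corpus_path, model_prefix, vocab_size=32000, sample_size=10_000_000):
    """Build keyword arguments for SentencePieceTrainer.train (all values are placeholders)."""
    return dict(
        input=corpus_path,                  # plain-text file, one sentence per line
        model_prefix=model_prefix,          # writes <prefix>.model and <prefix>.vocab
        model_type="unigram",
        vocab_size=vocab_size,
        input_sentence_size=sample_size,    # subsample sentences instead of loading everything
        shuffle_input_sentence=True,
        train_extremely_large_corpus=True,  # asks SentencePiece to use wider integer types
    )


def train_with_sentencepiece(corpus_path, model_prefix):
    """Train a Unigram model and return the path of the resulting .model file."""
    import sentencepiece as spm  # lazy import: only needed when actually training

    spm.SentencePieceTrainer.train(**spm_unigram_args(corpus_path, model_prefix))
    return model_prefix + ".model"
```

The resulting `.model` file can then be loaded on the `transformers` side, for example through `T5Tokenizer(vocab_file=...)`; whether the defaults match the T5 setup used here (special tokens, extra ids) would need to be checked against the example scripts.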
I encountered the same issue when training an XLNet tokenizer on a 100+ GB dataset:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: TryFromIntError(())', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/esaxx-rs-0.1.7/src/lib.rs:78:26
stack backtrace:
0: rust_begin_unwind at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
2: core::result::unwrap_failed at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/result.rs:1749:5
3: esaxx_rs::suffix
4: tokenizers::models::unigram::trainer::UnigramTrainer::do_train
5: <tokenizers::models::TrainerWrapper as tokenizers::tokenizer::Trainer>::train
6: <tokenizers::trainers::PyTrainer as tokenizers::tokenizer::Trainer>::train
7: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train
8: tokenizers::utils::iter::ResultShunt<I,E>::process
9: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
10: pyo3::python::Python::allow_threads
11: tokenizers::tokenizer::PyTokenizer::train_from_iterator
12: std::panicking::try
13: tokenizers::tokenizer::__init2748433529733916248::__wrap
14: cfunction_call_varargs at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:743:19
15: PyCFunction_Call at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:773
16: _PyObject_MakeTpCall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:159
17: _PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:125:16
18: _PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:115:1
19: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
20: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3515
21: PyEval_EvalFrameEx at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
22: _PyEval_EvalCodeWithName at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4298
23: _PyFunction_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:436
24: _PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127
25: method_vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/classobject.c:60
26: _PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127:11
27: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
28: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3515
29: PyEval_EvalFrameEx at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
30: _PyEval_EvalCodeWithName at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4298
31: _PyFunction_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:436
32: _PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127
33: method_vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/classobject.c:60
34: _PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127:11
35: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
36: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3515
37: PyEval_EvalFrameEx at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
38: function_code_fastcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:284
39: _PyFunction_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:411
40: _PyObject_Vectorcall at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127:11
41: call_function at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
42: _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3500
43: PyEval_EvalFrameEx at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
44: _PyEval_EvalCodeWithName at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4298
45: PyEval_EvalCodeEx at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4327
46: PyEval_EvalCode at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:718
47: run_eval_code_obj at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:1166
48: run_mod at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:1188
49: pyrun_file at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:1085
50: pyrun_simple_file at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:439
51: PyRun_SimpleFileExFlags at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:472
52: pymain_run_file at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:391
53: pymain_run_python at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:616:21
54: Py_RunMain at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:695
55: Py_BytesMain at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:1127
56: __libc_start_main
57: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
File "train_tokenizer.py", line 245, in <module>
main()
File "train_tokenizer.py", line 241, in main
trainer.train(args.corpus_dir_path, save_dir=args.save_to)
File "train_tokenizer.py", line 149, in train