tokenizers

PanicException for Result::unwrap()

Open Namco0816 opened this issue 3 years ago • 4 comments

I was trying to train a Unigram tokenizer with the T5 settings. The tokenizer I used is the one provided in the transformers examples.

The training script is almost the same as the example above. However, I got the error below:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: TryFromIntError(())', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/esaxx-rs-0.1.5/src/lib.rs:78:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
   2: core::option::expect_none_failed
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1329:5
   3: esaxx_rs::suffix
   4: tokenizers::models::unigram::trainer::UnigramTrainer::do_train
   5: <tokenizers::models::TrainerWrapper as tokenizers::tokenizer::Trainer>::train
   6: <tokenizers::trainers::PyTrainer as tokenizers::tokenizer::Trainer>::train
   7: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train
   8: tokenizers::utils::iter::ResultShunt<I,E>::process
   9: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  10: pyo3::python::Python::allow_threads
  11: tokenizers::tokenizer::PyTokenizer::train_from_iterator
  12: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap::{{closure}}
  13: tokenizers::tokenizer::__init10915892733224078279::__init10915892733224078279::__wrap
  14: cfunction_call_varargs
             at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:743
  15: _PyObject_MakeTpCall
             at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:159
  16: _PyObject_Vectorcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:125
  17: call_function
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4963
  18: _PyEval_EvalFrameDefault
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:3515
  19: PyEval_EvalFrameEx
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:741
  20: _PyEval_EvalCodeWithName
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4298
  21: _PyFunction_Vectorcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:436
  22: _PyObject_Vectorcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:127
  23: method_vectorcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Objects/classobject.c:60
  24: _PyObject_Vectorcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:127
  25: call_function
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4963
  26: _PyEval_EvalFrameDefault
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:3515
  27: PyEval_EvalFrameEx
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:741
  28: function_code_fastcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:284
  29: _PyFunction_Vectorcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Objects/call.c:411
  30: _PyObject_Vectorcall
             at /tmp/build/80754af9/python-split_1634043551344/work/Include/cpython/abstract.h:127
  31: call_function
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4963
  32: _PyEval_EvalFrameDefault
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:3500
  33: PyEval_EvalFrameEx
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:741
  34: _PyEval_EvalCodeWithName
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4298
  35: PyEval_EvalCodeEx
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:4327
  36: PyEval_EvalCode
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/ceval.c:718
  37: run_eval_code_obj
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:1166
  38: run_mod
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:1188
  39: pyrun_file
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:1085
  40: pyrun_simple_file
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:439
  41: PyRun_SimpleFileExFlags
             at /tmp/build/80754af9/python-split_1634043551344/work/Python/pythonrun.c:472
  42: pymain_run_file
             at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:391
  43: pymain_run_python
             at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:616
  44: Py_RunMain
             at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:695
  45: Py_BytesMain
             at /tmp/build/80754af9/python-split_1634043551344/work/Modules/main.c:1127
  46: __libc_start_main
  47: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Traceback (most recent call last):
  File "train_tokenizer.py", line 103, in <module>
    main()
  File "train_tokenizer.py", line 98, in main
    tokenizer.train_from_iterator(batch_iterator(input_sentence_size = None, dataset = dataset),
  File "/mnt/cache/t5-pretrain/t5_tokenizer_model.py", line 102, in train_from_iterator
    self._tokenizer.train_from_iterator(iterator, trainer=trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(())

After the batch_iterator retrieved all the elements in the dataset, this error occurred.

Namco0816 avatar Jan 21 '22 03:01 Namco0816

The dataset is a 26 GB Chinese text dataset.

Namco0816 avatar Jan 21 '22 03:01 Namco0816

Hi @Namco0816, your dataset is probably big enough to exceed the range of i32 (max 2147483647).

This is unfortunately a known limitation of this library, which doesn't gracefully upgrade to u64 when such big datasets are used.
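To make the limit concrete, here is a rough check. It assumes the failing conversion is on a count of characters in the training text; the thread does not say whether the unit is characters or bytes, and the 3-bytes-per-character figure is only a typical value for Chinese in UTF-8. `I32_MAX`, `total_chars`, and `fits_in_i32` are illustrative names, not part of tokenizers.

```python
I32_MAX = 2**31 - 1  # 2147483647, the largest value an i32 can hold


def total_chars(texts):
    """Count characters across an iterable of strings."""
    return sum(len(t) for t in texts)


def fits_in_i32(n):
    """Return True if n is representable as a signed 32-bit integer."""
    return -(2**31) <= n <= I32_MAX


# Assumption: a 26 GB UTF-8 corpus at roughly 3 bytes per character.
approx_chars = 26 * 10**9 // 3
print(fits_in_i32(approx_chars))  # False
```

Running `total_chars` over the same iterator that feeds training gives the real number for a given dataset, at the cost of one extra pass over the data.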

Options you have:

  • Limit the size of your dataset (tokenizers have diminishing returns, so training on maybe 2 GB is not so bad, but I can't confirm that specifically for Chinese).
  • Use sentencepiece, which supports a graceful upgrade to u64, then convert the result back to tokenizers afterwards (there are example scripts in the transformers library, or I can help you).
  • Start a PR to use u64 when the values are too large. This is likely to be significant work, but it would be very welcome, and I can help you with guidance if you want.
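For the first option, one possible sketch is to wrap the text iterator so it stops once a character budget is reached. `capped_batch_iterator`, `max_chars`, and `batch_size` are hypothetical names, and the right budget is an assumption to tune rather than a documented threshold.

```python
def capped_batch_iterator(texts, max_chars, batch_size=1000):
    """Yield lists of strings, stopping before max_chars characters are exceeded."""
    seen = 0
    batch = []
    for text in texts:
        # Stop at the first text that would push the total past the budget.
        if seen + len(text) > max_chars:
            break
        seen += len(text)
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit whatever is left in the final, partially filled batch.
    if batch:
        yield batch
```

The result can be passed to `train_from_iterator` in place of the original iterator. Note that this keeps the first part of the dataset rather than a random sample, so shuffling beforehand is probably worthwhile if the data is ordered by source or topic.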

Cheers.

Narsil avatar Jan 21 '22 08:01 Narsil

Thank you for your response. I will train the tokenizer with SentencePiece and convert the model to a T5Tokenizer. Thank you!

Namco0816 avatar Jan 24 '22 10:01 Namco0816

Same issue encountered when training an XLNet tokenizer on a 100+ GB dataset.



thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: TryFromIntError(())', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/esaxx-rs-0.1.7/src/lib.rs:78:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
   2: core::result::unwrap_failed
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/result.rs:1749:5
   3: esaxx_rs::suffix
   4: tokenizers::models::unigram::trainer::UnigramTrainer::do_train
   5: <tokenizers::models::TrainerWrapper as tokenizers::tokenizer::Trainer>::train
   6: <tokenizers::trainers::PyTrainer as tokenizers::tokenizer::Trainer>::train
   7: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train
   8: tokenizers::utils::iter::ResultShunt<I,E>::process
   9: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  10: pyo3::python::Python::allow_threads
  11: tokenizers::tokenizer::PyTokenizer::train_from_iterator
  12: std::panicking::try
  13: tokenizers::tokenizer::__init2748433529733916248::__wrap
  14: cfunction_call_varargs
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:743:19
  15: PyCFunction_Call
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:773
  16: _PyObject_MakeTpCall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:159
  17: _PyObject_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:125:16
  18: _PyObject_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:115:1
  19: call_function
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
  20: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3515
  21: PyEval_EvalFrameEx
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
  22: _PyEval_EvalCodeWithName
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4298
  23: _PyFunction_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:436
  24: _PyObject_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127
  25: method_vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/classobject.c:60
  26: _PyObject_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127:11
  27: call_function
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
  28: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3515
  29: PyEval_EvalFrameEx
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
  30: _PyEval_EvalCodeWithName
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4298
  31: _PyFunction_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:436
  32: _PyObject_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127
  33: method_vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/classobject.c:60
  34: _PyObject_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127:11
  35: call_function
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
  36: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3515
  37: PyEval_EvalFrameEx
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
  38: function_code_fastcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:284
  39: _PyFunction_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Objects/call.c:411
  40: _PyObject_Vectorcall
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Include/cpython/abstract.h:127:11
  41: call_function
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4963
  42: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:3500
  43: PyEval_EvalFrameEx
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:741:12
  44: _PyEval_EvalCodeWithName
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4298
  45: PyEval_EvalCodeEx
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:4327
  46: PyEval_EvalCode
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/ceval.c:718
  47: run_eval_code_obj
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:1166
  48: run_mod
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:1188
  49: pyrun_file
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:1085
  50: pyrun_simple_file
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:439
  51: PyRun_SimpleFileExFlags
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Python/pythonrun.c:472
  52: pymain_run_file
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:391
  53: pymain_run_python
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:616:21
  54: Py_RunMain
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:695
  55: Py_BytesMain
             at /home/conda/feedstock_root/build_artifacts/python-split_1631566923692/work/Modules/main.c:1127
  56: __libc_start_main
  57: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Traceback (most recent call last):
  File "train_tokenizer.py", line 245, in <module>
    main()
  File "train_tokenizer.py", line 241, in main
    trainer.train(args.corpus_dir_path, save_dir=args.save_to)
  File "train_tokenizer.py", line 149, in train

LiutongZhou avatar Jun 03 '22 19:06 LiutongZhou

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 01 '24 01:03 github-actions[bot]