
Indexing too many documents in one commit fails.

Open fulmicoton opened this issue 4 years ago • 5 comments

Context: I am adding rucene to https://github.com/tantivy-search/search-benchmark-game.

It is a search benchmark comparing Lucene, Tantivy, Bleve, and now Rucene. Indexing works, but I have to commit periodically to avoid a panic.

See the following two lines of code and the accompanying comment: https://github.com/tantivy-search/search-benchmark-game/blob/master/engines/rucene-0.1/src/bin/build_index.rs#L103-L104

(I suspect a u32 overflow)

fulmicoton • Dec 23 '19 00:12
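
The periodic-commit workaround mentioned above can be sketched roughly as follows. This is only an illustration: the `Writer` trait, `CountingWriter`, and `index_with_periodic_commit` are hypothetical names invented for this sketch and are not Rucene's API, and the batch size is an arbitrary assumption.

```rust
/// Hypothetical stand-in for an index writer; NOT Rucene's real API.
trait Writer {
    fn add_document(&mut self, doc: &str);
    fn commit(&mut self);
}

/// Toy writer that only counts calls, so the sketch can run on its own.
#[derive(Default)]
struct CountingWriter {
    docs: usize,
    commits: usize,
}

impl Writer for CountingWriter {
    fn add_document(&mut self, _doc: &str) {
        self.docs += 1;
    }
    fn commit(&mut self) {
        self.commits += 1;
    }
}

/// Add every document, committing after each `commit_every` documents so
/// that no single commit has to hold the whole corpus in memory.
/// Returns the number of documents added.
fn index_with_periodic_commit<W, I>(writer: &mut W, docs: I, commit_every: usize) -> usize
where
    W: Writer,
    I: IntoIterator<Item = String>,
{
    assert!(commit_every > 0, "commit_every must be positive");
    let mut pending = 0;
    let mut total = 0;
    for doc in docs {
        writer.add_document(&doc);
        pending += 1;
        total += 1;
        if pending == commit_every {
            writer.commit();
            pending = 0;
        }
    }
    // Flush whatever is left after the last full batch.
    if pending > 0 {
        writer.commit();
    }
    total
}

fn main() {
    let mut writer = CountingWriter::default();
    let docs = (0..10).map(|i| format!("doc {}", i));
    let total = index_with_periodic_commit(&mut writer, docs, 4);
    println!("added {} docs in {} commits", total, writer.commits);
}
```

With 10 documents and a batch size of 4, the toy writer sees commits after documents 4 and 8, plus one final commit for the remaining two.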

FYI, here is the backtrace.

doc 2420000
doc 2430000
doc 2440000
doc 2450000
doc 2460000
doc 2470000
doc 2480000
doc 2490000
doc 2500000
doc 2510000
doc 2520000
doc 2530000
thread 'main' panicked at 'index out of bounds: the len is 65537 but the index is 562949953355776', /rustc/c8ea4ace9213ae045123fdfeb59d1ac887656d31/src/libcore/slice/mod.rs:2806:10
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:84
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:61
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1025
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:65
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:50
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:193
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:210
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:471
  11: rust_begin_unwind
             at src/libstd/panicking.rs:375
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:84
  13: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:62
  14: rucene::core::codec::postings::terms_hash_per_field::TermsHashPerFieldBase<T>::write_byte
  15: rucene::core::codec::postings::terms_hash_per_field::TermsHashPerField::add
  16: rucene::core::index::writer::doc_consumer::DocConsumer<D,C,MS,MP>::process_document
  17: rucene::core::index::writer::doc_writer::DocumentsWriter<D,C,MS,MP>::update_document
  18: build_index::main
  19: std::rt::lang_start::{{closure}}
  20: main
  21: __libc_start_main
  22: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

fulmicoton • Dec 23 '19 02:12

Can you reproduce the panic with RUST_BACKTRACE=full enabled? There are multiple array accesses in TermsHashPerFieldBase<T>::write_byte, so a line number would make it easier to find which one caused the overflow. Thanks.

sunxiaoguang • Dec 23 '19 16:12

I don't have time for this, but you can reproduce it on your own by running

ENGINES=rucene-0.1 make index

in the search benchmark project: https://github.com/tantivy-search/search-benchmark-game

fulmicoton • Dec 24 '19 00:12

Sure, let me try it out.

sunxiaoguang • Dec 24 '19 00:12

@fulmicoton, it is a 2GB limit caused by using i32. We will fix it soon.

jtong11 • Dec 27 '19 03:12
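
For readers wondering how an i32 limit produces an index as large as 562949953355776, here is a minimal arithmetic sketch. It assumes byte offsets are tracked as `i32` and mapped to a buffer index by shifting right by 15 bits (32 KiB blocks, as in Lucene's ByteBlockPool). That block size and the function names are assumptions for illustration, not taken from Rucene's source. Under those assumptions, 65536 blocks cover exactly 2 GiB, and the next offset wraps to `i32::MIN`.

```rust
// Assumption: 32 KiB blocks (shift of 15), mirroring Lucene's ByteBlockPool.
const BLOCK_SHIFT: u32 = 15;

/// Advance a signed 32-bit byte offset by one block, wrapping on overflow
/// (what unchecked release-mode arithmetic would do).
fn next_block_offset(offset: i32) -> i32 {
    offset.wrapping_add(1 << BLOCK_SHIFT)
}

/// Map a byte offset to a block index by casting, then shifting.
/// `u64` stands in for `usize` on a 64-bit target; casting a negative
/// `i32` sign-extends it into a huge unsigned value.
fn block_index(offset: i32) -> u64 {
    (offset as u64) >> BLOCK_SHIFT
}

fn main() {
    // Start of the last block that still fits below 2 GiB.
    let last_ok: i32 = i32::MAX - ((1 << BLOCK_SHIFT) - 1);
    println!("offset {} -> block {}", last_ok, block_index(last_ok));

    // One block further, the offset wraps to i32::MIN.
    let wrapped = next_block_offset(last_ok);
    println!("offset {} -> block {}", wrapped, block_index(wrapped));
    // The second line prints block 562949953355776, the index in the panic.
}
```

The wrapped case evaluates to 2^49 - 2^16 = 562949953355776, which matches the index in the panic message, so the backtrace is at least consistent with a signed 32-bit offset overflowing at 2 GiB.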