tantivy
Indexing an array of small pieces of text is much slower than indexing a big string
I don't yet know whether there is room for optimization here, but the performance drop is significant: on a dataset of documents whose body is a big array of small text values, I observed a 25% drop in indexing throughput.
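For context, here is a minimal sketch of the two document shapes being compared, assuming the pre-0.22 tantivy API where Document is a concrete type (the field name body matches the hdfs dataset; the example values and the exact split are made up):

```rust
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Document, Index, TantivyError};

fn main() -> Result<(), TantivyError> {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
    let index_writer = index.writer(100_000_000)?;

    // A document whose body is one big string (example value is made up).
    index_writer.add_document(doc!(
        body => "INFO dfs.DataNode$PacketResponder: Received block blk_123 of size 67108864"
    ))?;

    // The same kind of content, but stored as many small values on the same
    // field; this is the shape that indexes much more slowly.
    let mut multi_valued = Document::default();
    for piece in ["INFO", "dfs.DataNode$PacketResponder:", "Received block", "blk_123 of size 67108864"] {
        multi_valued.add_text(body, piece);
    }
    index_writer.add_document(multi_valued)?;
    Ok(())
}
```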
I pushed a file hdfs-with-array.json (the hdfs JSON with a body made of an array of text values) to a branch to reproduce the performance drop in the bench. Here are the results:
Benchmarking index-hdfs/index-hdfs-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 11.4s, or reduce sample count to 10
index-hdfs/index-hdfs-no-commit
time: [538.24 ms 541.50 ms 545.32 ms]
change: [-3.6931% -2.4507% -1.2364%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
1 (5.00%) high mild
Benchmarking index-hdfs/index-hdfs-with-array-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 17.7s, or reduce sample count to 10.
index-hdfs/index-hdfs-with-array-no-commit
time: [879.80 ms 883.73 ms 888.34 ms]
change: [-2.6182% -1.1315% +0.0697%] (p = 0.12 > 0.05)
No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
Here is the branch: https://github.com/quickwit-oss/tantivy/tree/fmassot--bench-hdfs-with-array
Here's a diff of the profile.
There's some noise, but we can see the allocation overhead for the array (the array itself plus the multiple strings), slower deserialization, and a slower value_from_json.
The 9.61% +4.74% libc.so.6 [.] 0x000000000005a0d9 line could be another alloc call, e.g. calloc.
Check IO/CPU overload!
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object                 Symbol
# ........  .........  ............................  ......
#
24.16% -9.51% index_bench-075aa16d259e3fd0 [.] crossbeam_channel::flavors::array::Channel<T>::send
9.61% +4.74% libc.so.6 [.] 0x000000000005a0d9
0.87% +3.53% index_bench-075aa16d259e3fd0 [.] crossbeam_channel::context::Context::wait_until
13.41% -2.98% index_bench-075aa16d259e3fd0 [.] <tantivy::postings::postings_writer::SpecializedPostingsWriter<Rec> as tantivy::postings::postings_writer::PostingsWriter>::subscribe
9.31% -2.32% index_bench-075aa16d259e3fd0 [.] core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
0.14% +2.01% index_bench-075aa16d259e3fd0 [.] <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
3.24% +1.57% index_bench-075aa16d259e3fd0 [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
0.96% +1.27% libc.so.6 [.] malloc
4.85% -1.20% libm.so.6 [.] 0x0000000000067970
0.38% +1.10% index_bench-075aa16d259e3fd0 [.] tantivy::schema::field_type::FieldType::value_from_json
0.36% +1.06% libc.so.6 [.] cfree
3.48% -0.98% index_bench-075aa16d259e3fd0 [.] tantivy::postings::stacker::expull::ExpUnrolledLinkedListWriter::write_u32_vint
1.14% +0.95% index_bench-075aa16d259e3fd0 [.] tantivy::schema::schema::Schema::json_object_to_doc
3.15% -0.88% index_bench-075aa16d259e3fd0 [.] murmurhash32::murmurhash2::murmurhash2
0.62% +0.63% index_bench-075aa16d259e3fd0 [.] alloc::slice::merge_sort
1.83% -0.46% libm.so.6 [.] exp
1.17% -0.43% index_bench-075aa16d259e3fd0 [.] serde_json::de::from_str
1.59% -0.34% index_bench-075aa16d259e3fd0 [.] <serde_json::read::StrRead as serde_json::read::Read>::parse_str
2.31% -0.31% index_bench-075aa16d259e3fd0 [.] <tantivy::tokenizer::lower_caser::LowerCaserTokenStream as tantivy::tokenizer::tokenizer::TokenStream>::advance
0.14% +0.31% index_bench-075aa16d259e3fd0 [.] core::hash::BuildHasher::hash_one
0.09% +0.30% index_bench-075aa16d259e3fd0 [.] <tantivy::tokenizer::simple_tokenizer::SimpleTokenizer as tantivy::tokenizer::tokenizer::Tokenizer>::token_stream
2.49% -0.29% index_bench-075aa16d259e3fd0 [.] core::str::iter::SplitInternal<P>::next
0.92% -0.25% index_bench-075aa16d259e3fd0 [.] tantivy_common::vint::serialize_vint_u32
0.05% +0.24% index_bench-075aa16d259e3fd0 [.] itertools::groupbylazy::GroupBy<K,I,F>::step
Thanks for the report.
I moved the JSON parsing and document building out of the benchmark and I still see almost a factor of two between array and no array (I updated the branch).
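For illustration, pre-parsing could look roughly like this (an assumed sketch built on Schema::parse_document from the pre-0.22 API, not necessarily the branch's exact code):

```rust
use tantivy::schema::Schema;
use tantivy::Document;

// Parse every JSON line into a Document once, before the measured section,
// so that the benchmark itself only times indexing.
fn load_docs(schema: &Schema, json_lines: &str) -> Vec<Document> {
    json_lines
        .lines()
        .map(|line| schema.parse_document(line).expect("invalid doc"))
        .collect()
}
```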
index-hdfs/index-hdfs-no-commit
time: [467.03 ms 468.46 ms 470.16 ms]
change: [-53.083% -52.861% -52.630%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
1 (5.00%) high mild
Benchmarking index-hdfs/index-hdfs-with-array-no-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 16.7s, or reduce sample count to 10.
index-hdfs/index-hdfs-with-array-no-commit
time: [810.60 ms 814.71 ms 819.59 ms]
change: [-57.361% -57.112% -56.796%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
2 (10.00%) high mild
@trinity-1686a can you have a look at this?
I've slightly modified the bench so that it does not include cloning the Vec<Document> or getting an index writer.
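A hypothetical sketch of how criterion's iter_batched can keep both of those steps in the untimed setup closure (again assuming the pre-0.22 API; this is not the branch's actual bench code):

```rust
use criterion::{BatchSize, Criterion};
use tantivy::schema::Schema;
use tantivy::{Document, Index, IndexWriter};

// `docs` is assumed to be the Vec<Document> pre-parsed from hdfs-with-array.json.
fn bench_indexing(c: &mut Criterion, schema: &Schema, docs: &[Document]) {
    c.bench_function("index-hdfs/index-hdfs-with-array-no-commit", |b| {
        b.iter_batched(
            // setup (untimed): clone the pre-parsed documents and build a fresh
            // in-RAM index and writer for this iteration
            || {
                let index = Index::create_in_ram(schema.clone());
                (index.writer(100_000_000).unwrap(), docs.to_vec())
            },
            // routine (timed): only the add_document calls
            |(writer, docs): (IndexWriter, Vec<Document>)| {
                for doc in docs {
                    writer.add_document(doc).unwrap();
                }
                // return the writer so its drop is deferred outside the timed section
                writer
            },
            BatchSize::LargeInput,
        )
    });
}
```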
index-hdfs/index-hdfs-no-commit
time: [5.8195 s 5.9066 s 5.9877 s]
index-hdfs/index-hdfs-with-array-no-commit
time: [8.3026 s 8.3140 s 8.3286 s]
ratio = x1.40
perf samples:
index-hdfs/index-hdfs-no-commit
4981
index-hdfs/index-hdfs-with-array-no-commit
6904
ratio = 1.39, delta = 1923
Most of the delta is in:
- drop_in_place/cfree (873 more samples, from 79): either dropping Documents (not much can be done about that) or BoxTokenStreams and their inner fields
- TextAnalyzer::token_stream (405 more samples, from 48): allocating BoxToken{Stream,Filter} or String::with_capacity
- TokenStream::advance (311 more samples, from 980): almost everything is a bit slower inside, but nothing really stands out
Together they account for 1589 of the 1923 extra samples, or 82% of the difference, with about 66% of it being allocation/free.
So the TL;DR is that allocation/free-related operations take roughly 10x more time when we multiply the number of field values by 9.2 while keeping the same amount of raw content (from 79 + 48 = 127 samples to roughly 1405 out of 6904). That makes allocation go from negligible to ~20% of the time spent. I guess the next step will be to figure out a way to reuse things instead of dropping and creating new ones.
- We could avoid the Document cfree overhead by just referencing the unserialized JSON text blocks (similar to serde_json_borrow; see the sketch after this list). It would require lifetimes on tantivy's Value though.
- The tokenization could be more efficient; currently the filters are applied on every text to be tokenized:
for token_filter in &self.token_filters {
    token_stream = token_filter.transform(token_stream);
}
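Going back to the first point, a borrowed value in the spirit of serde_json_borrow could look roughly like this (purely illustrative; this is not tantivy's actual Value):

```rust
use std::borrow::Cow;

// A value whose text variant can borrow directly from the JSON buffer it was
// parsed from, so building and dropping a document does not allocate and free
// one String per small text piece.
enum BorrowedValue<'a> {
    Str(Cow<'a, str>),
    U64(u64),
    I64(i64),
    F64(f64),
}

// The lifetime has to surface on the document type too, which is the intrusive
// part: tantivy's Value (and everything holding it) would need a lifetime parameter.
struct BorrowedDocument<'a> {
    // (field id, value) pairs; u32 is a stand-in for tantivy's Field
    field_values: Vec<(u32, BorrowedValue<'a>)>,
}
```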
On the tokenization side, SimpleTokenizer uses a default Token, which may use too large a text capacity:
impl Default for Token {
    fn default() -> Token {
        Token {
            offset_from: 0,
            offset_to: 0,
            position: usize::MAX,
            text: String::with_capacity(200),
            position_length: 1,
        }
    }
}
Interestingly, halving the default text capacity makes both benchmarks slightly slower on my system (though possibly within the margin of error). This might be very allocator-dependent; I guess mine is roughly constant-time with respect to allocation size.
https://github.com/quickwit-oss/tantivy/pull/2062 removes the Token::default creation for each text encountered by keeping a shared one on the tokenizer.
index-hdfs/index-hdfs-no-commit
time: [644.80 ms 648.41 ms 652.51 ms]
change: [-4.1369% -2.7259% -1.3721%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
index-hdfs/index-hdfs-with-array-no-commit
time: [800.76 ms 803.53 ms 808.09 ms]
change: [-5.1855% -4.7265% -4.0817%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
Ideally the allocation for the BoxTokenStream per text would also be removed, but this may require some lifetime tricks.
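One possible shape for that, as a standalone sketch (not tantivy code): let the stream borrow both the input text and a Token buffer owned by the tokenizer, in the spirit of #2062, so that producing a stream for each text performs no heap allocation.

```rust
// Standalone sketch: a tokenizer that owns a reusable Token and hands out a
// stream borrowing both the token and the input text, so no Box or String is
// allocated per text.
struct Token {
    offset_from: usize,
    offset_to: usize,
    position: usize,
    text: String,
}

struct ReusableTokenizer {
    token: Token, // shared buffer, reused across texts
}

struct BorrowedTokenStream<'a> {
    text: &'a str,
    cursor: usize,
    token: &'a mut Token,
}

impl ReusableTokenizer {
    fn token_stream<'a>(&'a mut self, text: &'a str) -> BorrowedTokenStream<'a> {
        self.token.position = usize::MAX;
        BorrowedTokenStream { text, cursor: 0, token: &mut self.token }
    }
}

impl<'a> BorrowedTokenStream<'a> {
    // Whitespace tokenization, writing each token into the shared buffer.
    fn advance(&mut self) -> bool {
        let rest = &self.text[self.cursor..];
        let start = match rest.find(|c: char| !c.is_whitespace()) {
            Some(i) => self.cursor + i,
            None => return false,
        };
        let end = self.text[start..]
            .find(char::is_whitespace)
            .map_or(self.text.len(), |i| start + i);
        self.token.offset_from = start;
        self.token.offset_to = end;
        self.token.position = self.token.position.wrapping_add(1);
        self.token.text.clear();
        self.token.text.push_str(&self.text[start..end]); // reuses the String buffer
        self.cursor = end;
        true
    }
}
```

The remaining cost is copying each token's bytes into the shared buffer, which is hard to avoid as long as filters such as LowerCaser need to mutate the token text.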
After #2062, the main differences I see left in the flamegraphs are:
- dropping the document being indexed (almost negligible with full strings, around 9% of samples for pre-split strings; creating the document is outside of my measured loop, but in practice it is probably impactful too). This may be heavily impacted by #2071.
- creating and dropping boxed tokenizers (again negligible with full strings, around 4% of perf samples for pre-split strings). It will probably be a pain to remove that last box.
There has been substantial progress on this issue. Indexing arrays of small texts is still somewhat slower than indexing a single big string, but much less so than before. I'm unlikely to work more on this right now, so I'm unassigning myself.