tantivy

Expull: replace read_to_end with iterator over bytes

Open PSeitz opened this issue 3 years ago • 3 comments

read_to_end in the Exponential Unrolled List (expull) may consume a lot of memory. It is used by the posting list recorder, so the buffer it fills contains all doc ids (plus optional positions and term frequencies) for one term. This PR replaces the copy with an iterator over the bytes.

PSeitz, Mar 20 '22 04:03
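To illustrate the difference between the two access patterns, here is a minimal sketch. It is not tantivy's code: `BlockList`, `block_cap`, `iter_bytes`, and the block sizes are hypothetical names and values chosen for the example. The list stores bytes in blocks whose capacity doubles up to a cap. `read_to_end` copies everything into one contiguous buffer, while `iter_bytes` walks the blocks in place and allocates nothing.

```rust
/// Hypothetical stand-in for an unrolled list: bytes live in blocks whose
/// capacity doubles (up to a cap), so appending never moves existing data.
struct BlockList {
    blocks: Vec<Vec<u8>>,
}

const FIRST_BLOCK_CAP: usize = 4; // assumed value, for illustration only
const MAX_BLOCK_CAP: usize = 1 << 15; // assumed value, for illustration only

/// Capacity of the block at position `index`.
fn block_cap(index: usize) -> usize {
    (FIRST_BLOCK_CAP << index.min(16)).min(MAX_BLOCK_CAP)
}

impl BlockList {
    fn new() -> Self {
        BlockList { blocks: Vec::new() }
    }

    /// Append one byte, opening a new (larger) block when the last one is full.
    fn push(&mut self, byte: u8) {
        let full = match self.blocks.last() {
            Some(last) => last.len() == block_cap(self.blocks.len() - 1),
            None => true,
        };
        if full {
            let cap = block_cap(self.blocks.len());
            self.blocks.push(Vec::with_capacity(cap));
        }
        self.blocks.last_mut().unwrap().push(byte);
    }

    /// Copying variant: `out` grows to the size of the whole list.
    fn read_to_end(&self, out: &mut Vec<u8>) {
        for block in &self.blocks {
            out.extend_from_slice(block);
        }
    }

    /// Streaming variant: yields the same bytes without a second buffer.
    fn iter_bytes(&self) -> impl Iterator<Item = u8> + '_ {
        self.blocks.iter().flat_map(|block| block.iter().copied())
    }
}
```

With the copying variant, peak memory is roughly twice the size of the list for the duration of the read. With the iterator, the only extra state is the position inside the current block.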

Codecov Report

Merging #1319 (f453a6f) into main (46d5de9) will increase coverage by 0.02%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1319      +/-   ##
==========================================
+ Coverage   94.25%   94.27%   +0.02%     
==========================================
  Files         232      232              
  Lines       40801    40790      -11     
==========================================
- Hits        38457    38456       -1     
+ Misses       2344     2334      -10     
Impacted Files                      Coverage Δ
common/src/lib.rs                   89.33% <ø> (ø)
common/src/vint.rs                  92.34% <100.00%> (-0.01%) ↓
src/postings/recorder.rs            98.26% <100.00%> (-0.30%) ↓
src/postings/stacker/expull.rs      99.10% <100.00%> (+0.05%) ↑
src/store/index/mod.rs              98.37% <0.00%> (+0.54%) ↑
src/indexer/segment_updater.rs      95.93% <0.00%> (+1.04%) ↑
src/fastfield/serializer/mod.rs     92.75% <0.00%> (+1.44%) ↑


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 46d5de9...f453a6f.

codecov-commenter, Mar 20 '22 04:03

I prefer to work with a &[u8]. It makes it much easier to optimize things.

Did you observe a performance regression / improvement? Did it shave off the memory peaks you observed before during indexing?

fulmicoton, Mar 21 '22 01:03

> Did you observe a performance regression / improvement? Did it shave off the memory peaks you observed before during indexing?

I didn't see an impact on indexing performance.

Before the change I noticed a big allocation (33.6 MB) caused by read_to_end; with this change it is gone. That makes sense, since the posting lists can get huge for some terms.

> I prefer to work with a &[u8]. It makes it much easier to optimize things.

For decompressing a single vint, an iterator was already used. I agree that a &[u8] is easier to optimize. What I would prefer here is to end each block on a vint boundary, so that we can create a vint iterator over the blocks.

PSeitz, Mar 22 '22 02:03
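The idea of ending every block on a vint boundary can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `VintBlocks`, `write_vint`, and `read_vint` are hypothetical names, and the encoding is a generic LEB128-style varint (continuation bit set on every byte except the last), which may differ from tantivy's own vint format. Because no value is ever split across two blocks, each block can be decoded as a plain `&[u8]`, and the iterator simply chains the per-block decoders.

```rust
/// Append `value` as a LEB128-style varint (assumed encoding, see lead-in).
fn write_vint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value & 0x7f) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Decode one varint from the front of `data`.
/// Returns the value and the number of bytes consumed, or None if `data`
/// is empty or no terminating byte is found within 5 bytes.
fn read_vint(data: &[u8]) -> Option<(u32, usize)> {
    let mut value = 0u32;
    for (i, &byte) in data.iter().enumerate().take(5) {
        value |= u32::from(byte & 0x7f) << (7 * i);
        if (byte & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Hypothetical block store in which a vint never straddles two blocks.
struct VintBlocks {
    block_cap: usize,
    blocks: Vec<Vec<u8>>,
}

impl VintBlocks {
    fn new(block_cap: usize) -> Self {
        // A u32 varint needs at most 5 bytes, so every value fits in a block.
        assert!(block_cap >= 5);
        VintBlocks { block_cap, blocks: Vec::new() }
    }

    /// Encode `value`; if it does not fit in the last block, start a new
    /// block instead of splitting the encoded bytes.
    fn push(&mut self, value: u32) {
        let mut buf = Vec::with_capacity(5);
        write_vint(value, &mut buf);
        let fits = self
            .blocks
            .last()
            .map_or(false, |b| b.len() + buf.len() <= self.block_cap);
        if !fits {
            self.blocks.push(Vec::with_capacity(self.block_cap));
        }
        self.blocks.last_mut().unwrap().extend_from_slice(&buf);
    }

    /// Decode block by block; each block is handled as a plain byte slice.
    fn iter(&self) -> impl Iterator<Item = u32> + '_ {
        self.blocks.iter().flat_map(|block| {
            let mut rest: &[u8] = block;
            std::iter::from_fn(move || {
                let (value, len) = read_vint(rest)?;
                rest = &rest[len..];
                Some(value)
            })
        })
    }
}
```

The trade-off in this sketch is a few unused bytes at the end of a block whenever the next value does not fit, in exchange for decoding that works on contiguous slices and never has to stitch a value together across a block boundary.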