LUCENE-10616: optimizing decompress when only retrieving some fields
Description (or a Jira issue link if you have one)
Change the `decompress` API in `Decompressor` from returning bytes to returning an `InputStream`, in order to implement lazy decompression. Lazy decompression gives us a chance to skip unneeded fields, which matters most when users do not want to load large stored fields.
The key optimization is in the `skip` method, which does not decompress data but bypasses unneeded compressed bytes according to their length. Previously these unneeded bytes were decompressed as well.
jira: https://issues.apache.org/jira/browse/LUCENE-10616
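To make the API change concrete, here is a minimal sketch of the two shapes described above: returning bytes versus returning an `InputStream`. It is not the code in this PR. The class and method names are hypothetical, and it uses the JDK's DEFLATE streams as a stand-in for Lucene's LZ4 and DEFLATE decompressors, so it only illustrates where the decompression work happens.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical illustration only; none of these names are Lucene APIs.
class LazyDecompressSketch {

  /** Helper for the demo: compresses raw bytes with DEFLATE. */
  static byte[] compress(byte[] raw) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (DeflaterOutputStream deflate = new DeflaterOutputStream(out)) {
      deflate.write(raw);
    }
    return out.toByteArray();
  }

  /** Eager shape: the whole payload is decompressed before the caller sees any of it. */
  static byte[] decompressToBytes(byte[] compressed) throws IOException {
    try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      return in.readAllBytes();
    }
  }

  /** Lazy shape: nothing is decompressed until the caller reads from the stream. */
  static InputStream decompressToStream(byte[] compressed) {
    return new InflaterInputStream(new ByteArrayInputStream(compressed));
  }

  public static void main(String[] args) throws IOException {
    byte[] doc = ("title=hello;body=" + "x".repeat(100_000)).getBytes(StandardCharsets.UTF_8);
    byte[] compressed = compress(doc);
    try (InputStream in = decompressToStream(compressed)) {
      // Only the first 11 bytes are requested, so the large body is never inflated.
      String title = new String(in.readNBytes(11), StandardCharsets.US_ASCII);
      System.out.println(title);
    }
  }
}
```

With the stream shape, a caller that only needs the leading field stops reading early and the rest of the payload is left untouched. Note that a plain `InflaterInputStream` still inflates bytes when asked to skip them, which is why a length-aware `skip` is the important part of the change.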
Hi @jpountz, I just implemented a `decompress` API that returns an `InputStream` in `LZ4WithPresetDictDecompressor`. Would you take a quick look at the direction of the change and give some advice? The next step is to make the other `Decompressor` implementations return an `InputStream` as well.
Thanks @jpountz for the review and advice. This commit forks the code to `Lucene90` and leaves only one variant, which returns an `InputStream`. Could you take another look and check whether I understood correctly? I also left two comments on which I would like your opinion.
In https://github.com/apache/lucene/pull/1003/commits/4b9086fc1bbb31f0ca36986f3adaa770665215e1 I found an alternative: we can skip unneeded compressed bytes by reading the compressed length. This will significantly decrease decompression time when we only want a few fields.
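The idea of skipping by compressed length can be sketched as follows. This is an illustration under assumptions, not the layout used in the commit: it assumes each field is stored as a raw length, a compressed length, and then the compressed bytes, and it again uses the JDK's `Deflater` and `Inflater` in place of Lucene's compression modes. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical layout and names, for illustration only; not Lucene's stored fields format.
class SkipByLengthSketch {
  /** Counts how many fields were actually inflated, to make the saving visible. */
  static int inflatedFields = 0;

  /** Writes each field as [rawLength:int][compressedLength:int][compressed bytes]. */
  static byte[] write(String... fields) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    byte[] buffer = new byte[1024];
    for (String field : fields) {
      byte[] raw = field.getBytes(StandardCharsets.UTF_8);
      Deflater deflater = new Deflater();
      deflater.setInput(raw);
      deflater.finish();
      ByteArrayOutputStream compressed = new ByteArrayOutputStream();
      while (!deflater.finished()) {
        int n = deflater.deflate(buffer);
        compressed.write(buffer, 0, n);
      }
      deflater.end();
      out.writeInt(raw.length);
      out.writeInt(compressed.size());
      compressed.writeTo(out);
    }
    out.flush();
    return bytes.toByteArray();
  }

  /** Returns field number {@code wanted}, skipping earlier fields without inflating them. */
  static String readField(byte[] data, int wanted) throws DataFormatException {
    ByteBuffer in = ByteBuffer.wrap(data);
    for (int index = 0; index < wanted; index++) {
      in.getInt(); // raw length, not needed for skipping
      int skippedLength = in.getInt();
      in.position(in.position() + skippedLength); // jump over the compressed bytes
    }
    int rawLength = in.getInt();
    int compressedLength = in.getInt();
    Inflater inflater = new Inflater();
    inflater.setInput(data, in.position(), compressedLength);
    byte[] raw = new byte[rawLength];
    int offset = 0;
    while (offset < rawLength) {
      int n = inflater.inflate(raw, offset, rawLength - offset);
      if (n == 0) {
        break; // truncated or corrupt input; stop instead of looping forever
      }
      offset += n;
    }
    inflater.end();
    inflatedFields++;
    return new String(raw, 0, offset, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws Exception {
    byte[] data = write("id-1", "x".repeat(100_000), "title");
    System.out.println(readField(data, 2) + " (fields inflated: " + inflatedFields + ")");
  }
}
```

Reading the third field costs two integer reads and a position change for each skipped field, regardless of how large the skipped field is. The price is a few extra bytes of length metadata per field.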
No obvious regression or perf improvement; I guess the benchmark has no such cases.
Is this change still relevant? Or did we perhaps achieve laziness on a subset of stored fields in a different way? Thanks @JoeHF!
> No obvious regression or perf improvement; I guess the benchmark has no such cases.
Indeed, Lucene's benchmarks either load all stored fields for a doc or none, so they won't reflect the impact of this nice change.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!