LUCENE-10616: optimizing decompress when only retrieving some fields

Open JoeHF opened this issue 2 years ago • 4 comments

Description (or a Jira issue link if you have one)

Change the decompress API in Decompressor to return an InputStream instead of bytes, in order to implement lazy decompression. Lazy decompression gives us a chance to skip unneeded fields. This helps especially when users don't want the large stored fields.

The key optimization happens in the Skip method, which does not decompress data but bypasses unneeded compressed bytes according to their length. Previously these unneeded bytes were decompressed as well.
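A minimal sketch of the skipping idea, not the code in this PR. It assumes a made-up layout in which every field is stored as its own block, prefixed with its compressed length and original length, and it uses `java.util.zip` as a stand-in for LZ4. The class name, method names, and layout are hypothetical and are not Lucene's stored fields format or its Decompressor API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only: layout and names are assumptions, not Lucene code.
public class LazyFieldReaderSketch {

  /** Counts how many blocks were actually inflated, to make the saving visible. */
  static int inflatedBlocks = 0;

  /** Writes each field as [compressedLength][originalLength][compressed bytes]. */
  static byte[] writeFields(String... fields) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (String field : fields) {
      byte[] raw = field.getBytes(StandardCharsets.UTF_8);
      Deflater deflater = new Deflater();
      deflater.setInput(raw);
      deflater.finish();
      ByteArrayOutputStream compressed = new ByteArrayOutputStream();
      byte[] buf = new byte[256];
      while (!deflater.finished()) {
        int n = deflater.deflate(buf);
        compressed.write(buf, 0, n);
      }
      deflater.end();
      out.writeInt(compressed.size());
      out.writeInt(raw.length);
      out.write(compressed.toByteArray());
    }
    out.flush();
    return bytes.toByteArray();
  }

  /** Returns field number {@code wanted} (0-based); earlier fields are skipped, not inflated. */
  static String readField(byte[] stored, int wanted) throws IOException, DataFormatException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored));
    for (int i = 0; ; i++) {
      int compressedLength = in.readInt();
      int originalLength = in.readInt();
      if (i < wanted) {
        // Skip path: jump over the compressed bytes using only the length prefix.
        int remaining = compressedLength;
        while (remaining > 0) {
          int skipped = in.skipBytes(remaining);
          if (skipped <= 0) {
            throw new IOException("truncated block");
          }
          remaining -= skipped;
        }
        continue;
      }
      // Read path: only the requested block is inflated.
      byte[] compressed = new byte[compressedLength];
      in.readFully(compressed);
      Inflater inflater = new Inflater();
      inflater.setInput(compressed);
      byte[] raw = new byte[originalLength];
      int off = 0;
      while (off < originalLength && !inflater.finished()) {
        int n = inflater.inflate(raw, off, originalLength - off);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          inflater.end();
          throw new IOException("truncated block");
        }
        off += n;
      }
      inflater.end();
      inflatedBlocks++;
      return new String(raw, StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] stored = writeFields("a large body field that the caller does not need", "short title");
    System.out.println(readField(stored, 1));                 // prints "short title"
    System.out.println("blocks inflated: " + inflatedBlocks); // prints "blocks inflated: 1"
  }
}
```

In this sketch the cost of reaching the second field is one length read and one skip, regardless of how large the first field is. The real change would need the same property from each Decompressor variant.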

jira: https://issues.apache.org/jira/browse/LUCENE-10616

JoeHF avatar Jul 03 '22 08:07 JoeHF

Hi @jpountz, I just implemented a decompress API that returns an InputStream in LZ4WithPresetDictDecompressor. Would you take a quick look at the direction of the change and give some advice? The next step is to make the other Decompressor implementations return an InputStream as well.

JoeHF avatar Jul 08 '22 15:07 JoeHF

Thanks @jpountz for the review and advice. This commit forks the code to Lucene90 and leaves only one variant, which returns an InputStream. Could you take another look and see if I understood correctly? I also left two comments on which I would like your opinion.

JoeHF avatar Jul 14 '22 11:07 JoeHF

In https://github.com/apache/lucene/pull/1003/commits/4b9086fc1bbb31f0ca36986f3adaa770665215e1 I found an alternative: we can skip unneeded compressed bytes by reading the compressed length. This should significantly decrease decompression time when we only want a few fields.

JoeHF avatar Jul 21 '22 16:07 JoeHF

No obvious regression or perf improvement. I guess there are no such cases in the benchmark.

JoeHF avatar Jul 26 '22 06:07 JoeHF

Is this change still relevant? Or did we perhaps achieve laziness on a subset of stored fields in a different way? Thanks @JoeHF!

> no obvious regression or perf improvement, guess there are no such cases in benchmark

Indeed, Lucene's benchmarks either load all stored fields for a doc or none, so they won't reflect the impact of this nice change.

mikemccand avatar Nov 02 '23 10:11 mikemccand

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions[bot] avatar Jan 08 '24 12:01 github-actions[bot]