
Optimize scanning of string offsets to skip pages we don't need to read

Open · benjaminwinger opened this pull request 1 year ago · 5 comments

Previously, we scanned all of the offsets at once to reduce the number of scans, since with dictionary compression the offsets may be out of order even during unfiltered scans (repeated strings reuse earlier dictionary entries, so their offsets can jump backwards), and there is no guarantee that they will be adjacent. That had a bad worst case: for chunks with a large dictionary, you might end up scanning the entire chunk to read just a few pages.

So instead, this change scans in batches, where each batch is computed by finding the next range of values in which no two values are further apart than one page (using the number of values per page calculated from the chunk metadata). A batch may still include unneeded values, but it no longer reads unneeded pages, since every page spanned by a batch contains at least one needed value. This had better performance than scanning only runs of strictly adjacent values.
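For illustration, here is a minimal sketch of the batching idea. This is not the actual implementation: the function name, the assumption that the offsets have already been sorted, and the `numValuesPerPage` parameter are all placeholders for the example. The invariant it maintains is the one described above: within a batch, no gap between consecutive needed offsets exceeds one page's worth of values, so every page in the batch's range is actually needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Group sorted dictionary offsets into batches such that consecutive
// offsets within a batch are at most numValuesPerPage apart. Scanning a
// batch as one contiguous range then touches no page that lacks a needed
// value, while pages between batches can be skipped entirely.
std::vector<std::pair<std::uint64_t, std::uint64_t>> computeBatches(
    const std::vector<std::uint64_t>& sortedOffsets,
    std::uint64_t numValuesPerPage) {
    std::vector<std::pair<std::uint64_t, std::uint64_t>> batches;
    if (sortedOffsets.empty()) {
        return batches;
    }
    std::uint64_t batchStart = sortedOffsets[0];
    std::uint64_t prev = sortedOffsets[0];
    for (std::size_t i = 1; i < sortedOffsets.size(); i++) {
        // A gap of more than one page's worth of values means at least one
        // whole page in between holds no needed value; close the batch so
        // that page is never read.
        if (sortedOffsets[i] - prev > numValuesPerPage) {
            batches.emplace_back(batchStart, prev);
            batchStart = sortedOffsets[i];
        }
        prev = sortedOffsets[i];
    }
    batches.emplace_back(batchStart, prev);
    return batches;
}
```

Each resulting batch `[start, end]` maps to the contiguous page range `start / numValuesPerPage` through `end / numValuesPerPage`, and any pages falling between batches are skipped.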

I'm concerned that this is a little complicated, but it gives a moderate improvement to string scanning performance (roughly 10% on columns with a moderate amount of duplication).

Here are two sets of benchmarks on the LDBC-10 comment table (I did two runs to make it clearer when differences were due to externalities like system noise; on one of them, the two runs are superimposed). Note that these numbers are without clearing the OS filesystem cache.

I'd also done some cold runs under the same conditions on the locationIP and content columns, and got a speedup of about 7% (the only numbers I have to hand are 859ms average versus 924ms average, for the content column with string IDs; non-string IDs were slightly faster, but showed a comparable difference).

Oddly, the q05 benchmark (scanning browserUsed) is faster with this change, which I find suspicious: the browserUsed column only has 5 distinct values, so its dictionary is tiny and it should get no benefit from skipping pages.

benjaminwinger · Dec 12 '23 17:12