cassandra Add RangeIterator#intersect to optimize RangeIntersectionIterator

An attempt to optimize the RangeItersectionIterator by introducing a new abstraction and an intersect method.

Dec 11 '23 03:12 michaeljmarshall

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

6 New issues
0 Security Hotspots
82.0% Coverage on New Code
6.7% Duplication on New Code

See analysis details on SonarCloud

Dec 13 '23 21:12 sonarqubecloud[bot]

So we have a 14% win on the hornet lat/lon simulated queries and a bunch of changes that could be responsible:

the nextPosting vs advance optimization in PLRI
adding intersect() to avoid compareTo
more?

I suspect that (2) doesn't actually buy us much since compareTo ends up getting called anyway. But, I could be wrong.

Dec 14 '23 00:12 jbellis

After spending the half a day working on this, I'm not entirely sure that this PR provides the impact I thought it did.

At a high level, my motivation for thinking we can do better than the current implementation is:

The PostingListRangeIterator has enough information to determine if the skipTo primary key is an exact match, and in an intersection, that should give us enough information to avoid materializing the known-to-match primary key. For low cardinality cases, this is valuable. The main drawback to this implementation is the way exactRowIdOrInvertedCeiling is implemented. It is not optimal for Primary Keys with clustering columns.
RangeUnionIterator continues to compare Primary Keys even after we have found a match. This feels inefficient, though it might be necessary to properly de-duplicate results.

In working on this today, I realized we might be able to make a simpler solution work. This PR https://github.com/datastax/cassandra/pull/990 uses exactRowIdOrInvertedCeiling to return the same skipTo token without creating a new one, which should speed up comparison because we won't materialize the matched PrimaryKey more than once.

Finally, I think the main challenge is to know how to measure these optimizations. I think you've build some valuable comparison tooling @pkolaczk. Are you able to share that?

Feb 13 '24 06:02 michaeljmarshall

@pkolaczk - this is ready for re-review. It is passing tests. If you're able, it'd be interesting to know what the performance improvements look like on your test data set.

Feb 16 '24 06:02 michaeljmarshall