Add RangeIterator#intersect to optimize RangeIntersectionIterator
An attempt to optimize the RangeItersectionIterator by introducing a new abstraction and an intersect method.
Quality Gate passed
The SonarCloud Quality Gate passed, but some issues were introduced.
6 New issues
0 Security Hotspots
82.0% Coverage on New Code
6.7% Duplication on New Code
So we have a 14% win on the hornet lat/lon simulated queries and a bunch of changes that could be responsible:
- the nextPosting vs advance optimization in PLRI
- adding intersect() to avoid compareTo
- more?
I suspect that (2) doesn't actually buy us much since compareTo ends up getting called anyway. But, I could be wrong.
After spending the half a day working on this, I'm not entirely sure that this PR provides the impact I thought it did.
At a high level, my motivation for thinking we can do better than the current implementation is:
- The
PostingListRangeIteratorhas enough information to determine if theskipToprimary key is an exact match, and in an intersection, that should give us enough information to avoid materializing the known-to-match primary key. For low cardinality cases, this is valuable. The main drawback to this implementation is the wayexactRowIdOrInvertedCeilingis implemented. It is not optimal for Primary Keys with clustering columns. RangeUnionIteratorcontinues to compare Primary Keys even after we have found a match. This feels inefficient, though it might be necessary to properly de-duplicate results.
In working on this today, I realized we might be able to make a simpler solution work. This PR https://github.com/datastax/cassandra/pull/990 uses exactRowIdOrInvertedCeiling to return the same skipTo token without creating a new one, which should speed up comparison because we won't materialize the matched PrimaryKey more than once.
Finally, I think the main challenge is to know how to measure these optimizations. I think you've build some valuable comparison tooling @pkolaczk. Are you able to share that?
@pkolaczk - this is ready for re-review. It is passing tests. If you're able, it'd be interesting to know what the performance improvements look like on your test data set.