cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

sql: make lookup-join deduplication efficient for range lookups

Open DrewKimball opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe. Currently, de-duplication of the spans used for lookup is performed in the lookup joiner logic using a go map keyed on the span start and end keys. This works well for lookup joins that only use equalities (e.g. point-lookups), since in this case spans that overlap are exactly equivalent and can therefore be deduplicated by the map. For range lookups, however, spans may only partially overlap - for example, the start keys of two spans may be the same but the end keys different. Currently, we would not perform any de-duplication in this case and would instead perform two separate lookups, which may involve a lot of redundant work.

Describe the solution you'd like First of all, de-duplication should be moved from the joinReader to the Streamer - see #82155.

For point lookups, it makes sense to use a map as before, since this is simple and imposes little overhead. However, for range lookups, a sort-merge strategy probably makes the most sense in order to eliminate redundant work. This work would only be performed once per input batch in order to de-duplicate. For matching and emitting looked-up rows to input rows, we would still use a simple multi-map slice like spanIDHelper.spanIDToInputRowIndices as before.

Additional context #66002 and #85597 added support for range-based lookup joins, but lookup joins that don't use equality conditions (only inequalities) are still only planned in limited cases in order to avoid performance regressions due to the lack of de-duplication. Fixing this issue will allow us to plan lookup joins in more cases.

Jira issue: CRDB-18498

DrewKimball avatar Aug 11 '22 09:08 DrewKimball