feat(7181): provide slicing of CursorValues
Which issue does this PR close?
Part of #7181 .
Rationale for this change
In the cascaded merge tree, partially sorted record batches are yielded per merge node.
Therefore, we need the ability to yield a partial of these representative values (a.k.a. the CursorValues).
What changes are included in this PR?
Slicing the CursorValues.
Add tests to demonstrate use with Cursor::new(sliced_cursor_values).
Have test coverage include different data types.
Are these changes tested?
Yes, tests are added.
Are there any user-facing changes?
No. CursorValues are currently an internal construct.
I plan to review this tomorrow morning
For any questions on the big picture design, referred to the diagrams here.
Note that^^ is a drafted followup PR. I've tried to incorporate some of the naming/wordage changes suggested, but other names have not changed yet (since I'm not sure how the design diagrams may impact the recommended wordage).
I wonder if you've thought about storing the slice in the BatchCursor, as opposed to pushing it down to this level. That way the logic would only exist once, as opposed to once for every CursorValues? https://github.com/apache/arrow-datafusion/pull/7842/files#diff-1eb407bf7bf6bcc6115e66c6a0197bba129c90328b9d6cd00a5e1eb5e74daa85R103
It could be a stupid idea though...
We should probably confirm this doesn't regress the current sort performance, e.g. by introducing an additional memory indirection when comparing cursors, but otherwise if @alamb is happy lets just move forwards with this
Run with the application of this slicing code (a single composable merge node) branch here.
gcp c3d-standard-8-lssd debian-11
There are further confirmation steps, as well as hypotheses, as to why this^^ is found. I'll post updates after.
I think we are waiting on some more measurement confirmation that this PR doesn't regress performance. Marking it as draft as we do that so we don't accidentally merge this
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
I think we should close this @alamb , since it's not a priority at the moment. And whenever we circle back, there will be a very large diff (due to changes). Ok to close?
Always good with me to close PRs (we can always reopen them later if needed)
Thanks @wiedld