datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat(7181): provide slicing of CursorValues

Open wiedld opened this issue 2 years ago • 7 comments

Which issue does this PR close?

Part of #7181 .

Rationale for this change

In the cascaded merge tree, partially sorted record batches are yielded per merge node. Therefore, we need the ability to yield a partial of these representative values (a.k.a. the CursorValues).

What changes are included in this PR?

Slicing the CursorValues. Add tests to demonstrate use with Cursor::new(sliced_cursor_values). Have test coverage include different data types.

Are these changes tested?

Yes, tests are added.

Are there any user-facing changes?

No. CursorValues are currently an internal construct.

wiedld avatar Oct 24 '23 02:10 wiedld

I plan to review this tomorrow morning

alamb avatar Oct 25 '23 22:10 alamb

For any questions on the big picture design, referred to the diagrams here.

Note that^^ is a drafted followup PR. I've tried to incorporate some of the naming/wordage changes suggested, but other names have not changed yet (since I'm not sure how the design diagrams may impact the recommended wordage).

wiedld avatar Oct 26 '23 15:10 wiedld

I wonder if you've thought about storing the slice in the BatchCursor, as opposed to pushing it down to this level. That way the logic would only exist once, as opposed to once for every CursorValues? https://github.com/apache/arrow-datafusion/pull/7842/files#diff-1eb407bf7bf6bcc6115e66c6a0197bba129c90328b9d6cd00a5e1eb5e74daa85R103

It could be a stupid idea though...

tustvold avatar Oct 26 '23 16:10 tustvold

We should probably confirm this doesn't regress the current sort performance, e.g. by introducing an additional memory indirection when comparing cursors, but otherwise if @alamb is happy lets just move forwards with this

tustvold avatar Oct 26 '23 16:10 tustvold

Run with the application of this slicing code (a single composable merge node) branch here.

gcp c3d-standard-8-lssd debian-11

Screenshot 2023-10-27 at 8 14 50 PM

There are further confirmation steps, as well as hypotheses, as to why this^^ is found. I'll post updates after.

wiedld avatar Oct 28 '23 03:10 wiedld

I think we are waiting on some more measurement confirmation that this PR doesn't regress performance. Marking it as draft as we do that so we don't accidentally merge this

alamb avatar Nov 01 '23 19:11 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Apr 25 '24 01:04 github-actions[bot]

I think we should close this @alamb , since it's not a priority at the moment. And whenever we circle back, there will be a very large diff (due to changes). Ok to close?

wiedld avatar Apr 29 '24 15:04 wiedld

Always good with me to close PRs (we can always reopen them later if needed)

Thanks @wiedld

alamb avatar Apr 30 '24 13:04 alamb