ARROW-14656: [Python] Add sort helper function for Array, ChunkedArray and StructArray
This is the first PR for https://github.com/apache/arrow/pull/11659 (splitting up this PR into multiple PRs). This PR adds a sort helper function for the array, chunked_array and array of structs (StructArray) and chunked array of structs sorting.
A NestedValuesComparator is added which is planned to be a common comparator for anything that has nested fields and columns. Follow up PRs will update the Table, RecordBatch etc sorting implementations to use this NestedValuesComparator.
https://issues.apache.org/jira/browse/ARROW-14656
Thanks @Jedi18 for attempting this. At a high level, two things:
-
Can you first tackle just the C++ side and write tests on that side?
Follow up PRs will update the Table, RecordBatch etc sorting implementations to use this NestedValuesComparator.
Did you actually measure performance? I expect this to be slower that the current RecordBatch and Table sorting.
@Jedi18 do you have time to update this?
@jorisvandenbossche so sorry about the late reply, I missed the github emails and hadn't checked this PR in a long while. Yes I'll address the review comments in the next few days when I get time.
@pitrou Sorry for the late reply, just saw the comments now
Can you first tackle just the C++ side and write tests on that side?
Sure I'll do that first. There was no straightforward way to test it on the C++ side hence the python tests. I'll look into it a bit more and see how I can test it in the unit tests.
Did you actually measure performance? I expect this to be slower that the current RecordBatch and Table sorting.
I did not, and yeah that's one of the reasons for wanting seperate PRs for those so I can at the very least mirror the current implementation and performance. Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?
Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?
Yes, this would be better. Otherwise I would suggest directly reusing the current Table/Batch sorting for Struct array sorting (the one thing missing being handling the top-level null bitmap).
For more context, the original Table/Batch sort was based on something similar to NestedValuesComparator, and I significantly improved its performance by switching to per-column sorting for cases with few keys.
You can look at the source code references for RadixRecordBatchSorter and MultipleKeyRecordBatchSorter.
Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current implementation?
Yes, this would be better. Otherwise I would suggest directly reusing the current Table/Batch sorting for Struct array sorting (the one thing missing being handling the top-level null bitmap).
To avoid mixing too many concerns in a single activity, I'd like to propose we focus this task on providing the capability and then tackle performance improvements and benchmarking into a dedicated activity. I agree we don't want to introduce any slowdown, but as far as we are talking about adding the capability to something that wasn't supported, I think we can live for a certain amount of time with having struct fields sorting slower than table sorting while we work on performance improvements for it.
This has been a feature under work for months at this point and I think it would provide immediate value to users while giving us the time to work on refining and improving it.
Obviously any work at migrating Tables and RecordBatches to the NestedValueComparator (or whatever will come) should be subordinate to ensuring the comparator is on pair with the current implementation.
I disagree about introducing non-trivial C++ code that is redundant with existing code and probably sub-performant.
I agree with @pitrou that it seems better to first check if the implementation we already have isn't actually better to reuse. Otherwise the time spent on finalizing and reviewing a new implementation might turn out to have been unnecessary.
As far as I understand, we already have a sort implementation that can handle nested (struct-like) data, which is used for Table/RecordBatch, and which could also be used for StructArray (which additionally needs handling of a top-level validity bitmap). So I think we can indeed focus first on providing the capability by first reusing whatever we already have, and then in subsequent tasks we can still further see if a new implementation could provide a speed-up for certain cases or not.
To be clear, the current multi-column sorting facilities don't handle recursive nesting, only the top level of nesting. To handle recursive nesting, you need to flatten the columns. This can be done using StructArray::Flatten.
Another thing to do add is support for a top-level null bitmap. But that should be relatively easy. The most tedious part of all this is probably the moderate refactoring required.
But I agree with quickly benchmarking the approach in this PR compared to the existing facility. It should be easy: 1) generate a RecordBatch (respectively Table); 2) convert it to a Struct Array (respectively ChunkedArray); 3) benchmark sorting one and the other using the same sorting parameters.
@benibus I know you are planning to work on the sorting, are you going to take over this PR or start fresh? My understanding is that we rejected the approach in this PR because of probable performance regressions. So if you plan to start fresh we can probably close this one
@amol- Yeah, the plan is to open a new PR with a different approach. I'll go ahead and close this.