Implement distributed sorting for ``cudf_polars``
This implements distributed sorting for cudf-polars. It should work, but it is still missing new sorting tests, and I am also not quite sure that the local tests exercised the rapidsmpf paths properly.
Right now, it is structured around introducing a `ShuffleSorted` IR, but maybe it would be nicer to merge the steps further.
It does some smaller refactors to the shuffling to re-use some code there (but without actually moving too much sorting-related logic into that file).
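For readers less familiar with shuffle-based sorting, here is a minimal, illustrative sketch of the overall shape (plain Python lists, not the PR's actual code; `distributed_sort` and every name in it are made up for this description): each partition sorts locally, split points are picked from a sample of the keys, rows are exchanged so that partition *i* ends up holding the *i*-th key range, and each partition then combines the chunks it received.

```python
import bisect
import random


def distributed_sort(partitions: list[list[int]]) -> list[list[int]]:
    """Sample-based distributed sort over ``len(partitions)`` partitions (illustrative only)."""
    n = len(partitions)
    local = [sorted(p) for p in partitions]          # 1. sort each partition locally
    flat = [x for p in local for x in p]
    if not flat:
        return [[] for _ in partitions]
    # 2. pick n - 1 split points from a sample; they define each output's key range
    samples = sorted(random.sample(flat, k=min(8 * n, len(flat))))
    splits = [samples[(i * len(samples)) // n] for i in range(1, n)]
    # 3. "shuffle": route every row of each locally sorted run to its destination
    outbox = [[[] for _ in range(n)] for _ in range(n)]  # outbox[src][dst]
    for src, chunk in enumerate(local):
        for x in chunk:
            outbox[src][bisect.bisect_left(splits, x)].append(x)
    # 4. combine what each destination received; concatenate + re-sort stands in
    #    for the merge discussed in the follow-up notes below
    return [
        sorted([x for src in range(n) for x in outbox[src][dst]])
        for dst in range(n)
    ]
```

For example, `distributed_sort([[5, 1], [9, 2], [7, 3]])` returns per-partition lists whose concatenation is globally sorted.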
Main missing things:
- [ ] The biggest thing missing for sure is new tests, which may flush out a bug or two.
- As a follow-up, maybe: I use `concat` in both paths (`rapidsmpf` and not) for merging the exchanged chunks. If the sort isn't stable, then this should use `plc...merge`; that will require changing `rapidsmpf` a bit more (see the combine-step sketch after this list).
- I have not modified `zlice` handling (top/bottom limits). For small values there is probably little point, but for large result slices this needs to be threaded in.
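To make the combine-step follow-up concrete, here is a hedged sketch of the two options for combining the exchanged, already-sorted chunks on one output partition. The helper name is made up, and the exact pylibcudf signatures (`plc.merge.merge`, `plc.concatenate.concatenate`) are my assumption of the bindings being referred to, not code from this PR:

```python
import pylibcudf as plc


def combine_exchanged_chunks(
    chunks: list[plc.Table],
    key_indices: list[int],
    column_order: list[plc.types.Order],
    null_precedence: list[plc.types.NullOrder],
    use_merge: bool,
) -> plc.Table:
    """Combine the pre-sorted chunks received for one output partition."""
    if use_merge:
        # k-way merge of already-sorted tables (the ``plc...merge`` option
        # mentioned above), assumed to mirror libcudf's ``cudf::merge``.
        return plc.merge.merge(chunks, key_indices, column_order, null_precedence)
    # What both paths currently do per the description above: concatenate the
    # received chunks.
    return plc.concatenate.concatenate(chunks)
```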
Moved to 25.08
@seberg @rjzamora will this resolve https://github.com/rapidsai/cudf/issues/18527?
@seberg - Just FYI: I addressed a few merge conflicts here to keep the work from getting stale.
Sorry for being slow to review this. FWIW, I think the PR is ready (or very close).
Do you think there is unnecessary risk in merging this into 25.08 if it passes through review quickly, or does early 25.10 sound better?
Thanks for taking a look! I am pretty confident it isn't returning bad results, and I don't think there is a giant user base yet either way. I suppose that is to say: I am fine either way, put it in now, or be on the safe side and just wait for branching (and by the time it's shipped, a few parts may be optimized a bit more).
/ok to test
/ok to test
@Matt711, there was an error processing your request: E1
See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 1e2b60083637e5f2bf028a3f089dc46d3a253eea
/ok to test c343adc
/ok to test 06690966ee6690a16991824f31fe5ccde17d1941
/ok to test f26910bade46d77df8ca7ae5511ce43090d7ac71
/ok to test 0fd6db3429fbb96e6a61a2c1a278c662da0002f3
/ok to test 165726b
/ok to test 88ba934
/ok to test e6c3ecdfcc61d1221509f502a5bd8eda985699e5
/ok to test 8a117c8
/ok to test 3053c65
/ok to test 5687414
Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with branch-25.10 (so no obvious problems).
Is this PR waiting on a review at this point?
> Is this PR waiting on a review at this point?

It would be good to get another review from @Matt711 since he asked for changes.
> Is this PR waiting on a review at this point?
>
> It would be good to get another review from @Matt711 since he asked for changes.
enqueued
> Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with `branch-25.10` (so no obvious problems).
Do any of the PDS-H queries currently (without this PR) fall back ("Sort does not support multiple partitions.") at SF-3000?
/ok to test 23d5cda
/ok to test 7e89c87
> Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with `branch-25.10` (so no obvious problems).
>
> Do any of the PDS-H queries currently (without this PR) fall back ("Sort does not support multiple partitions.") at SF-3000?
@Matt711 - Sorry I missed this earlier. Query 13 does hit the fall-back logic at sf3k. I don't think any other queries will use the multi-partition sort. I can do one last test on an 8xH100 cluster before we merge this.
> Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with `branch-25.10` (so no obvious problems).
>
> Do any of the PDS-H queries currently (without this PR) fall back ("Sort does not support multiple partitions.") at SF-3000?
>
> @Matt711 - Sorry I missed this earlier. Query 13 does hit the fall-back logic at sf3k. I don't think any other queries will use the multi-partition sort. I can do one last test on an 8xH100 cluster before we merge this.
I did too 😅. Thanks!
/merge
Thanks @seberg! (Sorry for taking so long to review)