Implement distributed sorting for ``cudf_polars``
This implements distributed sorting for cudf-polars. It should work, but it is still missing new sorting tests, and I am also not quite sure that the local tests exercised the rapidsmpf paths properly.
Right now, it is structured around introducing a `ShuffleSorted` IR, but maybe it would be nicer to merge the steps further.
It does some smaller refactors to the shuffling to re-use some code there (but without actually moving too much sorting-related logic into that file).
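For readers less familiar with shuffle-based sorting, here is a minimal, illustrative sketch of the overall shape (plain Python lists, not the PR's actual code; `distributed_sort` and every name in it are made up for this description): each partition sorts locally, split points are picked from a sample of the keys, rows are exchanged so that partition *i* ends up holding the *i*-th key range, and each partition then combines the chunks it received.

```python
import bisect
import random


def distributed_sort(partitions: list[list[int]]) -> list[list[int]]:
    """Sample-based distributed sort over ``len(partitions)`` partitions (illustrative only)."""
    n = len(partitions)
    local = [sorted(p) for p in partitions]          # 1. sort each partition locally
    flat = [x for p in local for x in p]
    if not flat:
        return [[] for _ in partitions]
    # 2. pick n - 1 split points from a sample; they define each output's key range
    samples = sorted(random.sample(flat, k=min(8 * n, len(flat))))
    splits = [samples[(i * len(samples)) // n] for i in range(1, n)]
    # 3. "shuffle": route every row of each locally sorted run to its destination
    outbox = [[[] for _ in range(n)] for _ in range(n)]  # outbox[src][dst]
    for src, chunk in enumerate(local):
        for x in chunk:
            outbox[src][bisect.bisect_left(splits, x)].append(x)
    # 4. combine what each destination received; concatenate + re-sort stands in
    #    for the merge discussed in the follow-up notes below
    return [
        sorted([x for src in range(n) for x in outbox[src][dst]])
        for dst in range(n)
    ]
```

For example, `distributed_sort([[5, 1], [9, 2], [7, 3]])` returns per-partition lists whose concatenation is globally sorted.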
Main missing things:
- [ ] The biggest thing missing for sure is new tests, which may flush out a bug or two.
- As a follow-up, maybe: I use `concat` in both paths (`rapidsmpf` and not) for merging the exchanged chunks. If the sort isn't stable, then this should use `plc...merge`; that will require changing `rapidsmpf` a bit more (see the combine-step sketch after this list).
- I have not modified `zlice` handling (top/bottom limits). For small values there is probably little point, but for large result slices this needs to be threaded in.
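To make the combine-step follow-up concrete, here is a hedged sketch of the two options for combining the exchanged, already-sorted chunks on one output partition. The helper name is made up, and the exact pylibcudf signatures (`plc.merge.merge`, `plc.concatenate.concatenate`) are my assumption of the bindings being referred to, not code from this PR:

```python
import pylibcudf as plc


def combine_exchanged_chunks(
    chunks: list[plc.Table],
    key_indices: list[int],
    column_order: list[plc.types.Order],
    null_precedence: list[plc.types.NullOrder],
    use_merge: bool,
) -> plc.Table:
    """Combine the pre-sorted chunks received for one output partition."""
    if use_merge:
        # k-way merge of already-sorted tables (the ``plc...merge`` option
        # mentioned above), assumed to mirror libcudf's ``cudf::merge``.
        return plc.merge.merge(chunks, key_indices, column_order, null_precedence)
    # What both paths currently do per the description above: concatenate the
    # received chunks.
    return plc.concatenate.concatenate(chunks)
```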
Moved to 25.08
@seberg @rjzamora will this resolve https://github.com/rapidsai/cudf/issues/18527?
@seberg - Just FYI: I addressed a few merge conflicts here to keep the work from getting stale.
Sorry for being slow to review this. FWIW, I think the PR is ready (or very close).
Do you think there is unnecessary risk in merging this into 25.08 if it passes through review quickly, or does early 25.10 sound better?
Thanks for taking a look! I am pretty confident it isn't returning bad results, and I don't think there is a giant user base yet either way. I suppose that is to say: I am fine either way, put it in now, or be on the safe side and just wait for branching (and by the time it's shipped, a few parts may be optimized a bit more).
/ok to test
/ok to test
@Matt711, there was an error processing your request: E1
See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 1e2b60083637e5f2bf028a3f089dc46d3a253eea
/ok to test c343adc
/ok to test 06690966ee6690a16991824f31fe5ccde17d1941
/ok to test f26910bade46d77df8ca7ae5511ce43090d7ac71
/ok to test 0fd6db3429fbb96e6a61a2c1a278c662da0002f3
/ok to test 165726b
/ok to test 88ba934
/ok to test e6c3ecdfcc61d1221509f502a5bd8eda985699e5
/ok to test 8a117c8
/ok to test 3053c65
/ok to test 5687414
Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with branch-25.10 (so no obvious problems).
Is this PR waiting on a review at this point?
> Is this PR waiting on a review at this point?

It would be good to get another review from @Matt711 since he asked for changes.
> Is this PR waiting on a review at this point?
>
> It would be good to get another review from @Matt711 since he asked for changes.
enqueued
> Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with `branch-25.10` (so no obvious problems).
Do any of the PDS-H queries currently (without this PR) fall back ("Sort does not support multiple partitions.") at SF-3000?
/ok to test 23d5cda
/ok to test 7e89c87
> Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with `branch-25.10` (so no obvious problems).
>
> Do any of the PDS-H queries currently (without this PR) fall back ("Sort does not support multiple partitions.") at SF-3000?
@Matt711 - Sorry I missed this earlier. Query 13 does hit the fall-back logic at sf3k. I don't think any other queries will use the multi-partition sort. I can do one last test on an 8xH100 cluster before we merge this.
> Small update: I used the latest version of this branch to run the PDS-H benchmark at SF-3000 on 8xH100 (5 iterations for each query). The run times were consistent with `branch-25.10` (so no obvious problems).
>
> Do any of the PDS-H queries currently (without this PR) fall back ("Sort does not support multiple partitions.") at SF-3000?
>
> @Matt711 - Sorry I missed this earlier. Query 13 does hit the fall-back logic at sf3k. I don't think any other queries will use the multi-partition sort. I can do one last test on an 8xH100 cluster before we merge this.
I did too 😅. Thanks!
/merge
Thanks @seberg! (Sorry for taking so long to review)