raft icon indicating copy to clipboard operation
raft copied to clipboard

Add UCXX support

Open pentschev opened this issue 1 year ago • 16 comments

Add support for UCXX. It is our intention to soon switch from UCX-Py to UCXX and archive the former.

This PR adds support for UCXX on the C++ backend but retains the original UCX implementation for now (based on the UCP layer), moving to UCXX will simplify RAFT code a bit given the UCXX implementation requires fewer lines of boilerplate code.

On the Python front raft-dask tests are added for both UCX-Py (which there weren't any) and UCXX. The UCX-Py tests continue to use the UCX (UCP layer) implementation, whereas the UCXX tests use the UCXX C++ implementation.

When the switch is complete we can remove all previous UCX/UCX-Py code from the RAFT codebase. If for some reason using the UCX (UCP layer) is preferred on the C++ backend instead of the UCXX C++ implementation this is possible, but UCX-Py code will be archived and dropped in favor of the UCXX Python backend.

pentschev avatar Nov 09 '23 22:11 pentschev

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Dec 12 '23 19:12 copy-pr-bot[bot]

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Dec 12 '23 19:12 copy-pr-bot[bot]

@pentschev are you ready for this to be merged by 24.02?

cjnolet avatar Jan 17 '24 15:01 cjnolet

@pentschev are you ready for this to be merged by 24.02?

I'm hoping for that but we still need UCXX wheels which is still ongoing work. I can't say whether they'll be ready yet. On the UCXX side this PR is working as expected locally so if we're able to get wheels built I think we won't have much trouble getting this in.

pentschev avatar Jan 17 '24 16:01 pentschev

It looks like ucx's config is unfortunately not idempotent. We'll need to implement a workaround for that. Otherwise everything is looking solid though, pending tests of raft-dask wheels to see how well ucxx works at runtime.

vyasr avatar Jan 19 '24 05:01 vyasr

It looks like ucx's config is unfortunately not idempotent. We'll need to implement a workaround for that. Otherwise everything is looking solid though, pending tests of raft-dask wheels to see how well ucxx works at runtime.

@vyasr Can you elaborate in layman's terms what that means in this context? Is this something we can fix in UCXX or not possible given the need/use of specific properties in UCXX?

Thanks so much for the help in pushing this forward!

pentschev avatar Jan 19 '24 08:01 pentschev

I'm not yet sure how to fix this in UCXX, will think on it this morning. There's probably some way it can be done. The longer-term fix really ought to be in UCX itself.

The failure in conda Python builds is fixable in UCXX's CMake.

vyasr avatar Jan 19 '24 14:01 vyasr

Since we got blocked on releasing UCXX wheel for 24.02 this cannot make it either, retargeting to 24.04.

pentschev avatar Jan 20 '24 15:01 pentschev

OK, I was able to fix the issues by modifying UCXX. We'll eventually need to undo those fixes once UCX is patched (see https://github.com/rapidsai/ucxx/issues/173), but until then what we have here should work.

vyasr avatar Jan 23 '24 20:01 vyasr

OK, I was able to fix the issues by modifying UCXX. We'll eventually need to undo those fixes once UCX is patched (see rapidsai/ucxx#173), but until then what we have here should work.

Thanks a lot Vyas! Those changes probably won't make their way in until UCX 1.17.x, maybe 1.16.x but I'll have to check. So I'll keep an eye on that and revert when we move to newer UCX.

pentschev avatar Jan 24 '24 09:01 pentschev

@cjnolet is it too late to review this for 24.04 ? We'll retarget for 24.06 but we'd still like your review before merging in

quasiben avatar Mar 13 '24 12:03 quasiben

I only reviewed the CMake and versioning/dependency updates. They seem fine to me, but there are pending conflicts to resolve in ci/release/update-version.sh and python/raft-dask/CMakeLists.txt. This should probably be retargeted to 24.06, since the 24.04 release is happening this week, so I'll hold off on approving until the versions are changed.

Thanks @bdice , branch was retargeted, conflicts are resolved and versions updated. 🙂

pentschev avatar Apr 08 '24 15:04 pentschev

The CI failures are not problems with the PR but rather recent changes in rapids-dask-dependency/dask-cuda. We're discussing how to better address it (short-term probably means adding dask[dataframe] as a dask-cuda dependency) and we should have a solution to make tests pass by tomorrow.

pentschev avatar Apr 08 '24 21:04 pentschev

I think these changes look fine but the real telling piece will be if cuML and cugraph both pass CI with the change.

cjnolet avatar Apr 09 '24 01:04 cjnolet

I think these changes look fine but the real telling piece will be if cuML and cugraph both pass CI with the change.

Agreed, once this PR is merged then https://github.com/rapidsai/cuml/pull/5697 can also be merged so we can have a better overview of how things are looking.

pentschev avatar Apr 09 '24 07:04 pentschev

The error is now gone and all tests are passing again, so should be good for a review. @divyegala mentioned he would do that, no absolutely immediate pressure but would be good if we could have this merged some time this week or next the latest.

pentschev avatar Apr 09 '24 16:04 pentschev

/merge

vyasr avatar May 06 '24 22:05 vyasr