Add UCXX support
Add support for UCXX. It is our intention to soon switch from UCX-Py to UCXX and archive the former.
This PR adds support for UCXX on the C++ backend but retains the original UCX implementation (based on the UCP layer) for now. Moving to UCXX will simplify RAFT code a bit, since the UCXX implementation requires fewer lines of boilerplate code.
On the Python front, raft-dask tests are added for both UCX-Py (for which there previously were none) and UCXX. The UCX-Py tests continue to use the UCX (UCP layer) implementation, whereas the UCXX tests use the UCXX C++ implementation.
When the switch is complete we can remove all previous UCX/UCX-Py code from the RAFT codebase. If for some reason the UCX (UCP layer) implementation is preferred on the C++ backend over the UCXX C++ implementation, that remains possible, but the UCX-Py code will be archived and dropped in favor of the UCXX Python backend.
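To make the testing setup concrete, here is a minimal sketch (not the actual raft-dask test code) of driving RAFT's comms over a UCX-protocol Dask cluster; how the UCXX backend gets selected instead of UCX-Py is assumed to be handled by this PR's changes and is not shown here.

```python
# Illustrative sketch only: spin up a UCX-protocol Dask-CUDA cluster and
# initialize raft-dask's Comms on top of it. The knob for choosing the
# UCXX backend over UCX-Py is intentionally left out of this sketch.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

from raft_dask.common import Comms

if __name__ == "__main__":
    # Workers communicate with each other over UCX transports.
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)

    # comms_p2p=True enables point-to-point (UCX) endpoints in addition
    # to the NCCL collectives used for clique communication.
    cb = Comms(comms_p2p=True, client=client)
    cb.init()

    # ... run distributed RAFT work against this comms session here ...

    cb.destroy()
    client.close()
    cluster.close()
```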
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
@pentschev are you ready for this to be merged by 24.02?
I'm hoping for that, but we still need UCXX wheels, which is still ongoing work, so I can't say yet whether they'll be ready. On the UCXX side this PR is working as expected locally, so if we're able to get wheels built I think we won't have much trouble getting this in.
It looks like ucx's config is unfortunately not idempotent. We'll need to implement a workaround for that. Otherwise everything is looking solid though, pending tests of raft-dask wheels to see how well ucxx works at runtime.
@vyasr Can you elaborate in layman's terms on what that means in this context? Is this something we can fix in UCXX, or is that not possible given the need for/use of specific properties in UCXX?
Thanks so much for the help in pushing this forward!
I'm not yet sure how to fix this in UCXX, will think on it this morning. There's probably some way it can be done. The longer-term fix really ought to be in UCX itself.
The failure in conda Python builds is fixable in UCXX's CMake.
Since we got blocked on releasing the UCXX wheel for 24.02, this can't make that release either; retargeting to 24.04.
OK, I was able to fix the issues by modifying UCXX. We'll eventually need to undo those fixes once UCX is patched (see https://github.com/rapidsai/ucxx/issues/173), but until then what we have here should work.
Thanks a lot Vyas! Those changes probably won't make their way in until UCX 1.17.x, maybe 1.16.x, but I'll have to check. I'll keep an eye on that and revert once we move to a newer UCX.
@cjnolet is it too late to review this for 24.04? We'll retarget for 24.06, but we'd still like your review before merging.
I only reviewed the CMake and versioning/dependency updates. They seem fine to me, but there are pending conflicts to resolve in ci/release/update-version.sh and python/raft-dask/CMakeLists.txt. This should probably be retargeted to 24.06, since the 24.04 release is happening this week, so I'll hold off on approving until the versions are changed.
Thanks @bdice , branch was retargeted, conflicts are resolved and versions updated. 🙂
The CI failures are not problems with the PR but rather recent changes in rapids-dask-dependency/dask-cuda. We're discussing how to better address it (short-term probably means adding dask[dataframe] as a dask-cuda dependency), and we should have a solution to make tests pass by tomorrow.
I think these changes look fine but the real telling piece will be if cuML and cugraph both pass CI with the change.
Agreed; once this PR is merged, https://github.com/rapidsai/cuml/pull/5697 can also be merged so we can get a better overview of how things are looking.
The error is now gone and all tests are passing again, so this should be good for a review. @divyegala mentioned he would do that; there's no absolutely immediate pressure, but it would be good to have this merged some time this week or next week at the latest.
/merge