Migrate from RAFT to cuVS
Remove the dependency on `raft::compiled` and modify the GPU implementations to use the cuVS backend in place of RAFT.
A deeper look at the dependency:

FAISS gets its ANN algorithm implementations, such as IVF-Flat and IVF-PQ, from cuVS. RAFT is meant to be a lightweight, header-only C++ template library that cuVS relies on for the more fundamental / low-level utilities. Examples include RAFT's device `mdarray` and `mdspan` objects; the RAFT resources object (`raft::resources`), which takes care of the stream ordering of device functions; and linear algebra primitives such as map, reduce, and BLAS routines. Many cuVS functions take RAFT mdspan objects as arguments (for example, `raft::device_matrix_view`), so FAISS relies on both cuVS and RAFT: it gets the RAFT headers through cuVS and uses them to construct the function arguments that cuVS consumes.

Note that we are not explicitly linking FAISS against `raft::raft` or `raft::compiled`; only the required headers are included and compiled, rather than the whole RAFT shared library being built. This is why mentions of RAFT still appear in the FAISS codebase.
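To make the hand-off concrete, here is a minimal sketch (not code from this PR) of the pattern described above, assuming the cuVS 24.x C++ API; `build_ivf_flat` and the parameter values are illustrative:

```cpp
// Sketch: RAFT supplies the resources handle and the mdspan views;
// cuVS supplies the ANN implementation that consumes them.
#include <cstdint>
#include <cuvs/neighbors/ivf_flat.hpp>
#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/core/mdspan.hpp>

void build_ivf_flat(int64_t n_rows, int64_t dim) {
    raft::device_resources res; // RAFT handle: streams, workspace, stream ordering

    // RAFT mdarray owning device memory; dataset.view() yields a
    // raft::device_matrix_view, the argument type cuVS expects.
    auto dataset = raft::make_device_matrix<float, int64_t>(res, n_rows, dim);
    // ... copy training vectors into dataset ...

    cuvs::neighbors::ivf_flat::index_params params;
    params.n_lists = 1024; // illustrative value

    // cuVS consumes the RAFT view -- only RAFT headers are needed for this,
    // which is why RAFT identifiers still appear in FAISS.
    auto index = cuvs::neighbors::ivf_flat::build(
        res, params, raft::make_const_mdspan(dataset.view()));
    (void)index;
}
```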
Hi @tarang-jain! Thanks for the PR. I quickly skimmed through the PR and noticed there are still a lot of references to RAFT. Is the plan to completely remove all references to RAFT and replace them with cuVS? For example, try `grep -r -nH -i raft .` in your local checkout.
@asadoughi thanks for your comment. I have updated the PR description!
Hi @tarang-jain - it looks like the test at https://github.com/facebookresearch/faiss/actions/runs/10587022390/job/29337315242#step:3:2446 is failing with a core dump. Are you able to debug the issue?
@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Regarding the segfault, consider rebasing off of the latest main to see if #3806 has an effect.
@asadoughi I was able to resolve the seg fault. It was related to some bugs in the cuVS code paths of `bfKnn`.
@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
It looks like one test is still failing: `TestTorchUtilsGPU.test_train_add_with_ids`. Have you had a chance to troubleshoot?
@asadoughi yes, I have been looking into it. It is quite strange. The following order of operations gives erroneous results:

1. Create a GPU torch tensor and train a GPU IVF-Flat index with it.
2. Search the index, then reset it.
3. Train the GPU IVF-Flat index again with the same data, this time residing on the CPU.
4. Search the index again: the results are now wrong.

However, if both training tensors are on the same device (either CPU or GPU), there is no failure and it works fine.
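For reference, here is a C++ analog of that sequence (the failing test itself is the Python torch one, `TestTorchUtilsGPU.test_train_add_with_ids`; this sketch only restates the device-then-host training pattern, with illustrative sizes):

```cpp
#include <faiss/MetricType.h>
#include <faiss/gpu/GpuIndexIVFFlat.h>
#include <faiss/gpu/StandardGpuResources.h>

#include <cuda_runtime.h>

#include <random>
#include <vector>

int main() {
    int d = 32, nlist = 100; // illustrative sizes
    faiss::idx_t n = 10000;

    // The same training data, materialized on both host and device.
    std::vector<float> host(n * d);
    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist;
    for (auto& v : host) {
        v = dist(gen);
    }
    float* device = nullptr;
    cudaMalloc(&device, host.size() * sizeof(float));
    cudaMemcpy(device, host.data(), host.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    faiss::gpu::StandardGpuResources res;
    faiss::gpu::GpuIndexIVFFlat index(&res, d, nlist, faiss::METRIC_L2);

    index.train(n, device);      // 1) train from device memory
    // ... add, search, reset ...
    index.reset();               // 2) reset the index
    index.train(n, host.data()); // 3) retrain with the same data from host memory
    // ... searching after step 3 is where results diverged ...

    cudaFree(device);
    return 0;
}
```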
I tried addressing the current error message `Error while loading conda entry point: conda-libmamba-solver (module 'libmambapy' has no attribute 'QueryFormat')` in #3904 by pinning the libmamba and libmambapy installations to `<2`, but that led to another error with rapids-cmake. Have you seen this error with rapids-cmake before?
@asadoughi the error you pointed out is in the RAFT build. Not sure if it is relevant to this PR.
It looks like this PR is now hitting the same error in the latest run, failing on this line.
Yes, I asked Robert, our CMake expert, to look at this. I think he is busy with the 24.10 release deadlines, but I'm hoping he can give us a fix in the next few days.
Hi @tarang-jain - any update or ETA on the CMake issue? This is also causing failures on the main branch for FAISS (e.g. blocking PRs), so if we don't have an ETA for a fix, I am leaning towards disabling the RAFT CI on main; it can be reinstated with this cuVS PR. Thanks!
Hi @tarang-jain - any update or ETA on the CMake issue?
CMake maintainer here. My plan is to open a do-not-merge PR with the same content as this one, but which builds CMake from source from a specific commit. I'm going to use this to bisect which commit of CMake caused the regression.
I've opened https://github.com/facebookresearch/faiss/pull/3925. Please approve it so I can start the bisect in CI.
@asadoughi can the system glibc version be updated for running the cuVS GitHub Action? The CMake seg fault went away when I downgraded to CMake 3.26.4, but I now see this error: https://github.com/facebookresearch/faiss/actions/runs/11239339471/job/31246908895?pr=3549, which might indicate a glibc version mismatch.
AFAICT, this CI is running on `ubuntu-latest`, so you can try amending the shell commands in `build_cmake/action.yaml` to run `ldd --version` and the relevant `sudo apt-get -y install` command to install the specific glibc version.
@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@asadoughi @cjnolet, due to the segmentation fault in the CAGRA tests with cuVS 24.10, I have reverted to cuVS 24.08. I will continue to work on getting 24.10 in on the side, but it is crucial to get this PR merged quickly.
@tarang-jain Did you mean to push those latest 4 commits to this PR? If we're rolling forward with 24.08 we should probably go back to 3e056ed.
@asadoughi I have fixed the segmentation fault occurring with 24.10 in the CAGRA tests, so we are very close to getting cuVS 24.10 in. There are still some `std::bad_cast` errors in the torch CPU tests. If I am not able to fix those by EOD today, I will open a follow-up PR with 24.10 and revert this PR to https://github.com/facebookresearch/faiss/pull/3549/commits/3e056ed911e73c998794de0342f7662dc5db3c01.
I have downgraded cuVS to 24.08 again. I'm working on https://github.com/facebookresearch/faiss/pull/4021 to get 24.10 in after this PR is merged.
@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@asadoughi what are the Facebook-internal checks that are failing?
@asadoughi merged this pull request in facebookresearch/faiss@134922061ce6126cfca9eada5a6865c3e1806d74.