faiss icon indicating copy to clipboard operation
faiss copied to clipboard

Migrate from RAFT to CUVS

Open tarang-jain opened this issue 1 year ago • 20 comments

Remove the dependency on raft::compiled and modify GPU implementations to use cuVS backend in place of RAFT.

A deeper insight into the dependency: FAISS gets the ANN algorithm implementations such as IVF-Flat and IVF-PQ from cuVS. RAFT is meant to be a lightweight C++ header-only template library that cuVS relies on for the more fundamental / low-level utilities. Some examples of these are RAFT's device mdarray and mdspan objects; the RAFT resource object (raft::resource) that takes care of the stream ordering of device functions; linear algebra functions such as mapping, reduction, BLAS routines etc. A lot of the cuVS functions take the RAFT mdspan objects as arguments (for example raft::device_matrix_view). Therefore FAISS relies on both cuVS and RAFT. FAISS gets RAFT headers through cuVS and uses them to create the function arguments that can be consumed by cuVS. Note that we are not explicitly linking FAISS against raft::raft or raft::compiled. Only the required headers are included and compiled rather than compiling the whole RAFT shared library. This is the reason we still see mentions of raft in FAISS.

tarang-jain avatar Jun 25 '24 15:06 tarang-jain

Hi @tarang-jain! Thanks for the PR. I quickly skimmed through the PR and noticed there are still a lot of references to RAFT. Is the plan to completely remove all references to RAFT and replace with cuVS? For example, try grep -r -nH -i raft . in your local checkout.

asadoughi avatar Aug 06 '24 17:08 asadoughi

@asadoughi thanks for your comment. I have updated the PR description!

tarang-jain avatar Aug 20 '24 21:08 tarang-jain

Hi @tarang-jain - it looks like the test fails on https://github.com/facebookresearch/faiss/actions/runs/10587022390/job/29337315242#step:3:2446 related to a core dump. Are you able to debug the issue?

asadoughi avatar Aug 28 '24 20:08 asadoughi

@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 30 '24 19:08 facebook-github-bot

Regarding the segfault, consider rebasing off of the latest main to see if #3806 has an effect.

asadoughi avatar Aug 30 '24 23:08 asadoughi

@asadoughi I was able to resolve the seg fault. It was related to some bugs in the cuvs parts of bfKnn.

tarang-jain avatar Sep 05 '24 06:09 tarang-jain

@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Sep 16 '24 13:09 facebook-github-bot

It looks like one test is still failing TestTorchUtilsGPU.test_train_add_with_ids. Have you had a chance to troubleshoot?

asadoughi avatar Sep 30 '24 22:09 asadoughi

@asadoughi yes I have been looking into it. It is quite strange. The following order of operations gives erroneous results: create a GPU torch tensor --> train GPU IVF-Flat Index --> search the index --> reset the index --> train again with the same tensor but this time have the data on CPU --> train the GPU IVF-Flat index --> search the index.

However, if both the tensors are on the same device (either CPU or GPU), there is no failure and it works fine.

tarang-jain avatar Sep 30 '24 23:09 tarang-jain

I tried addressing the current error message Error while loading conda entry point: conda-libmamba-solver (module 'libmambapy' has no attribute 'QueryFormat') in #3904 by pinning the libmamba and libmambapy installations to <2, but that led to another error with rapids-cmake. Have you seen this error with rapids-cmake before?

asadoughi avatar Oct 02 '24 18:10 asadoughi

@asadoughi the error you pointed out is in the RAFT build. Not sure if it is relevant to this PR.

tarang-jain avatar Oct 02 '24 20:10 tarang-jain

It looks like this PR is getting the same error now as well in the latest run: failing on this line.

asadoughi avatar Oct 03 '24 22:10 asadoughi

Yes, I asked Robert, our CMake expert to look at this. I think he is busy with the 24.10 release deadlines, but I'm hoping he can give us a fix in the next few days.

tarang-jain avatar Oct 04 '24 01:10 tarang-jain

Hi @tarang-jain - any update or ETA on the CMake issue? This is also causing failures on the main branch for FAISS (e.g. blocking PRs) so if we don't have an ETA to a fix I am leaning towards disabling the RAFT CI on main and it can be re-instated with this cuVS PR. Thanks!

asadoughi avatar Oct 07 '24 18:10 asadoughi

Hi @tarang-jain - any update or ETA on the CMake issue?

CMake maintainer here. My plan is to open a do-not-merge PR with the same content as this one, but which builds CMake from source from a specific commit. I'm going to use this to bisect which commit of CMake caused the regression.

KyleFromNVIDIA avatar Oct 08 '24 14:10 KyleFromNVIDIA

I've opened https://github.com/facebookresearch/faiss/pull/3925. Please approve it so I can start the bisect in CI.

KyleFromNVIDIA avatar Oct 08 '24 15:10 KyleFromNVIDIA

@asadoughi can the system glibc version be updated for running the cuvs github action? The cmake seg fault went away when I downgrade to cmake 3.26.4, but I see this error https://github.com/facebookresearch/faiss/actions/runs/11239339471/job/31246908895?pr=3549 which might mean a libc version mismatch.

tarang-jain avatar Oct 08 '24 19:10 tarang-jain

AFAICT, this CI is running on ubuntu-latest so you can try amending the shell commands in build_cmake/action.yaml to run ldd --version and the relevant sudo apt-get -y install command to install the specific glibc version.

asadoughi avatar Oct 08 '24 22:10 asadoughi

@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Oct 13 '24 19:10 facebook-github-bot

@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Oct 16 '24 00:10 facebook-github-bot

@asadoughi @cjnolet, due to the segmentation fault in the cagra tests with cuvs 24.10, I have reverted to cuVS 24.08. I will continue to work on getting 24.10 in on the side. But it is crucial to get this PR merged quickly.

tarang-jain avatar Nov 05 '24 06:11 tarang-jain

@tarang-jain Did you mean to push those latest 4 commits to this PR? If we're rolling forward with 24.08 we should probably go back to 3e056ed.

asadoughi avatar Nov 06 '24 18:11 asadoughi

@asadoughi I have fixed the segmentation fault occurring with 24.10 in the CAGRA tests and so we are very close to getting cuVS 24.10 in. There are still some std::bad_cast in the torch CPU tests. If I am not able to fix those by EOD today, I will open a follow-up PR with 24.10 and revert this PR to https://github.com/facebookresearch/faiss/pull/3549/commits/3e056ed911e73c998794de0342f7662dc5db3c01

tarang-jain avatar Nov 07 '24 10:11 tarang-jain

I have downgraded cuVS to 24.08 again. I'm working on https://github.com/facebookresearch/faiss/pull/4021 to get 24.10 in after this PR is merged.

tarang-jain avatar Nov 07 '24 13:11 tarang-jain

@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Nov 07 '24 17:11 facebook-github-bot

@asadoughi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Nov 13 '24 23:11 facebook-github-bot

@asadoughi what are the facebook internal checks that are failing?

tarang-jain avatar Nov 14 '24 04:11 tarang-jain

@asadoughi merged this pull request in facebookresearch/faiss@134922061ce6126cfca9eada5a6865c3e1806d74.

facebook-github-bot avatar Nov 14 '24 19:11 facebook-github-bot