[BUG] cuVS Cagra Python API has low recall for inner product datasets
**Describe the bug**
When testing the cuVS Python Cagra API on certain inner product datasets, I get a low recall value. I tested the following ANN datasets with k = 100:
| Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
|---|---|---|---|---|---|---|---|
| coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 98.6% | 75.5% |
| FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 11.9% | 82.1% |
| marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 51.4% | 93.1% |
| cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 12.7% | 82.6% |
I've also added the recall I get when I use the FAISS Cagra Python API. Both the cuVS and FAISS tests used an intermediate graph degree of 64 and a graph degree of 32.
Except for coherev2-dbpedia, all datasets gave significantly lower recall with the cuVS Python API than with FAISS. I have not seen this issue with the L2 datasets I've tested.
**Steps/Code to reproduce bug**
These are the steps to reproduce the issue with the cuVS Python API:
- On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
- Server must have `git` and `docker` installed
- Server must have NVIDIA developer tools installed, such as `nvidia-smi` and `nvidia-container-toolkit`
- `cd` into the `cuvs_benchmarks` folder, and create a temp directory to store the logs:

```sh
mkdir ./benchmarks_files
chmod 777 ./benchmarks_files
```
- Build the docker image:

```sh
docker build -t <your_image_name> .
```

- Run the image:

```sh
docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>
```
The cuVS Cagra API is called in this function: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L303. The relevant code snippet looks like this:

```python
import logging
import time

import cupy as cp
from cuvs.neighbors import cagra

# downloadDataSetForWorkload, prepare_indexing_dataset, prepare_search_dataset,
# and recall_at_r are helpers defined elsewhere in cuvs_benchmarks/main.py.
logging.info(f"Running for workload {workload['dataset_name']}")
file = downloadDataSetForWorkload(workload)
d, xb, ids = prepare_indexing_dataset(file, workload["normalize"])
index_params = cagra.IndexParams(
    intermediate_graph_degree=64,
    graph_degree=32,
    build_algo="ivf_pq",
    metric="inner_product",
)
index = cagra.build(index_params, xb)
d, xq, gt = prepare_search_dataset(file, workload["normalize"])
xq = cp.asarray(xq)
search_params = cagra.SearchParams(itopk_size=200)
distances, neighbors = cagra.search(search_params, index, xq, 100)
logging.info("Search is done")
neighbors = cp.asnumpy(neighbors)
logging.info(f"Recall at k=100 is : {recall_at_r(neighbors, gt, 100, 100, len(xq))}")
logging.info("Sleeping for 5 seconds")
time.sleep(5)
```
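For readers without the repo handy, `recall_at_r` computes recall@k; a minimal computation along these lines (a hypothetical reconstruction for illustration, not the repo's exact code) looks like:

```python
import numpy as np

def recall_at_k(neighbors: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Average fraction of the true top-k neighbors recovered across all queries."""
    hits = sum(
        len(np.intersect1d(found[:k], expected[:k]))
        for found, expected in zip(neighbors, ground_truth)
    )
    return hits / (k * len(neighbors))
```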
**Expected behavior**
The recall should be > 80% for all of the datasets.
**Environment details (please complete the following information):**
- Environment location: AWS EC2 g5.2xlarge, with the Deep Learning Base AMI
- Type of GPU: `00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)`
- Method of RAFT install: conda, Docker
- cuVS and RAPIDS are installed in this line of the Dockerfile: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/Dockerfile#L5
**Additional context**
I've tried using a higher `intermediate_graph_degree`, `graph_degree`, and `itopk_size`, but this doesn't meaningfully improve the recall. For example, when I set `intermediate_graph_degree` and `graph_degree` to 128 and 64 respectively, with `itopk_size` at 200, the recall for cohere-768-ip was 12.8%. If I increased `itopk_size` to 500, the recall was 12.9%. A sketch of this sweep follows below.
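For reference, the sweep can be scripted against the same API; a minimal sketch reusing `xb` and `xq` from the snippet above (only the combinations mentioned in this report, none of which recovered recall):

```python
import cupy as cp
from cuvs.neighbors import cagra

# Parameter combinations from this report; none meaningfully improved
# inner-product recall on the affected datasets.
for igd, gd in [(64, 32), (128, 64)]:
    params = cagra.IndexParams(
        intermediate_graph_degree=igd,
        graph_degree=gd,
        build_algo="ivf_pq",
        metric="inner_product",
    )
    index = cagra.build(params, xb)
    for itopk in (200, 500):
        distances, neighbors = cagra.search(
            cagra.SearchParams(itopk_size=itopk), index, cp.asarray(xq), 100
        )
```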
@lowener
I've updated the issue with the recall I get when I use FAISS. Strangely, the recall for coherev2-dbpedia is higher for cuVS than for FAISS. I also consistently see FAISS give worse recall for inner product datasets than for the L2 datasets I've tested.
I have been able to reproduce the low recall when using cuVS, but I also get the same low recall through FAISS using the code left in a comment in the main. Even for coherev2-dbpedia I get similar results, around 98/99%, instead of one being significantly better than the other.
I will keep investigating and try other algorithms to better understand what's happening with those datasets.
Thanks @lowener. For reference, I used cuVS version 24.12 and I built FAISS off of this commit: https://github.com/facebookresearch/faiss/commit/df6a8f6b4e6ed4c509e52d1e015f87fd742c17df
Interestingly, I found that using `ivf_pq_build_params` instead of `ivf_pq_params` here: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L375 resulted in higher recall and faster index build times for the marco-tasb and cohere-768-ip datasets. I saw no difference in recall for coherev2-dbpedia.
| Dataset name | Recall with `ivf_pq_params` | Total build time with `ivf_pq_params` (s) | Recall with `ivf_pq_build_params` | Total build time with `ivf_pq_build_params` (s) |
|---|---|---|---|---|
| coherev2-dbpedia | 75.5% | 300.6 | 75.5% | 222 |
| marco-tasb | 93.1% | 128.19 | 95.3% | 101.3 |
| cohere-768-ip | 82.6% | 127.32 | 91.6% | 98.22 |
I hope this is another useful data point. I'm curious why the recall and build time are better with `ivf_pq_build_params`, though. It seems like `pq_dim` gets set to a smaller value with `ivf_pq_build_params`, but in that case I would expect the recall to get worse.
I was able to reproduce your FAISS results.
I found today that the problem is indeed with `ivf_pq_build_params`. The metric for IVF-PQ is not correctly initialized and always defaults to L2, and it is not yet customizable through the Python API. I tested a fix and was able to get cohere-768-ip to a recall over 80%, up from the previous 12%, so I will create a bugfix PR for that.
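For context, the standalone IVF-PQ module in the cuVS Python API does expose the metric directly. A minimal sketch of the build that CAGRA should effectively perform internally once the fix lands (using the public `cuvs.neighbors.ivf_pq` module purely for illustration; `xb` is the base dataset from the reproduction snippet):

```python
from cuvs.neighbors import ivf_pq

# The bug: CAGRA's internal IVF-PQ build silently used L2 regardless of the
# metric passed to cagra.IndexParams. An explicit, correctly parameterized
# IVF-PQ build would look like this:
build_params = ivf_pq.IndexParams(metric="inner_product")
ivf_pq_index = ivf_pq.build(build_params, xb)
```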
Thank you @lowener! Will this code change also improve the recall we see with FAISS for cohere-768-ip? The 82.6% recall we see with FAISS is still lower than what we get for L2 datasets, which give 95%+ recall with the same CAGRA parameters.
With the same parameters I get a recall at k=100 of 93%:

```python
IndexParams(intermediate_graph_degree=64, graph_degree=32, build_algo='ivf_pq', metric="inner_product")
SearchParams(itopk_size=200)
```
I am, however, encountering a problem with the FlickrImagesTextQueries dataset, so I am looking further into it.
| Dataset | Recall |
|---|---|
| coherev2-dbpedia | 75.3% |
| FlickrImagesTextQueries | 0.113% (77.0% if using FP32 LUT) |
| marco_tasb | 93.0% |
| cohere-768-ip | 88.2% |
Edit: This is due to the normalization of the data combined with the FP16 lookup table, which leads to imprecision and a loss of recall. I am adding a commit to the PR fixing that. With an FP32 LUT I can get a recall of 77.0%.
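If cuVS kept pylibraft's search-time parameter for the IVF-PQ lookup-table precision (an assumption; the parameter was named `lut_dtype` there), requesting the full-precision table would look roughly like this:

```python
import numpy as np
from cuvs.neighbors import ivf_pq

# Assumption: the lookup-table precision is exposed as `lut_dtype`, as in
# pylibraft. FP16 is the memory-saving default; FP32 avoids the precision
# loss seen on the normalized FlickrImagesTextQueries data.
search_params = ivf_pq.SearchParams(lut_dtype=np.float32)
```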
Hi @lowener, I built FAISS from source using cuVS 25.06. I verified the recall is now 91.9% for the cohere-768-ip dataset using faiss-cuvs - this is a big improvement from before! I'm currently testing the other datasets and will post the results here when I'm done.
Here are my new faiss-cuvs and cuVS results, with cuVS 25.06:
| Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
|---|---|---|---|---|---|---|---|
| coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 75.3% | 75.5% |
| FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 77.0% | 82.6% |
| marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 93.0% | 95.5% |
| cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 88.2% | 91.9% |
For FlickrImagesTextQueries, we actually don't need to normalize when reading the dataset. I switched the parameter here: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L318 to `False` and got better recall. The dataset is already normalized, so we don't need to do it again in the code (a quick check for this is sketched below). Sorry for the earlier confusion!
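A quick way to confirm a dataset is already unit-normalized before deciding whether to normalize again (plain NumPy; `xb` is the base vector array as in the snippets above):

```python
import numpy as np

# If the vectors are already unit length, every norm is ~1 and re-normalizing
# is a no-op at best (and can hurt via extra floating-point rounding).
norms = np.linalg.norm(xb, axis=1)
already_normalized = np.allclose(norms, 1.0, atol=1e-3)
print(f"min={norms.min():.6f} max={norms.max():.6f} normalized={already_normalized}")
```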
I think this issue can be resolved now. Thanks @lowener for fixing it!