[BUG] cuVS Cagra Python API has low recall for inner product datasets
**Describe the bug**
When testing the cuVS Python Cagra API on certain inner product datasets, I get a low recall value. I tested the following ANN datasets with k = 100:
| Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
|---|---|---|---|---|---|---|---|
| coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 98.6% | 75.5% |
| FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 11.9% | 82.1% |
| marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 51.4% | 93.1% |
| cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 12.7% | 82.6% |
I've also added the recall I get when I use the FAISS Cagra Python API. Both the cuVS and FAISS tests used an intermediate graph degree of 64 and a graph degree of 32.
Except for coherev2-dbpedia, all datasets gave significantly lower recall with the cuVS Python API than with FAISS. I have not seen this issue with the L2 datasets I've tested.
**Steps/Code to reproduce bug**
These are the steps to reproduce the issue with the cuVS Python API:
- On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
- Server must have `git` and `docker` installed
- Server must have NVIDIA developer tools installed, such as `nvidia-smi` and `nvidia-container-toolkit`
- `cd` into the `cuvs_benchmarks` folder, and create a temp directory to store the logs:

```sh
mkdir ./benchmarks_files
chmod 777 ./benchmarks_files
```
- Build the docker image:

```sh
docker build -t <your_image_name> .
```

- Run the image:

```sh
docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>
```
The cuVS Cagra API is called in this function: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L303. The relevant code snippet looks like this:

```python
import logging
import time

import cupy as cp
from cuvs.neighbors import cagra

# downloadDataSetForWorkload, prepare_indexing_dataset, prepare_search_dataset,
# and recall_at_r are helpers defined elsewhere in cuvs_benchmarks/main.py.
logging.info(f"Running for workload {workload['dataset_name']}")
file = downloadDataSetForWorkload(workload)
d, xb, ids = prepare_indexing_dataset(file, workload["normalize"])
index_params = cagra.IndexParams(
    intermediate_graph_degree=64,
    graph_degree=32,
    build_algo="ivf_pq",
    metric="inner_product",
)
index = cagra.build(index_params, xb)
d, xq, gt = prepare_search_dataset(file, workload["normalize"])
xq = cp.asarray(xq)
search_params = cagra.SearchParams(itopk_size=200)
distances, neighbors = cagra.search(search_params, index, xq, 100)
logging.info("Search is done")
neighbors = cp.asnumpy(neighbors)
logging.info(f"Recall at k=100 is : {recall_at_r(neighbors, gt, 100, 100, len(xq))}")
logging.info("Sleeping for 5 seconds")
time.sleep(5)
```
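For readers without the repo handy, `recall_at_r` computes recall@k; a minimal computation along these lines (a hypothetical reconstruction for illustration, not the repo's exact code) looks like:

```python
import numpy as np

def recall_at_k(neighbors: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Average fraction of the true top-k neighbors recovered across all queries."""
    hits = sum(
        len(np.intersect1d(found[:k], expected[:k]))
        for found, expected in zip(neighbors, ground_truth)
    )
    return hits / (k * len(neighbors))
```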
**Expected behavior**
The recall should be > 80% for all of the datasets.
**Environment details (please complete the following information):**
- Environment location: AWS EC2 g5.2xlarge, with the Deep Learning Base AMI
- Type of GPU: `00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)`
- Method of RAFT install: conda, Docker
- cuVS and RAPIDS are installed in this line of the Dockerfile: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/Dockerfile#L5
**Additional context**
I've tried using a higher `intermediate_graph_degree`, `graph_degree`, and `itopk_size`, but this doesn't meaningfully improve the recall. For example, when I set `intermediate_graph_degree` and `graph_degree` to 128 and 64 respectively, with `itopk_size` at 200, the recall for cohere-768-ip was 12.8%. If I increased `itopk_size` to 500, the recall was 12.9%. A sketch of this sweep follows below.
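For reference, the sweep can be scripted against the same API; a minimal sketch reusing `xb` and `xq` from the snippet above (only the combinations mentioned in this report, none of which recovered recall):

```python
import cupy as cp
from cuvs.neighbors import cagra

# Parameter combinations from this report; none meaningfully improved
# inner-product recall on the affected datasets.
for igd, gd in [(64, 32), (128, 64)]:
    params = cagra.IndexParams(
        intermediate_graph_degree=igd,
        graph_degree=gd,
        build_algo="ivf_pq",
        metric="inner_product",
    )
    index = cagra.build(params, xb)
    for itopk in (200, 500):
        distances, neighbors = cagra.search(
            cagra.SearchParams(itopk_size=itopk), index, cp.asarray(xq), 100
        )
```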
@lowener
I've updated the issue with the recall I get when I use FAISS. Strangely, the recall for coherev2-dbpedia is higher for cuVS than for FAISS. I also consistently see FAISS give worse recall for inner product datasets than for the L2 datasets I've tested.
I have been able to reproduce the low recall when using cuVS, but I also get the same low recall through FAISS using the code left in a comment in the main. Even for coherev2-dbpedia I get similar results, around 98/99%, instead of one being significantly better than the other.
I will keep investigating and try other algorithms to better understand what's happening with those datasets.
Thanks @lowener. For reference, I used cuVS version 24.12 and I built FAISS off of this commit: https://github.com/facebookresearch/faiss/commit/df6a8f6b4e6ed4c509e52d1e015f87fd742c17df
Interestingly, I found that using `ivf_pq_build_params` instead of `ivf_pq_params` here: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L375 resulted in higher recall and faster index build times for the marco-tasb and cohere-768-ip datasets. I saw no difference in recall for coherev2-dbpedia.
| Dataset name | Recall with `ivf_pq_params` | Total build time with `ivf_pq_params` (s) | Recall with `ivf_pq_build_params` | Total build time with `ivf_pq_build_params` (s) |
|---|---|---|---|---|
| coherev2-dbpedia | 75.5% | 300.6 | 75.5% | 222 |
| marco-tasb | 93.1% | 128.19 | 95.3% | 101.3 |
| cohere-768-ip | 82.6% | 127.32 | 91.6% | 98.22 |
I hope this is another useful data point. I'm curious why the recall and build time are better with `ivf_pq_build_params`, though. It seems like `pq_dim` gets set to a smaller value with `ivf_pq_build_params`, but in that case I would expect the recall to get worse.
I was able to reproduce your FAISS results.
I found today that the problem is indeed with `ivf_pq_build_params`. The metric for IVF-PQ is not correctly initialized and always defaults to L2, and it is not yet customizable through the Python API. I tested a fix and was able to get cohere-768-ip to a recall over 80%, up from the previous 12%, so I will create a bugfix PR for that.
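For context, the standalone IVF-PQ module in the cuVS Python API does expose the metric directly. A minimal sketch of the build that CAGRA should effectively perform internally once the fix lands (using the public `cuvs.neighbors.ivf_pq` module purely for illustration; `xb` is the base dataset from the reproduction snippet):

```python
from cuvs.neighbors import ivf_pq

# The bug: CAGRA's internal IVF-PQ build silently used L2 regardless of the
# metric passed to cagra.IndexParams. An explicit, correctly parameterized
# IVF-PQ build would look like this:
build_params = ivf_pq.IndexParams(metric="inner_product")
ivf_pq_index = ivf_pq.build(build_params, xb)
```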
Thank you @lowener! Will this code change also improve the recall we see with FAISS for cohere-768-ip? The 82.6% recall we see with FAISS is still lower than what we get for L2 datasets, which give 95%+ recall with the same CAGRA parameters.
With the same parameters I get a recall at k=100 of 93%:

```python
IndexParams(intermediate_graph_degree=64, graph_degree=32, build_algo='ivf_pq', metric="inner_product")
SearchParams(itopk_size=200)
```
I am, however, encountering a problem with the FlickrImagesTextQueries dataset, so I am looking further into it.
| Dataset | Recall |
|---|---|
| coherev2-dbpedia | 75.3% |
| FlickrImagesTextQueries | 0.113% (77.0% if using FP32 LUT) |
| marco_tasb | 93.0% |
| cohere-768-ip | 88.2% |
Edit: This is due to the normalization of the data combined with the FP16 lookup table, which leads to imprecision and a loss of recall. I am adding a commit to the PR fixing that. With an FP32 LUT I can get a recall of 77.0%.
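If cuVS kept pylibraft's search-time parameter for the IVF-PQ lookup-table precision (an assumption; the parameter was named `lut_dtype` there), requesting the full-precision table would look roughly like this:

```python
import numpy as np
from cuvs.neighbors import ivf_pq

# Assumption: the lookup-table precision is exposed as `lut_dtype`, as in
# pylibraft. FP16 is the memory-saving default; FP32 avoids the precision
# loss seen on the normalized FlickrImagesTextQueries data.
search_params = ivf_pq.SearchParams(lut_dtype=np.float32)
```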
Hi @lowener, I built FAISS from source using cuVS 25.06. I verified the recall is now 91.9% for the cohere-768-ip dataset using faiss-cuvs - this is a big improvement from before! I'm currently testing the other datasets and will post the results here when I'm done.
Here are my new faiss-cuvs and cuVS results, with cuVS 25.06:
| Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
|---|---|---|---|---|---|---|---|
| coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 75.3% | 75.5% |
| FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 77.0% | 82.6% |
| marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 93.0% | 95.5% |
| cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 88.2% | 91.9% |
For FlickrImagesTextQueries, we actually don't need to normalize when reading the dataset. I switched the parameter here: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L318 to `False` and got better recall. The dataset is already normalized, so we don't need to do it again in the code (a quick check for this is sketched below). Sorry for the earlier confusion!
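A quick way to confirm a dataset is already unit-normalized before deciding whether to normalize again (plain NumPy; `xb` is the base vector array as in the snippets above):

```python
import numpy as np

# If the vectors are already unit length, every norm is ~1 and re-normalizing
# is a no-op at best (and can hurt via extra floating-point rounding).
norms = np.linalg.norm(xb, axis=1)
already_normalized = np.allclose(norms, 1.0, atol=1e-3)
print(f"min={norms.min():.6f} max={norms.max():.6f} normalized={already_normalized}")
```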
I think this issue can be resolved now. Thanks @lowener for fixing it!