
[BUG] cuVS Cagra Python API has low recall for inner product datasets

Open rchitale7 opened this issue 8 months ago • 7 comments

Describe the bug

When testing the cuVS Python Cagra API for certain inner product datasets, I get a low recall value. I tested the following ANN datasets with k = 100:

| Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
| --- | --- | --- | --- | --- | --- | --- | --- |
| coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 98.6% | 75.5% |
| FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 11.9% | 82.1% |
| marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 51.4% | 93.1% |
| cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 12.7% | 82.6% |

I've also added the recall I get when I use the FAISS Cagra Python API. Both the cuVS and FAISS tests used 64 as the intermediate graph degree and 32 as the graph degree.

Except for coherev2-dbpedia, all datasets gave a significantly lower recall value for cuVS Python API compared with FAISS. I have not seen this issue with the L2 datasets I've tested with.

Steps/Code to reproduce bug

These are the steps to reproduce the issue with the cuVS Python API:

  1. On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
    • The server must have git and docker installed
    • The server must have NVIDIA developer tools installed, such as nvidia-smi and nvidia-container-toolkit
  2. cd into the cuvs_benchmarks folder and create a temp directory to store the logs:

     mkdir ./benchmarks_files
     chmod 777 ./benchmarks_files

  3. Build the docker image:

     docker build -t <your_image_name> .

  4. Run the image:

     docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>

The cuVS Cagra API is called in this function: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L303. The relevant code snippet looks like this:

        logging.info(f"Running for workload {workload['dataset_name']}")
        file = downloadDataSetForWorkload(workload)
        d, xb, ids = prepare_indexing_dataset(file, workload["normalize"])
        index_params = cagra.IndexParams(intermediate_graph_degree=64,graph_degree=32,build_algo='ivf_pq', metric="inner_product")

        index = cagra.build(index_params, xb)

        d, xq, gt = prepare_search_dataset(file, workload["normalize"])

        xq = cp.asarray(xq)

        search_params = cagra.SearchParams(itopk_size = 200)
        distances, neighbors = cagra.search(search_params, index, xq, 100)

        logging.info("Search is done")
        neighbors = cp.asnumpy(neighbors)

        logging.info(f"Recall at k=100 is : {recall_at_r(neighbors, gt, 100, 100, len(xq))}")
        logging.info("Sleeping for 5 seconds")
        time.sleep(5)
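
For reference, the recall computation in the last step can be reproduced in a few lines of NumPy. This is a minimal sketch of recall@k; the actual recall_at_r helper lives in main.py and may differ in its argument handling:

    import numpy as np

    def recall_at_k(neighbors, ground_truth, k, num_queries):
        # neighbors: (num_queries, k) ids returned by the ANN index
        # ground_truth: (num_queries, >= k) exact ids from brute-force search
        hits = 0
        for i in range(num_queries):
            hits += len(np.intersect1d(neighbors[i, :k], ground_truth[i, :k]))
        return hits / (num_queries * k)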

Expected behavior

The recall should be > 80% for all of the datasets.

Environment details (please complete the following information):

  • Environment location: AWS EC2 g5.2xlarge, with Deep Learning Base AMI.
    • Type of GPU: 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
  • Method of RAFT install: conda, Docker
    • cuVS and rapids are installed in this line of the Dockerfile: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/Dockerfile#L5

Additional context

I've tried using a higher intermediate_graph_degree, graph_degree, and itopk_size, but this doesn't really improve the recall. For example, when I set the intermediate_graph_degree and graph_degree to 128 and 64 respectively, and set the itopk_size to 200, the recall for cohere-768-ip was 12.8%. If I increased itopk_size to 500, the recall was 12.9%.

rchitale7 avatar Apr 24 '25 21:04 rchitale7

@lowener

I've updated the issue with the recall I get when I use FAISS. Strangely, the recall for coherev2-dbpedia is higher for cuVS compared to FAISS. I also consistently see FAISS give worse recall for inner product datasets compared to the L2 datasets I've tested with.

rchitale7 avatar Apr 25 '25 23:04 rchitale7

I have been able to reproduce the low recall when using cuVS, but I also get the same low recall through FAISS using the code left in a comment in main.py. Even for coherev2-dbpedia I get similar results, around 98/99%, instead of one being significantly better than the other. I will keep investigating and try other algorithms to get a better understanding of what's happening with those datasets.

lowener avatar Apr 29 '25 17:04 lowener

Thanks @lowener. For reference, I used cuvs version 24.12 and I built faiss off of this commit: https://github.com/facebookresearch/faiss/commit/df6a8f6b4e6ed4c509e52d1e015f87fd742c17df

rchitale7 avatar Apr 29 '25 21:04 rchitale7

Interestingly, I found that using ivf_pq_build_params instead of ivf_pq_params here: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L375 resulted in higher recall and faster index build times for the marco-tasb and cohere-768-ip datasets. I saw no difference in recall for coherev2-dbpedia.

| Dataset name | Recall with ivf_pq_params | Total build time with ivf_pq_params (s) | Recall with ivf_pq_build_params | Total build time with ivf_pq_build_params (s) |
| --- | --- | --- | --- | --- |
| coherev2-dbpedia | 75.5% | 300.6 | 75.5% | 222 |
| marco-tasb | 93.1% | 128.19 | 95.3% | 101.3 |
| cohere-768-ip | 82.6% | 127.32 | 91.6% | 98.22 |

I hope this is another useful data point. I'm curious why the recall and build time would be better with ivf_pq_build_params, though. It seems like pq_dim gets set to a smaller value with ivf_pq_build_params, but then I would expect the recall to get worse.

rchitale7 avatar Apr 30 '25 15:04 rchitale7

I was able to reproduce your FAISS results. I found today that the problem is indeed with ivf_pq_build_params. The metric for ivf_pq is not correctly initialized and always defaults to L2, and it is currently not customizable through the Python API. I tested a fix and was able to get cohere-768-ip to a recall over 80%, up from the previous 12%, so I will create a bugfix PR for that.
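
As an independent sanity check on the metric handling, one can build a standalone IVF-PQ index with the inner-product metric through cuvs.neighbors.ivf_pq and compare its recall against the CAGRA ivf_pq build path. A rough sketch, where the n_lists/n_probes values are illustrative assumptions rather than tuned settings:

    import cupy as cp
    from cuvs.neighbors import ivf_pq

    # xb: base vectors, xq: queries (float32). Here the metric is passed
    # explicitly, unlike the CAGRA ivf_pq build path before the fix.
    build_params = ivf_pq.IndexParams(n_lists=1024, metric="inner_product")
    index = ivf_pq.build(build_params, cp.asarray(xb))

    search_params = ivf_pq.SearchParams(n_probes=64)
    distances, neighbors = ivf_pq.search(search_params, index, cp.asarray(xq), 100)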

lowener avatar Apr 30 '25 16:04 lowener

Thank you @lowener! Will this code change also improve the recall we see with FAISS for cohere-768-ip? The 82.6% recall we see with FAISS is still lower than the recall we get for L2 datasets, which give 95%+ with the same CAGRA parameters.

rchitale7 avatar Apr 30 '25 17:04 rchitale7

With the same parameters, I get a recall at k=100 of 93%:

IndexParams(intermediate_graph_degree=64,graph_degree=32,build_algo='ivf_pq', metric="inner_product")
SearchParams(itopk_size = 200)

I am however encountering a problem with the FlickrImagesTextQueries dataset so I am looking further into it.

| Dataset | Recall |
| --- | --- |
| coherev2-dbpedia | 75.3% |
| FlickrImagesTextQueries | 0.113% (77.0% if using FP32 LUT) |
| marco_tasb | 93.0% |
| cohere-768-ip | 88.2% |

Edit: This is due to the normalization of the data combined with the FP16 lookup table, which leads to imprecision and a loss of recall. I am adding a commit to the PR to fix that. With an FP32 lookup table I get a recall of 77.0%.
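
For the standalone IVF-PQ API, the lookup-table precision is controlled by the lut_dtype search parameter (with internal_distance_dtype for the accumulator). The CAGRA build path configures these internally, so this is only an illustration of the knob in question, assuming the cuVS Python API follows the RAFT parameter names:

    import numpy as np
    from cuvs.neighbors import ivf_pq

    # Use a full-precision lookup table to avoid the FP16 precision loss
    # observed on normalized data.
    search_params = ivf_pq.SearchParams(n_probes=64,  # illustrative value
                                        lut_dtype=np.float32,
                                        internal_distance_dtype=np.float32)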

lowener avatar May 02 '25 17:05 lowener

Hi @lowener, I built faiss from source using cuvs 25.06. I verified the recall is now 91.9% for the cohere-768 IP dataset using faiss-cuvs - this is a big improvement from before! I'm currently testing the other datasets and will post the results here when I'm done.

rchitale7 avatar May 15 '25 16:05 rchitale7

Here are my new faiss-cuvs and cuvs results, with cuvs-25.06:

| Dataset name | Location | Space type | Dimensions | Documents | Normalized | Recall with cuVS API | Recall with FAISS API |
| --- | --- | --- | --- | --- | --- | --- | --- |
| coherev2-dbpedia | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/coherev2-dbpedia.hdf5?download=true | inner product | 4096 | 450,000 | No | 75.3% | 75.5% |
| FlickrImagesTextQueries | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/FlickrImagesTextQueries.hdf5?download=true | inner product | 512 | 1,831,403 | Yes | 77.0% | 82.6% |
| marco-tasb | https://huggingface.co/datasets/navneet1v/datasets/resolve/main/marco_tasb.hdf5?download=true | inner product | 768 | 1,000,000 | No | 93.0% | 95.5% |
| cohere-768-ip | https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings/documents-1m.hdf5.bz2 | inner product | 768 | 1,000,000 | No | 88.2% | 91.9% |

For FlickrImagesTextQueries, we actually don't need to normalize the dataset after reading it. I switched the parameter here: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L318 to False, and got better recall. Basically, the dataset is already normalized, so we don't need to do it again in the code. Sorry for the earlier confusion!
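
A quick way to catch this kind of double normalization is to check the vector norms before deciding whether to normalize. A minimal NumPy sketch (hypothetical helper, not part of the repo):

    import numpy as np

    def is_normalized(x, tol=1e-3):
        # Vectors are already unit length if every L2 norm is close to 1.
        norms = np.linalg.norm(x, axis=1)
        return bool(np.all(np.abs(norms - 1.0) < tol))

    # e.g., only normalize when the dataset is not already normalized:
    # normalize = not is_normalized(xb)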

I think this issue can be resolved now. Thanks @lowener for fixing it!

rchitale7 avatar May 15 '25 23:05 rchitale7