IndexScalarQuantizer w/ QT_fp16 hangs at search - Sharded Multi GPU - FlatIP
Summary
I am trying to use a Flat Inner Product Index with half-precision, sharded across 3 GPUs. The code hangs at search time but it does not hang while using full precision.
Platform
OS: Ubuntu 16
Faiss version: 1.7.2
Installed from: Anaconda
Faiss compilation options:
Running on:
- [ ] CPU
- [x] GPU
Interface:
- [ ] C++
- [x] Python
Reproduction instructions
Full Precision - Success
I am able to successfully shard a 25M x 768 dataset across 3 V100 GPUs. The dataset is ~77GB (25M x 768 x 4 bytes) and each GPU has 32GB of memory, 96GB in total. This is accomplished with the following code:
import faiss

top_k = 300
index = faiss.IndexFlatIP(768)  # exact inner-product index, fp32 storage
co = faiss.GpuMultipleClonerOptions()
co.shard = True  # split the data across GPUs rather than replicating it
index = faiss.index_cpu_to_all_gpus(index, co, ngpu=3)
index.add(A_index_25M)  # 25M x 768 float32 embeddings
index.search(A_query_10K, top_k)
Half Precision - Fails
Now I want to scale to a 50M x 768 array. I tried using half precision on the same 25M x 768 embeddings, but the code hangs during search (search does not complete even after 10x the time taken by the full-precision run):
top_k = 300
index = faiss.IndexScalarQuantizer(768, faiss.ScalarQuantizer.QT_fp16, faiss.METRIC_INNER_PRODUCT)
co = faiss.GpuMultipleClonerOptions()
co.shard = True
index = faiss.index_cpu_to_all_gpus(index, co, ngpu=3)
index.add(A_index_25M)
index.search(A_query_10K, top_k) # Code hangs here
Also, when running the full-precision code, GPU memory usage increases by 25GB. But while running the half-precision code, memory usage increases by only ~5MB (almost nothing).
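For reference, a back-of-the-envelope estimate of the expected resident sizes (assuming the 25M x 768 vectors are split over 3 shards):

# Expected index sizes, assuming 25M x 768 vectors split over 3 shards
n, d, ngpu = 25_000_000, 768, 3

fp32_total = n * d * 4 / 1e9  # IndexFlatIP stores float32
fp16_total = n * d * 2 / 1e9  # QT_fp16 stores float16 codes

print(f"fp32: {fp32_total:.1f} GB total, {fp32_total / ngpu:.1f} GB per shard")
print(f"fp16: {fp16_total:.1f} GB total, {fp16_total / ngpu:.1f} GB per shard")
# fp32: 76.8 GB total, 25.6 GB per shard
# fp16: 38.4 GB total, 12.8 GB per shard  <- nowhere near the ~5MB observed above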
Am I using the Half Precision Index as it was intended? cc: @mdouze
On a side note, why do we use 32-bit floats by default? We only care about the ordering of the inner products, and I'd be surprised if using 16 bits instead of 32 made a noticeable difference in the ordering.
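That intuition can be sanity-checked outside faiss by simulating fp16 storage in NumPy: round the database to float16, decode back to float32, and compare top-k results against exact fp32 (small illustrative sizes here, not the 25M dataset):

import numpy as np

# Simulate fp16 *storage*: round the database to float16, decode back to
# float32, then compute inner products in float32 and compare the top-k.
rng = np.random.default_rng(0)
xb = rng.standard_normal((100_000, 128), dtype=np.float32)  # database
xq = rng.standard_normal((100, 128), dtype=np.float32)      # queries
k = 100

ip_exact = xq @ xb.T
ip_fp16 = xq @ xb.astype(np.float16).astype(np.float32).T

top_exact = np.argsort(-ip_exact, axis=1)[:, :k]
top_fp16 = np.argsort(-ip_fp16, axis=1)[:, :k]

overlap = np.mean([len(set(a) & set(b)) for a, b in zip(top_exact, top_fp16)]) / k
print(f"mean top-{k} overlap after fp16 rounding: {overlap:.3f}")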
So I assume that the code hangs in add, not search. Does it also hang if you add 25M / 3 embeddings on a single GPU?
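Something like this minimal sketch (reusing A_index_25M / A_query_10K from above, and assuming GPU 0 is free) would show which step is slow:

import time
import faiss

d = 768
cpu_index = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_fp16,
                                       faiss.METRIC_INNER_PRODUCT)

res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # single GPU, device 0
print(type(index))  # worth checking: does the clone actually become a GPU index?

t0 = time.time()
index.add(A_index_25M[: A_index_25M.shape[0] // 3])  # roughly one shard's worth
print(f"add:    {time.time() - t0:.1f} s")

t0 = time.time()
dist, nns = index.search(A_query_10K[:100], 300)  # small query batch first
print(f"search: {time.time() - t0:.1f} s")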
@mdouze Good afternoon! I faced the same problem as Soham; here is a code snippet with my short comments.
The first example uses faiss.IndexScalarQuantizer. It seems that in this case the index is not copied to the GPUs, so all computations are done on the CPU: I monitored GPU utilization and the GPUs simply idle. To clarify, the neighbor search appears to hang because of the index size; the larger the index, the longer the CPU computation takes. Initially I tried to build the index on 12 million embeddings and nothing happened; when I decreased the index size to 100k embeddings everything was fine, except that the GPUs were not used during the computation.
For me it seems that index_cpu_to_all_gpus simply ignores cloner_options here.
# QT_fp16 scalar quantizer with inner-product metric (per the description above)
cpu_index = faiss.IndexScalarQuantizer(args.embeds_dim, faiss.ScalarQuantizer.QT_fp16, faiss.METRIC_INNER_PRODUCT)
cloner_options = faiss.GpuMultipleClonerOptions()
cloner_options.shard = True
num_gpus = faiss.get_num_gpus()
logging.info(f"Num GPUs: {num_gpus}")
index = faiss.index_cpu_to_all_gpus(cpu_index, cloner_options, ngpu=num_gpus)
index.add(index_buff)
...
for q_idx, chunk in enumerate(queries_chunks):
    chunk = chunk.squeeze()
    q_names_chunk = query_name_chunks[q_idx]
    dist, nns = index.search(chunk, args.top_k)  # hangs here
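One way to check where the cloned index actually lives is to inspect the object returned by index_cpu_to_all_gpus. A rough sketch, assuming the result is either the original CPU class (fallback) or a shards/replicas wrapper whose sub-indexes can be inspected with faiss.downcast_index and the at()/count() accessors:

# If the cloner supported the index type, the result should wrap GpuIndex*
# sub-indexes; if it fell back, it is a plain CPU index of the original class.
index = faiss.index_cpu_to_all_gpus(cpu_index, cloner_options, ngpu=num_gpus)
print(type(index))
if hasattr(index, "count"):  # IndexShards / IndexReplicas wrapper
    for i in range(index.count()):
        print(i, type(faiss.downcast_index(index.at(i))))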
So I tried fp32, as is used by default. In this case everything works fine: all GPUs are utilized and the neighbor search is really fast.
cpu_index = faiss.IndexFlatIP(args.embeds_dim)
cloner_options = faiss.GpuMultipleClonerOptions()
cloner_options.shard = True
num_gpus = faiss.get_num_gpus()
logging.info(f"Num GPUs: {num_gpus}")
index = faiss.index_cpu_to_all_gpus(cpu_index, cloner_options, ngpu=num_gpus)
index.add(index_buff)
...
for q_idx, chunk in enumerate(queries_chunks):
    chunk = chunk.squeeze()
    q_names_chunk = query_name_chunks[q_idx]
    dist, nns = index.search(chunk, args.top_k)
So my question is: is it possible to compute the inner product in half precision using a sharded index?
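One possibly relevant option, not confirmed in this thread: GpuMultipleClonerOptions exposes a useFloat16 flag; if it applies to Flat indexes during cloning, a plain IndexFlatIP could be sharded with fp16 storage on the GPU side. A rough sketch:

import faiss

d = 512  # embeds_dim from the summary below
cpu_index = faiss.IndexFlatIP(d)  # plain flat index; fp16 requested at clone time

co = faiss.GpuMultipleClonerOptions()
co.shard = True
co.useFloat16 = True  # assumption: enables fp16 storage for Flat indexes on GPU

index = faiss.index_cpu_to_all_gpus(cpu_index, co, ngpu=faiss.get_num_gpus())
index.add(index_buff)                        # index_buff as in the snippets above
dist, nns = index.search(chunk, args.top_k)  # same query loop as above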
P.S. Here is a short summary of my data:
- embedding_dim = 512
- number_of_embeddings ~ 12 million
- number_of_gpus = 4
- gpu_model = Quadro RTX 8000 (around 45 GB of video memory)
- number_of_CPUs = 36
- RAM = 240 GB