
IndexScalarQuantizer w/ QT_fp16 hangs at search - Sharded Multi GPU - FlatIP

Open SohamTamba opened this issue 2 years ago • 3 comments

Summary

I am trying to use a flat inner-product index in half precision, sharded across 3 GPUs. The code hangs at search time; the same setup in full precision does not hang.

Platform

OS: Ubuntu 16

Faiss version: 1.7.2

Installed from: Anaconda

Faiss compilation options:

Running on:

  • [ ] CPU
  • [x] GPU

Interface:

  • [ ] C++
  • [x] Python

Reproduction instructions

Full Precision - Success

I am able to successfully shard a dataset of 25M x 768 vectors across 3 V100 GPUs. The dataset is ~77GB (25M x 768 x 4 bytes), and each GPU has 32GB of memory, 96GB in total. This is accomplished with the following code:

import faiss

top_k = 300

# A_index_25M: (25M, 768) float32 embeddings; A_query_10K: (10K, 768) float32 queries
index = faiss.IndexFlatIP(768)
co = faiss.GpuMultipleClonerOptions()
co.shard = True  # split the dataset across the GPUs instead of replicating it
index = faiss.index_cpu_to_all_gpus(index, co, ngpu=3)
index.add(A_index_25M)

index.search(A_query_10K, top_k)
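For reference, the expected memory footprint works out like this (plain arithmetic, matching the ~77GB figure above):

n, d = 25_000_000, 768
fp32_total = n * d * 4    # bytes: ~76.8 GB across all shards
per_gpu = fp32_total / 3  # ~25.6 GB per V100, under the 32 GB limit
print(f"{fp32_total / 1e9:.1f} GB total, {per_gpu / 1e9:.1f} GB per GPU")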

Half Precision - Fails

Now I want to scale to a 50M x 768 array, so I tried half precision on the 25M x 768 embeddings first, but the code hangs during search (the search does not complete even after 10x the time taken by the full-precision run):

import faiss

top_k = 300

index = faiss.IndexScalarQuantizer(768, faiss.ScalarQuantizer.QT_fp16,
                                   faiss.METRIC_INNER_PRODUCT)
co = faiss.GpuMultipleClonerOptions()
co.shard = True
index = faiss.index_cpu_to_all_gpus(index, co, ngpu=3)
index.add(A_index_25M)

index.search(A_query_10K, top_k)  # code hangs here

Also, when running the full-precision code, GPU memory usage increases by ~25GB. But while running the half-precision code, GPU memory usage increases by only ~5MB (almost nothing).
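At fp16, the same 25M x 768 dataset should still occupy about 38GB (25M x 768 x 2 bytes), roughly 13GB per GPU, so a ~5MB increase suggests the vectors never reached the GPUs at all. One way to check where the shards actually live (a sketch, assuming the sharded wrapper exposes count()/at() the way IndexShards does in recent faiss builds):

for i in range(index.count()):
    shard = faiss.downcast_index(index.at(i))
    print(type(shard))  # GpuIndex* classes if cloning worked; a CPU class means fallback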


Am I using the half-precision index as intended? cc: @mdouze

SohamTamba avatar Oct 17 '22 04:10 SohamTamba

On a side note, why do we use 32-bit floats by default? We only care about the ordering of the inner products, and I'd be surprised if using 16 bits instead of 32 made a noticeable difference in that ordering.
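A quick illustration of that intuition on random data (this simulates fp16 storage with fp32 accumulation, which is roughly what an fp16 index does; it is not the actual embeddings from this issue):

import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((100_000, 128)).astype(np.float32)
xq = rng.standard_normal((10, 128)).astype(np.float32)

# Round the stored vectors to fp16, then compute inner products in fp32.
xb16 = xb.astype(np.float16).astype(np.float32)
ip32 = xq @ xb.T
ip16 = xq @ xb16.T

k = 100
top32 = np.argsort(-ip32, axis=1)[:, :k]
top16 = np.argsort(-ip16, axis=1)[:, :k]
overlap = np.mean([len(set(a) & set(b)) / k for a, b in zip(top32, top16)])
print(f"mean top-{k} overlap: {overlap:.3f}")  # typically close to 1.0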

SohamTamba avatar Oct 17 '22 04:10 SohamTamba

So I assume that the code hangs in add, not in search. Does it also hang if you add 25M / 3 embeddings on a single GPU?
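A sketch of that single-GPU test, reusing the names from the snippets above (whether this index type actually lands on the GPU is part of what is in question here):

import faiss

cpu_index = faiss.IndexScalarQuantizer(768, faiss.ScalarQuantizer.QT_fp16,
                                       faiss.METRIC_INNER_PRODUCT)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # device 0, no sharding
gpu_index.add(A_index_25M[: A_index_25M.shape[0] // 3])  # one shard's worth of data
D, I = gpu_index.search(A_query_10K, top_k)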

mdouze avatar Oct 17 '22 14:10 mdouze

@mdouze Good afternoon! I faced the same problem as Soham; here are code snippets with my short comments.

The first example uses faiss.IndexScalarQuantizer. In this case the index does not seem to be copied to the GPUs, so all computation happens on the CPU: I monitored GPU utilization and the GPUs simply idle. To clarify, the search appears to hang because of the index size; the larger the index, the longer the CPU computation takes. Initially I tried to build the index on 12 million embeddings and nothing happened; when I decreased the index to 100k embeddings, everything worked, except that the GPUs were still not used. It looks like index_cpu_to_all_gpus simply ignores the cloner options here.

    # fp16 scalar-quantizer index as described above (constructor call
    # reconstructed; the original snippet omitted it)
    cpu_index = faiss.IndexScalarQuantizer(args.embeds_dim,
                                           faiss.ScalarQuantizer.QT_fp16,
                                           faiss.METRIC_INNER_PRODUCT)
    cloner_options = faiss.GpuMultipleClonerOptions()
    cloner_options.shard = True
    num_gpus = faiss.get_num_gpus()
    logging.info(f"Num GPUs: {num_gpus}")
    index = faiss.index_cpu_to_all_gpus(cpu_index, cloner_options, ngpu=num_gpus)
    index.add(index_buff)
    ...
    for q_idx, chunk in enumerate(queries_chunks):
        chunk = chunk.squeeze()
        q_names_chunk = query_name_chunks[q_idx]
        dist, nns = index.search(chunk, args.top_k)  # hangs here

So I tried fp32, as is used by default. In this case everything works fine: all GPUs are utilized and the neighbor search finishes quickly.

    cpu_index = faiss.IndexFlatIP(args.embeds_dim)
    cloner_options = faiss.GpuMultipleClonerOptions()
    cloner_options.shard = True
    num_gpus = faiss.get_num_gpus()
    logging.info(f"Num GPUs: {num_gpus}")
    index = faiss.index_cpu_to_all_gpus(cpu_index, cloner_options, ngpu=num_gpus)
    index.add(index_buff)
    ...
    for q_idx, chunk in enumerate(queries_chunks):
        chunk = chunk.squeeze()
        q_names_chunk = query_name_chunks[q_idx]
        dist, nns = index.search(chunk, args.top_k)

So my question is: is it possible to compute inner products in half precision using a sharded index?
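One possibly relevant option (a sketch, not verified on this setup): the GPU cloner options expose a useFloat16 flag, which stores a flat index's vectors in fp16 on the GPUs while keeping the IndexFlatIP code path, instead of going through IndexScalarQuantizer:

    cpu_index = faiss.IndexFlatIP(args.embeds_dim)
    cloner_options = faiss.GpuMultipleClonerOptions()
    cloner_options.shard = True
    cloner_options.useFloat16 = True  # store the sharded vectors as fp16 on the GPUs
    index = faiss.index_cpu_to_all_gpus(cpu_index, cloner_options,
                                        ngpu=faiss.get_num_gpus())
    index.add(index_buff)
    dist, nns = index.search(chunk, args.top_k)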

P.S. Here is a short summary of my data:

  • embedding_dim = 512
  • number_of_embeddings ~ 12 million
  • number_of_gpus = 4
  • gpu_model = Quadro RTX 8000 (around 45 GB of video memory)
  • number_of_CPUs = 36
  • RAM = 240 GB

dmasny99 avatar Jan 12 '24 09:01 dmasny99