
[FEA] Binary IVF Flat Index

Open · tarang-jain opened this issue 5 months ago · 1 comment

Depends on https://github.com/rapidsai/raft/pull/2770

Implementation of a binary IVF-Flat index (the BitwiseHamming metric for the IVF-Flat index).

Key Features

1. Binary Index Structure

  • Added a binary_centers_ field to store cluster centers as packed uint8_t arrays for binary data
  • The index automatically detects the BitwiseHamming metric and configures itself for binary operation
  • Only uint8_t inputs are supported with BitwiseHamming, and only single instantiations of the newly added kernels are compiled (see the usage sketch below)
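
A minimal usage sketch of how the binary mode is selected, assuming the cuvs::neighbors::ivf_flat C++ API shape on this branch; the surrounding function and names are illustrative, not test code from the PR:

```cpp
// Minimal sketch: building a binary IVF-Flat index by selecting BitwiseHamming.
// Assumes the cuvs::neighbors::ivf_flat C++ API; illustrative only.
#include <cuvs/distance/distance.hpp>
#include <cuvs/neighbors/ivf_flat.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/core/device_resources.hpp>

void build_binary_index(raft::device_resources const& res,
                        raft::device_matrix_view<const uint8_t, int64_t> dataset)
{
  cuvs::neighbors::ivf_flat::index_params params;
  // Choosing BitwiseHamming switches the index into binary operation; cluster
  // centers are then kept packed as uint8_t arrays in binary_centers_.
  params.metric = cuvs::distance::DistanceType::BitwiseHamming;
  auto index = cuvs::neighbors::ivf_flat::build(res, params, dataset);
  (void)index;
}
```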

2. K-means Clustering for Binary Data

The clustering approach for binary data required special handling:

  • Expanded Space Clustering: Binary data (uint8_t) is expanded to a signed representation (int8_t) in which each bit becomes ±1 (see the sketch after this list)

    • 0 → -1, 1 → +1 transformation enables meaningful centroid computation
    • Clustering performed using L2 distance in the expanded dimensional space
  • Centroid Quantization: After computing float centroids in the expanded space, they are converted back to binary format:

    • Centroids are stored as packed uint8_t arrays
    • KMeans (coarse) prediction is done on these quantized centroids with the BitwiseHamming distance.
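
The expansion and re-quantization can be pictured with the following host-side sketch (illustrative, with an assumed LSB-first bit order; the PR performs these steps in device kernels):

```cpp
// Sketch of the expanded-space transform (illustrative, not the PR's kernels).
// Each packed byte of a binary vector expands to 8 signed values in {-1, +1};
// float centroids computed in that space are re-packed by thresholding at 0.
#include <cstdint>
#include <vector>

// Expand one packed binary vector of `dim_bytes` bytes to 8*dim_bytes int8 values.
// Bit order (LSB-first here) is an assumption of this sketch.
std::vector<int8_t> expand_to_signed(const uint8_t* packed, int dim_bytes)
{
  std::vector<int8_t> out(dim_bytes * 8);
  for (int i = 0; i < dim_bytes; i++)
    for (int b = 0; b < 8; b++)
      out[i * 8 + b] = ((packed[i] >> b) & 1) ? int8_t{1} : int8_t{-1};  // 0 -> -1, 1 -> +1
  return out;
}

// Quantize a float centroid from the expanded space back to packed binary:
// a positive coordinate rounds to bit 1, a non-positive one to bit 0.
std::vector<uint8_t> quantize_centroid(const float* center, int dim_bytes)
{
  std::vector<uint8_t> packed(dim_bytes, 0);
  for (int i = 0; i < dim_bytes; i++)
    for (int b = 0; b < 8; b++)
      if (center[i * 8 + b] > 0.0f) packed[i] |= uint8_t(1u << b);
  return packed;
}
```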

3. Distance Kernels

Coarse Search (Cluster Selection)

  • Implemented a specialized bitwise_hamming_distance_op for query-to-centroid distances, used to compute PairwiseDistances (see the sketch below)
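
A minimal sketch of what the op computes (illustrative; in the PR the distances come from the PairwiseDistances machinery rather than a per-query loop, and the names below are placeholders):

```cpp
// Sketch: coarse search as an argmin of popcounted XORs over packed centers.
// Illustrative device function; nearest_center/n_lists are placeholder names.
#include <climits>
#include <cstdint>

__device__ int nearest_center(const uint8_t* query,
                              const uint8_t* centers,  // [n_lists, dim_bytes], packed bits
                              int n_lists,
                              int dim_bytes)
{
  int best = 0, best_dist = INT_MAX;
  for (int l = 0; l < n_lists; l++) {
    int dist = 0;
    for (int i = 0; i < dim_bytes; i++)
      dist += __popc(static_cast<unsigned>(query[i] ^ centers[l * dim_bytes + i]));
    if (dist < best_dist) { best_dist = dist; best = l; }
  }
  return best;
}
```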

Fine-Grained Search (Within Clusters)

Extended the interleaved scan kernel (ivf_flat_interleaved_scan.cuh) with specialized templates for BitwiseHamming:

  • Veclen-based optimization: Different code paths depending on the vectorization width (see the sketch after this list)

    • Veclen=16,8,4: Load data as uint32_t, use __popc(x ^ y) for 4-byte Hamming distance
    • Veclen=1,2: Byte-wise XOR and population count
  • Efficient memory access patterns:

    • Maintains interleaved data layout for coalesced memory access
    • Specialized loadAndComputeDist templates for uint8_t that leverage vectorized loads
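
The two paths can be sketched as follows (illustrative device functions; the real loadAndComputeDist templates in ivf_flat_interleaved_scan.cuh additionally walk the interleaved group layout):

```cpp
// Sketch of the two veclen code paths for BitwiseHamming; illustrative only.
#include <cstdint>

// Veclen >= 4 path: reinterpret 4 packed bytes as one uint32_t and popcount
// the XOR. Assumes 4-byte-aligned storage and dim_bytes divisible by 4.
__device__ int hamming_vec4(const uint8_t* a, const uint8_t* b, int dim_bytes)
{
  int dist = 0;
  auto a32 = reinterpret_cast<const uint32_t*>(a);
  auto b32 = reinterpret_cast<const uint32_t*>(b);
  for (int i = 0; i < dim_bytes / 4; i++)
    dist += __popc(a32[i] ^ b32[i]);  // 4 bytes of Hamming distance per __popc
  return dist;
}

// Veclen 1 or 2 path: byte-wise XOR + popcount, no alignment requirement.
__device__ int hamming_vec1(const uint8_t* a, const uint8_t* b, int dim_bytes)
{
  int dist = 0;
  for (int i = 0; i < dim_bytes; i++)
    dist += __popc(static_cast<unsigned>(a[i] ^ b[i]));
  return dist;
}
```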

As of 10/17/2025, binary size increase:

  • branch-25.12 (CUDA 12.9 + X86): 1232.414 MB
  • This PR (CUDA 12.9 + X86): 1251.051 MB

tarang-jain · Jul 09 '25 23:07

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot[bot] · Jul 09 '25 23:07

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] · Dec 03 '25 02:12

@tfeher, addressing your comment here:

One thing is not clear to me: why is uint8_t used internally during k-means training? Each vector is a bitstring, and uint8_t is just a storage type that we use. We could choose uint32_t as well to make the popcount ops more efficient. Since this is an implementation detail, I do not think we need to fix that in the current PR.

The problem with that is that uint32_t would not allow a number of bytes (dim) that is not divisible by 4. uint8_t is the most inclusive choice for dims.
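
For illustration, a word-wise loop would need an explicit byte tail for any dim not divisible by 4 (a sketch, not code from this PR):

```cpp
// Sketch: word-wise Hamming distance with a byte tail for dim % 4 != 0.
// Assumes GCC/Clang __builtin_popcount and 4-byte-aligned packed storage.
#include <cstdint>

int hamming_words_with_tail(const uint8_t* a, const uint8_t* b, int dim_bytes)
{
  int dist  = 0;
  int words = dim_bytes / 4;  // full 4-byte words
  auto a32  = reinterpret_cast<const uint32_t*>(a);
  auto b32  = reinterpret_cast<const uint32_t*>(b);
  for (int i = 0; i < words; i++) dist += __builtin_popcount(a32[i] ^ b32[i]);
  for (int i = words * 4; i < dim_bytes; i++)  // tail bytes, absent if dim % 4 == 0
    dist += __builtin_popcount(static_cast<unsigned>(a[i] ^ b[i]));
  return dist;
}
```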

tarang-jain · Dec 03 '25 03:12

/ok to test 91c6734

tfeher · Dec 03 '25 14:12