[FEA] Binary IVF Flat Index
Depends on https://github.com/rapidsai/raft/pull/2770
Implementation of a binary IVF-Flat index (BitwiseHamming metric for the IVF-Flat index).
Key Features
1. Binary Index Structure
- Added a `binary_centers_` field to store cluster centers as packed `uint8_t` arrays for binary data
- The index automatically detects the BitwiseHamming metric and configures itself for binary operation
- Only `uint8_t` inputs are supported with BitwiseHamming, and only single instantiations of the newly added kernels are compiled
2. K-means Clustering for Binary Data
The clustering approach for binary data required special handling:
- **Expanded Space Clustering**: Binary data (`uint8_t`) is expanded to a signed representation (`int8_t`) where each bit becomes ±1
  - The 0 → -1, 1 → +1 transformation enables meaningful centroid computation
  - Clustering is performed using L2 distance in the expanded-dimensional space
- **Centroid Quantization**: After computing float centroids in the expanded space, they are converted back to binary format:
  - Centroids are stored as packed `uint8_t` arrays
  - K-means (coarse) prediction is done on these quantized centroids with the BitwiseHamming distance
3. Distance Kernels
Coarse Search (Cluster Selection)
- Implemented a specialized `bitwise_hamming_distance_op` for query-to-centroid distances, used to compute `PairwiseDistances`
Fine-Grained Search (Within Clusters)
Extended the interleaved scan kernel (`ivf_flat_interleaved_scan.cuh`) with specialized templates for BitwiseHamming:
- **Veclen-based optimization**: Different code paths based on vectorization width
  - Veclen = 16, 8, 4: Load data as `uint32_t` and use `__popc(x ^ y)` for 4-byte Hamming distance
  - Veclen = 1, 2: Byte-wise XOR and population count
- **Efficient memory access patterns**:
  - Maintains the interleaved data layout for coalesced memory access
  - Specialized `loadAndComputeDist` templates for `uint8_t` that leverage vectorized loads
Binary size increase (as of 10/17/2025):
- branch-25.12 (CUDA 12.9 + X86): 1232.414 MB
- This PR (CUDA 12.9 + X86): 1251.051 MB
@tfeher addressing your comment here:
> One thing is not clear to me: why is uint8_t used internally during k-means training? Each vector is a bitstring, and uint8_t is just a storage type that we use. We could choose uint32_t as well to make the popcount ops more efficient. Since this is an implementation detail, I do not think we need to fix that in the current PR.
The problem is that using `uint32_t` would disallow any number of bytes (dim) not divisible by 4. `uint8_t` is the most inclusive choice for dims.
/ok to test 91c6734