TileDB-Vector-Search
TileDB-Vector-Search copied to clipboard
Lums/sc 42599/implement ivf pq index
This PR implements IVF index with PQ encoding. It requires the results from #265 so this is a branch from there.
As the name implies, this index synthesizes the indexing from IVF with the vector encoding from PQ. Ultimately there should be a decomposition of indexes into index representation, vector representation, and distance computation. However, because there is still a need to gather sufficient implementation experience with the different variants of each (and their compositions), the current implementation is monolithic. The new story sc-43058 was created for this refactoring.
The current scheme for C++ indexes includes implementing a class for metadata, for the storage group, and for the index itself. These exist as separate files in the index subdirectory for each of the current indexes. (Note that flat_l2_index and flat_pq_index have not been re-factored this way yet, though placeholder files for the metadata and group classes have been created.
Group and Metadata
The metadataivf_pq_metata essentially includes the metadata for each of IVF Flat and Flat PQ. The group ivf_pq_group contains arrays that are the integration of IVF and PQ:
cluster_centroids-- the centroids used for encoding with PQflat_ivf_centroids-- the flat (unencoded) centroids used for IVF indexing. These are obtained by applying kmeans to the flat input vectors.pq_ivf_centroids-- the PQ encoded centroids used for IVF indexing. These are currently obtained by PQ encoding theflat_ivf_centroids.pq_ivf_vectors-- PQ encoded vectors from the input, partitioned and shuffledivf_index-- IVF index vector (same as in IVF Flat)ivf_ids-- vector ids (same as in IVF Flat)distance_tables_*-- distance tables, one table per subspace and one array per table. These should probably go into a single array at some point. The data forpq_ivf_vectors,ivf_index, andivf_idsare maintained as aPartitionedMatrix(ortdbPartitionedMatrix).
Ingestion
The first step in ingestion is in train_pq which generates the cluster_centroids -- using sub_kmeans as was used in the previous PQ related PR. (Note that all kmeans functions have been moved to the file kmeans.h). Once we have the cluster_centroids we have the ability to encode all of the input vectors. train_pq also generates the distance_tables.
The train_ivf function applies kmeans to create a set of centroids and partition the input vectors. There are two ways this could be accomplished: Partition the flat vectors first to create a set of flat centroids, or encode the vectors first, apply kmeans to the encoded vectors to create a set of encoded (or unencoded) centroids. We opted for the former in this PR. With the former, we can then create compressed pq_ivf_centroids from the flat_ivf_centroids.
Once we have the ivf_centroids (flat or pq), we can partition and shuffle the input vectors (unencoded or encoded). In our case, we use the flat centroids to create the partitioning and then apply that partitioning to shuffle the encoded vectors. Again, there are multiple variations to try in terms of whether to use encoded or unencoded quantities at various steps in the ingestion / indexing process.
Right now everything happens in the add function:
train_pq(training_set); // Create cluster_centroids_, distance_tables_
train_ivf(training_set); // Create flat_ivf_centroids_
unpartitioned_pq_vectors_ = pq_encode(training_set); // encode the input vectors
pq_ivf_centroids_ = std::move(*pq_encode(flat_ivf_centroids_)); // encode the flat ivf centroids
auto partition_labels = detail::flat::qv_partition(
flat_ivf_centroids_, training_set, num_threads_, distance); // partition
partitioned_pq_vectors_ = std::make_unique<pq_storage_type>(
*unpartitioned_pq_vectors_, partition_labels, num_unique_labels); // shuffle
There is not currently a separate train function, though perhaps this add() function should be split into different parts. These pieces were kept together for the time being to allow experimentation with different ordering of encoding and so forth.
Note that when the create_default_impl function is called during ingestion, the metadata are set with information from the index at the very top of the function, as that information will be used later on in the function.
Queries
Queries are applied by calling the existing query functions to search the ivf index, using an appropriate distance function. If we pass in flat query vectors, we use the asymmetric distance function. If we pass in compressed query vectors, we use the symmetric distance function. Currently we do not compress the queries, so we use the asymmetric distance function. Per sc-43051, we should SIMDize these functions, as well as reorder the loops for encoding, etc -- which should result in 10-20X speedup. The ivf_pq_index class only currently includes query_finite and query_infinite (it does not have calls tot the various qv, nut, and so forth).
Unit Testing
Unit testing performs most of the same tests as for flat_pq and for ivf_flat -- though currently not as thoroughly.
Distributed Computation
This should be as easy to distribute as IVF Flat, since partitioning and querying (etc) are oblivious to vector representation.
This pull request has been linked to Shortcut Story #42599: Implement IVF PQ Index.
NOTE(paris): I'm currently waiting to merge this until Vamana is fully working end to end, as that will help keep the changes I need to do when fixing bugs in the group / metadata / etc. down. Once it's working I'll bring this in and apply all changes.