TileDB-Vector-Search Lums/sc 42599/implement ivf pq index

This PR implements IVF index with PQ encoding. It requires the results from #265 so this is a branch from there.

As the name implies, this index synthesizes the indexing from IVF with the vector encoding from PQ. Ultimately there should be a decomposition of indexes into index representation, vector representation, and distance computation. However, because there is still a need to gather sufficient implementation experience with the different variants of each (and their compositions), the current implementation is monolithic. The new story sc-43058 was created for this refactoring.

The current scheme for C++ indexes includes implementing a class for metadata, for the storage group, and for the index itself. These exist as separate files in the index subdirectory for each of the current indexes. (Note that flat_l2_index and flat_pq_index have not been re-factored this way yet, though placeholder files for the metadata and group classes have been created.

Group and Metadata

The metadataivf_pq_metata essentially includes the metadata for each of IVF Flat and Flat PQ. The group ivf_pq_group contains arrays that are the integration of IVF and PQ:

cluster_centroids -- the centroids used for encoding with PQ
flat_ivf_centroids -- the flat (unencoded) centroids used for IVF indexing. These are obtained by applying kmeans to the flat input vectors.
pq_ivf_centroids -- the PQ encoded centroids used for IVF indexing. These are currently obtained by PQ encoding the flat_ivf_centroids.
pq_ivf_vectors -- PQ encoded vectors from the input, partitioned and shuffled
ivf_index -- IVF index vector (same as in IVF Flat)
ivf_ids -- vector ids (same as in IVF Flat)
distance_tables_* -- distance tables, one table per subspace and one array per table. These should probably go into a single array at some point. The data for pq_ivf_vectors, ivf_index, and ivf_ids are maintained as a PartitionedMatrix (or tdbPartitionedMatrix).

Ingestion

The first step in ingestion is in train_pq which generates the cluster_centroids -- using sub_kmeans as was used in the previous PQ related PR. (Note that all kmeans functions have been moved to the file kmeans.h). Once we have the cluster_centroids we have the ability to encode all of the input vectors. train_pq also generates the distance_tables.

The train_ivf function applies kmeans to create a set of centroids and partition the input vectors. There are two ways this could be accomplished: Partition the flat vectors first to create a set of flat centroids, or encode the vectors first, apply kmeans to the encoded vectors to create a set of encoded (or unencoded) centroids. We opted for the former in this PR. With the former, we can then create compressed pq_ivf_centroids from the flat_ivf_centroids.

Once we have the ivf_centroids (flat or pq), we can partition and shuffle the input vectors (unencoded or encoded). In our case, we use the flat centroids to create the partitioning and then apply that partitioning to shuffle the encoded vectors. Again, there are multiple variations to try in terms of whether to use encoded or unencoded quantities at various steps in the ingestion / indexing process.

Right now everything happens in the add function:

    train_pq(training_set);   // Create cluster_centroids_, distance_tables_
    train_ivf(training_set);   // Create flat_ivf_centroids_
    unpartitioned_pq_vectors_ = pq_encode(training_set); // encode the input vectors
    pq_ivf_centroids_ = std::move(*pq_encode(flat_ivf_centroids_)); // encode the flat ivf centroids

    auto partition_labels = detail::flat::qv_partition(
        flat_ivf_centroids_, training_set, num_threads_, distance);  // partition

    partitioned_pq_vectors_ = std::make_unique<pq_storage_type>(
        *unpartitioned_pq_vectors_, partition_labels, num_unique_labels);  // shuffle

There is not currently a separate train function, though perhaps this add() function should be split into different parts. These pieces were kept together for the time being to allow experimentation with different ordering of encoding and so forth.

Note that when the create_default_impl function is called during ingestion, the metadata are set with information from the index at the very top of the function, as that information will be used later on in the function.

Queries

Queries are applied by calling the existing query functions to search the ivf index, using an appropriate distance function. If we pass in flat query vectors, we use the asymmetric distance function. If we pass in compressed query vectors, we use the symmetric distance function. Currently we do not compress the queries, so we use the asymmetric distance function. Per sc-43051, we should SIMDize these functions, as well as reorder the loops for encoding, etc -- which should result in 10-20X speedup. The ivf_pq_index class only currently includes query_finite and query_infinite (it does not have calls tot the various qv, nut, and so forth).

Unit Testing

Unit testing performs most of the same tests as for flat_pq and for ivf_flat -- though currently not as thoroughly.

Distributed Computation

This should be as easy to distribute as IVF Flat, since partitioning and querying (etc) are oblivious to vector representation.

Mar 13 '24 04:03 lums658

This pull request has been linked to Shortcut Story #42599: Implement IVF PQ Index.

Mar 13 '24 04:03 shortcut-integration[bot]

NOTE(paris): I'm currently waiting to merge this until Vamana is fully working end to end, as that will help keep the changes I need to do when fixing bugs in the group / metadata / etc. down. Once it's working I'll bring this in and apply all changes.

May 02 '24 12:05 jparismorgan

TileDB-Vector-Search TileDB-Vector-Search copied to clipboard

Lums/sc 42599/implement ivf pq index

Group and Metadata

Ingestion

Queries

Unit Testing

Distributed Computation

TileDB-Vector-Search
TileDB-Vector-Search copied to clipboard