[Feature] Add support for passing extra information with KNNVectorField
Description
Currently KnnFloatVectorField (and its byte counterpart) only supports passing the vector information (which includes dim etc.). But sometimes you need to pass extra information along with the vector field, which can be helpful in building efficient search data structures. A naive example is a tenant id, where you might want to cluster the points specific to a tenant in the HNSW graph.
The ask in this issue is to have a byte array, or something similar, that can be passed with the vector field and made accessible at the KnnVectorsFormat level.
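A minimal sketch of what the ask could look like at indexing time. The extra-payload argument is hypothetical and only shown in a comment; today's KnnFloatVectorField constructors accept nothing like it:

```java
import java.nio.charset.StandardCharsets;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

float[] vector = new float[] {0.1f, 0.2f, 0.3f};
byte[] extra = "tenant-42".getBytes(StandardCharsets.UTF_8); // the proposed payload

Document doc = new Document();
// Existing constructors take (name, vector) or (name, vector, similarityFunction).
// The ask is a way to attach `extra` so the KnnVectorsFormat can read it back
// per document at flush time, e.g. hypothetically:
//   doc.add(new KnnFloatVectorField("my_vector", vector, DOT_PRODUCT, extra));
doc.add(new KnnFloatVectorField("my_vector", vector, VectorSimilarityFunction.DOT_PRODUCT));
```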
Alternatives
I tried passing the information in another field, like a BinaryDocValues field, but during indexing the KnnVectorsFormat does not have access to the BDV. If there is an alternative way to access it which I might be missing, please let me know.
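For concreteness, the attempt looked roughly like this (field names are illustrative):

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.util.BytesRef;

// Index the tenant id in a sibling doc-values field next to the vector.
Document doc = new Document();
doc.add(new KnnFloatVectorField("my_vector", new float[] {0.1f, 0.2f, 0.3f}));
doc.add(new BinaryDocValuesField("tenant_id", new BytesRef("tenant-42")));
// Problem: at flush time the KnnVectorsWriter for "my_vector" has no handle
// to the "tenant_id" doc values, so the graph builder cannot use them.
```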
I wonder if there might be a better way to accomplish your actual goal. Adding "extra data" doesn't seem like a good idea to me since it inherently blurs the function of the data format. Can you describe the intended use case more fully? I guess this is some kind of clustering? What does that mean?
As mentioned in the description, one use case I was thinking of is: let's say I have an identifier named tenantId for the vector, and while searching I always want to get results for that specific tenant. Hence I want to build multiple HNSW graphs at the segment level, one per tenant, so that when a search comes in for a specific tenant, rather than traversing the large graph containing all the tenants, the search can traverse the single smaller graph for that tenant.
An alternative is to use filters with k-NN, but with filters we need to run the filter query and collect all the matching docs before going into the HNSW search. This adds extra latency and also reduces recall if the tenant is very small.
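For reference, the filter alternative looks like this, assuming a tenant_id keyword field and given an IndexSearcher searcher and a float[] queryVector (names illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Restrict the kNN search to one tenant. The filter's matching docs are
// collected up front, before the HNSW traversal starts, which is the extra
// latency described above.
Query tenantFilter = new TermQuery(new Term("tenant_id", "tenant-42"));
KnnFloatVectorQuery query =
    new KnnFloatVectorQuery("my_vector", queryVector, 10, tenantFilter);
TopDocs hits = searcher.search(query, 10);
```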
@msokolov please let me know if you have more questions, happy to answer them. :)
If the use case is multitenancy, it seems you would never want to search across tenants, so this would apply not only to KNN search but to all kinds of search? I agree the impact on KNN search is outsized, but would it make sense to build a separate index per tenant?
@msokolov multi-tenancy is just one use case; I added it as an example. Having separate indices makes sense for a small number of tenants, but if the number of tenants goes into the millions, multiple indices are not a great solution, as they put a lot of pressure on the operating system in terms of managing files and directories.
If as a user I am just doing vector search, and not other searches like text search, then this kind of solution is more intuitive and easier to use.
BTW, there is another way to solve this problem: if the Lucene codec writers could get access to other fields, then we wouldn't need to pass any information with the vector. We could put, say, a tenant id in a BDV and then access it in the HNSW writer to build the correct graph. WDYT about that?
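A purely hypothetical sketch of that idea; no such hook exists in the codec API today, and the interface name is made up:

```java
import java.io.IOException;

import org.apache.lucene.index.BinaryDocValues;

// Made-up hook: if flush handed the vectors writer a view of sibling
// doc-values fields, the tenant id could stay in an ordinary BDV field.
interface SiblingFieldsReader {
  BinaryDocValues getBinary(String field) throws IOException;
}

// Inside a custom HNSW writer's flush one could then do something like:
//   BinaryDocValues tenants = siblings.getBinary("tenant_id");
// and route each document's vector into its per-tenant graph.
```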
What if we created a vector distance function like dot-product(v1, v2) * sgnum(d1, d2), where (v1, d1) and (v2, d2) are the (vector, cluster) pairs indexed together in the same document? With this, you would get a single graph made up of many smaller disjoint components. Then when you search, you would need to seed with appropriate entry points by doing a search for some document(s) matching the cluster. We'd also need to disable the way we forcibly connect disconnected components today. Possibly you could even encode the cluster as a subspace in the vector itself and avoid the need for another field that way. To do this, we'd have to find a way to extend the vector distance functions, or make them extensible. This is challenging, but at least it's something others have asked for, and it seems like a more natural kind of extension to me.
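One possible reading of that distance function, as a sketch; the (d1 == d2 ? 1 : -1) factor is an interpretation of sgnum(d1, d2), whose exact semantics the comment leaves open:

```java
// Multiply the raw similarity by a factor that rewards same-cluster pairs
// and penalizes cross-cluster pairs, so graph construction never prefers
// an edge across clusters.
static float clusteredDotProduct(float[] v1, float[] v2, long d1, long d2) {
  float dot = 0f;
  for (int i = 0; i < v1.length; i++) {
    dot += v1[i] * v2[i];
  }
  return dot * (d1 == d2 ? 1f : -1f);
}
```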
I remember (but I don't remember where) seeing someone doing multi-tenant vector search by using a flat vector index and enabling index sorting on the tenant ID. Then vector search can't take advantage of an advanced structure like HNSW, but the I/O access pattern is disk-friendly, so if each tenant isn't too large on its own the performance may be acceptable.
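For reference, index sorting on a tenant id is configured like this (field name illustrative; the field must be indexed as doc values to be sortable):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Sort segments by tenant so each tenant's docs (and vectors) are laid out
// contiguously, giving the disk-friendly sequential access pattern for a
// flat per-tenant scan.
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setIndexSort(new Sort(new SortField("tenant_id", SortField.Type.STRING)));
// Each document then needs:
//   doc.add(new SortedDocValuesField("tenant_id", new BytesRef("tenant-42")));
```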
In general, I'm not a fan of this proposal of enabling creating multiple KNN indexes via some user-provided tenant/cluster ID. This looks like working around the fact that vector-search is currently not good at pre-filtering. I'd rather look into how we can make vector search better at pre-filtering (either with HNSW or something else).
Thanks @jpountz for your thoughts. I do agree filtering has its limitations with HNSW, and building separate graphs and then attaching them as a single unit was one option to solve that. I will think this over more to see if there is a better way, or whether we can fuse the tenant information into the vector itself, so that vectors belonging to one tenant end up closer together.
@msokolov sorry for being late on the reply. If we are able to encode the cluster information (in this case tenant information) with the vector, say as (vector, cluster) pairs, I think it solves the information-representation part of the problem, so that is pretty good.
As for how to use the cluster/extra information in the distance computation: that is a more vector-like way of representing things, and maybe it could be the default implementation.
But I would take it a bit further and open an extension point where custom Codecs get the opportunity to influence how this cluster logic is used during node connection. For example, nodes (aka docs) of cluster 1 and cluster 2 can be connected, but nodes of cluster 2 and cluster 3 cannot. Because even though distances can keep things as far apart as possible, cross-cluster nodes may still pop up in the final search results.
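To make that extension point concrete, a hypothetical sketch; nothing like this exists in Lucene today, and the interface name is made up:

```java
// Made-up extension point: let a custom codec veto edges between clusters
// while the HNSW graph is being built.
interface ClusterConnectionPolicy {
  /** Return true if nodes from these two clusters may share an edge. */
  boolean canConnect(long cluster1, long cluster2);
}

// The example from above: cluster 1 and 2 may connect, 2 and 3 may not.
ClusterConnectionPolicy policy =
    (c1, c2) -> !((c1 == 2 && c2 == 3) || (c1 == 3 && c2 == 2));
```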