tantivy
(Approximate) Nearest Neighbour / Vector Similarity Search
Is your feature request related to a problem? Please describe. Nowadays, search can be based on both BM25 and semantic search. Using cosine similarity with a vector provided at index time, we could weight BM25 and vector distance together to compute the score. The vector would be provided both at index time and at search time: one document maps to many index entries (one per sentence), and one query comes with a single vector.
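The blending described above can be sketched as a weighted sum of the lexical and semantic scores. A minimal sketch, assuming a simple linear combination; `alpha` and the function names are illustrative, not tantivy API:

```rust
// Hypothetical sketch of blending BM25 with cosine similarity.
// `alpha` and the function names are illustrative, not tantivy API.

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// Final score: weighted sum of the lexical (BM25) and semantic scores.
fn hybrid_score(bm25: f32, query_vec: &[f32], doc_vec: &[f32], alpha: f32) -> f32 {
    alpha * bm25 + (1.0 - alpha) * cosine_similarity(query_vec, doc_vec)
}

fn main() {
    // Identical vectors: cosine = 1.0, so score = 0.7 * 2.0 + 0.3 * 1.0 = 1.7
    println!("{}", hybrid_score(2.0, &[1.0, 0.0], &[1.0, 0.0], 0.7));
}
```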
Describe the solution you'd like Integrate Faiss (or any other vector search library) into tantivy so that the merge is done inside tantivy at search time.
[Optional] describe alternatives you've considered Right now we have an external vector index and tantivy, and we merge results in middleware, but this may be an interesting feature to discuss.
I'm just evaluating if it makes sense.
Using embedding for semantic search is a hot topic.
It might be a bit premature for tantivy right now (I do not know of many successful implementations in the industry), but this is something we might want to revisit later.
If you do not have a very specific use case, let's keep this ticket open for the moment and revisit it later.
My initial idea (from the API point of view) was to provide a new field type that can index into a vector index (like https://docs.rs/crate/hnsw/0.2.0/source/README.md) in the same "transaction", and at search time to be able to search by vector similarity. How the vector is computed from the text should be outside of the indexer (imho).
Another approach (for text) is https://github.com/AdeDZY/DeepCT, which uses ML to weigh terms in documents & queries, and then uses bm25 with those weights.
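The DeepCT-style approach can be sketched as BM25 with the raw term frequency swapped for a learned term weight. A minimal sketch under that assumption; the function name and constants are illustrative, not code from DeepCT or tantivy:

```rust
// Sketch of BM25 where the raw term frequency is replaced by a learned
// term weight, in the spirit of DeepCT. The function name and the k1/b
// constants are illustrative assumptions, not DeepCT or tantivy code.

fn bm25_with_learned_weight(term_weight: f32, doc_len: f32, avg_doc_len: f32, idf: f32) -> f32 {
    let (k1, b) = (1.2_f32, 0.75_f32);
    // Standard BM25 length normalization, applied to the learned weight.
    let norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    idf * term_weight * (k1 + 1.0) / (term_weight + norm)
}

fn main() {
    // A heavily weighted term contributes more, with the usual BM25 saturation.
    println!("{}", bm25_with_learned_weight(5.0, 100.0, 100.0, 2.0));
}
```

The appeal, as noted above, is that this reuses the existing inverted index: the learned weight is just stored where the term frequency would normally go.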
https://microsoft.github.io/msmarco/ is good to keep an eye on, the current full ranking entries seem to be inverted index with some ml for query expansion or indexing + language model re-ranking.
@acertain this sounds super interesting !
I am trying to do this: let them work together, rather than have one contain the other. But I found that more of the transformation work may fall on the Faiss side.
I am very interested in this too and would love to work together. I have research experience in both DeepCT and vector-based search.
@snakeztc Could you lead the development of such a feature? Also, DeepCT and dense vector nearest neighbor are very different problems. Which one do you need?
This is a very valuable feature, but it adds a lot of complexity, so the conditions for it to be merged are:
- a) it should be behind a feature flag.
- b) it should not be half baked.
- c) it should be well tested
- d) it should have an actual user.
I can help with the design/code review provided we are shooting for eventually shipping this.
That's on my dev list now. I am going to shoot for DeepCT sparse vectors instead of dense vectors, since that can be done by reusing the current inverted index. Any help on design or code review would be greatly appreciated. @fulmicoton
I am interested in testing/using it. Any update?
Any updates on this? With Pinecone being under deathly strain, I've begun looking into this lib and other semantic index libs for a DIY solution
We finally implemented an index with a similar approach to tantivy with a variant of HNSW at NucliaDB Vectors
We are using ANN on Tantivy in prod with an implementation of what Faiss calls IVFFlat. It is not open-source ready at this point. Sharing some details for the curious!
- ANN index built offline (separate from Tantivy), K-Means clustering using linfa-clustering. The ANN index is bincode-serialized and contains the trained centroids and entityID to clusterID (centroid) assignments. Vector data is separately serialized; each cluster's vectors are stored contiguously.
- Serving pods load the index into memory, vector data is memory-mapped.
- Tantivy `Warmer` is used to maintain the ANN index as Tantivy segment-level state: `ClusterId -> [DocId]`, `DocId -> VectorIdx`.
- A custom `AnnQuery` (a tantivy `Query` implementation) leverages the warmed state for matching and scoring. It computes the nearest clusters for the query vector, and uses the desired probing % parameter to pick the top-K clusters to match for its `DocSet`. The `Scorer` does a dot product of the query vector with that document's vector. We are able to easily combine the `AnnQuery` with other clauses for filtering.
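The IVFFlat-style query path described above can be sketched as: rank centroids against the query, probe only the nearest clusters, and score each candidate document by dot product. A minimal sketch; all names and data layouts here are hypothetical, not the production implementation:

```rust
// Illustrative sketch of an IVFFlat-style query path: rank centroids
// against the query, probe the top `n_probe` clusters, score candidates
// by dot product. Names and data layouts are hypothetical.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// clusters[c] holds the (doc_id, vector) pairs assigned to centroid c.
fn ann_search(
    query: &[f32],
    centroids: &[Vec<f32>],
    clusters: &[Vec<(u32, Vec<f32>)>],
    n_probe: usize,
) -> Vec<(u32, f32)> {
    // Rank clusters by centroid similarity to the query vector.
    let mut order: Vec<usize> = (0..centroids.len()).collect();
    order.sort_by(|&i, &j| {
        dot(query, &centroids[j])
            .partial_cmp(&dot(query, &centroids[i]))
            .unwrap()
    });
    // Probe only the nearest `n_probe` clusters; score their documents.
    let mut hits: Vec<(u32, f32)> = order
        .iter()
        .take(n_probe)
        .flat_map(|&c| clusters[c].iter().map(|(id, v)| (*id, dot(query, v))))
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    hits
}

fn main() {
    let centroids = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let clusters = vec![
        vec![(1, vec![0.9, 0.1]), (2, vec![1.0, 0.0])],
        vec![(3, vec![0.0, 1.0])],
    ];
    // Probing one cluster skips doc 3 entirely; doc 2 scores highest.
    println!("{:?}", ann_search(&[1.0, 0.0], &centroids, &clusters, 1));
}
```

The probing parameter trades recall for speed: a larger `n_probe` scans more clusters, approaching exact search.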
Any updates? Is this feature a WIP or planned?
Not planned at all.