
(Approximate) Nearest Neighbour / Vector Similarity Search

Open · bloodbare opened this issue 4 years ago · 17 comments

Is your feature request related to a problem? Please describe. Nowadays search can be based on both BM25 and semantic similarity. Using cosine similarity against a vector provided at index time, we could weight BM25 and vector distance together to compute the score. The vector would be provided both at index time and at search time: one document maps to many indexed vectors (one per sentence), and one query carries a single vector.
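For concreteness, a minimal self-contained sketch of that blended scoring: a hand-rolled cosine similarity mixed with a BM25 score. The alpha weight and the vectors are purely illustrative and not part of any tantivy API.

```rust
// Blend a lexical (BM25) score with the cosine similarity between a query
// vector and a document vector. All numbers below are illustrative.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

fn blended_score(bm25: f32, query_vec: &[f32], doc_vec: &[f32], alpha: f32) -> f32 {
    // alpha weights the lexical part, (1 - alpha) the semantic part.
    alpha * bm25 + (1.0 - alpha) * cosine_similarity(query_vec, doc_vec)
}

fn main() {
    let query = [0.1_f32, 0.7, 0.2];
    let doc = [0.2_f32, 0.6, 0.1];
    println!("{}", blended_score(4.2, &query, &doc, 0.5));
}
```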

Describe the solution you'd like Integrate Faiss (or any other vector search library) into tantivy, so that the merging is done inside tantivy at search time.

[Optional] describe alternatives you've considered Right now we have an external vector index and tantivy, and we merge the results in middleware, but this may be an interesting feature to discuss.

I'm just evaluating if it makes sense.

bloodbare avatar Apr 19 '20 17:04 bloodbare

Using embedding for semantic search is a hot topic.

It might be a bit premature for tantivy right now (I do not know of many successful implementations in the industry), but this is something we might want to revisit later.

If you do not have a very specific use case, let's keep this ticket open for the moment and revisit it later.

fulmicoton avatar Apr 27 '20 02:04 fulmicoton

My initial idea (from the API point of view) was to provide a new field type that can be indexed into a vector index (like https://docs.rs/crate/hnsw/0.2.0/source/README.md) within the same "transaction", and at search time to be able to search by vector similarity. How the vector is computed from the text should stay outside of the indexer (imho).
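Purely as an illustration of that shape, here is a hypothetical sketch: a vector index that is fed already-computed embeddings keyed by the same doc id as the text index, and searched by similarity at query time. None of these types exist in tantivy; a brute-force scan stands in for HNSW.

```rust
// Hypothetical API sketch -- not a tantivy type. The embedding is computed by
// the caller (sentence encoder, etc.) and handed over alongside the doc id.
trait VectorIndex {
    fn add(&mut self, doc_id: u64, embedding: &[f32]);
    fn top_k(&self, query: &[f32], k: usize) -> Vec<(u64, f32)>;
}

// Brute-force stand-in so the sketch compiles; a real backend would be HNSW.
#[derive(Default)]
struct BruteForceIndex {
    rows: Vec<(u64, Vec<f32>)>,
}

impl VectorIndex for BruteForceIndex {
    fn add(&mut self, doc_id: u64, embedding: &[f32]) {
        self.rows.push((doc_id, embedding.to_vec()));
    }

    fn top_k(&self, query: &[f32], k: usize) -> Vec<(u64, f32)> {
        // Score every stored vector by dot product and keep the k best.
        let mut hits: Vec<(u64, f32)> = self
            .rows
            .iter()
            .map(|(id, v)| (*id, v.iter().zip(query).map(|(a, b)| a * b).sum::<f32>()))
            .collect();
        hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        hits.truncate(k);
        hits
    }
}

fn main() {
    let mut index = BruteForceIndex::default();
    index.add(1, &[0.9, 0.1, 0.0]);
    index.add(2, &[0.1, 0.8, 0.1]);
    println!("{:?}", index.top_k(&[0.8, 0.2, 0.0], 1));
}
```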

bloodbare avatar Apr 27 '20 08:04 bloodbare

Another approach (for text) is https://github.com/AdeDZY/DeepCT, which uses ML to weigh terms in documents & queries, and then uses BM25 with those weights.

https://microsoft.github.io/msmarco/ is good to keep an eye on; the current full-ranking entries seem to be an inverted index plus some ML for query expansion or indexing, followed by language-model re-ranking.

acertain avatar Jun 11 '20 01:06 acertain

@acertain this sounds super interesting!

fulmicoton avatar Jun 16 '20 06:06 fulmicoton

I am trying to do this: let them work together rather than have one contain the other. But I found that more of the transformation may need to happen on the Faiss side.

ansjsun avatar Jul 23 '20 07:07 ansjsun

I am very interested in this too and would love to work together. I have research experience in both DeepCT and vector-based search.

snakeztc avatar Nov 14 '20 03:11 snakeztc

@snakeztc Could you lead the development of such a feature? Also, DeepCT and dense-vector nearest neighbor are very different problems. Which one do you need?

This is a very valuable feature, but it adds a lot of complexity, so the conditions for it to be merged are:

  • a) it should be behind a feature flag.
  • b) it should not be half-baked.
  • c) it should be well tested.
  • d) it should have an actual user.

I can help with the design/code review provided we are shooting for eventually shipping this.

fulmicoton avatar Nov 16 '20 01:11 fulmicoton

That's on my dev list now. I am going to shoot for DeepCT sparse vectors instead of dense vectors, since that can be done by reusing the current inverted index. Any help on design or code review would be greatly appreciated. @fulmicoton
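For concreteness, a minimal sketch of one common way to fold learned term weights into an existing inverted index (in the spirit of DeepCT-Index): quantize each predicted weight into a small integer and repeat the term that many times, so the usual tokenizer/postings path records it as a term frequency. The scale factor and the weights below are illustrative.

```rust
// Turn (term, learned weight) pairs into a pseudo-document whose term
// frequencies encode the weights, so a standard BM25 index can consume it.
fn weights_to_pseudo_doc(term_weights: &[(&str, f32)], scale: f32) -> String {
    let mut out: Vec<&str> = Vec::new();
    for (term, weight) in term_weights.iter().copied() {
        let tf = (weight * scale).round().max(0.0) as usize;
        for _ in 0..tf {
            out.push(term);
        }
    }
    out.join(" ")
}

fn main() {
    // Hypothetical learned weights for one passage (e.g. DeepCT output).
    let weights = [("search", 0.9_f32), ("vector", 0.6), ("the", 0.02)];
    // Prints "search" five times and "vector" three times; "the" drops out.
    println!("{}", weights_to_pseudo_doc(&weights, 5.0));
}
```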

snakeztc avatar Dec 06 '20 16:12 snakeztc

I am interested in testing/using it. Any update?

AlexMikhalev avatar Jul 29 '21 10:07 AlexMikhalev

Any updates on this? With Pinecone being under deathly strain, I've begun looking into this lib and other semantic index libs for a DIY solution

davit-b avatar Apr 15 '23 21:04 davit-b

We finally implemented an index with an approach similar to tantivy's, using a variant of HNSW, at NucliaDB Vectors.

bloodbare avatar Apr 16 '23 21:04 bloodbare

We are using ANN on Tantivy in prod with an implementation of what Faiss calls IVFFlat. It is not open-source ready at this point. Sharing some details for the curious!

  • ANN index built offline (separate from Tantivy), with k-means clustering using linfa-clustering. The ANN index is bincode-serialized and contains the trained centroids and the entityID -> clusterID (centroid) assignments. Vector data is serialized separately; each cluster's vectors are stored contiguously.
  • Serving pods load the index into memory, vector data is memory-mapped.
  • Tantivy Warmer is used to maintain the ANN index as Tantivy segment-level state -- ClusterId -> [DocId], DocId -> VectorIdx.
  • Custom AnnQuery (a tantivy Query implementation) leverages the warmed state for matching and scoring. It computes the nearest clusters for the query vector and uses a desired probing-percentage parameter to pick the top-K clusters that form its DocSet. The Scorer then takes the dot product of the query vector with each matched document's vector. We are able to easily combine the AnnQuery with other clauses for filtering. (A standalone sketch of this probing step follows below.)
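A standalone sketch of that IVFFlat-style probing: the query picks its nearest centroids, then scores only the vectors assigned to those clusters with a dot product. Names, the in-memory layout, and the fixed n_probe (standing in for the probing-percentage parameter) are illustrative, not the production implementation.

```rust
// Centroids are trained offline (e.g. with k-means); each cluster stores the
// (doc_id, vector) pairs assigned to its centroid.
struct AnnIndex {
    centroids: Vec<Vec<f32>>,
    clusters: Vec<Vec<(u64, Vec<f32>)>>,
}

impl AnnIndex {
    fn search(&self, query: &[f32], n_probe: usize, top_k: usize) -> Vec<(u64, f32)> {
        // Rank centroids by similarity to the query and keep the n_probe closest.
        let mut by_centroid: Vec<(usize, f32)> = self
            .centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, dot(query, c)))
            .collect();
        by_centroid.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        by_centroid.truncate(n_probe);

        // Score every vector in the probed clusters and keep the best top_k.
        let mut hits: Vec<(u64, f32)> = by_centroid
            .iter()
            .flat_map(|(cluster_id, _)| self.clusters[*cluster_id].iter())
            .map(|(doc_id, v)| (*doc_id, dot(query, v)))
            .collect();
        hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        hits.truncate(top_k);
        hits
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let index = AnnIndex {
        centroids: vec![vec![1.0, 0.0], vec![0.0, 1.0]],
        clusters: vec![
            vec![(10, vec![0.9, 0.1])],
            vec![(20, vec![0.2, 0.8]), (21, vec![0.1, 0.9])],
        ],
    };
    // Probe the single nearest cluster, return the best 2 hits from it.
    println!("{:?}", index.search(&[0.1, 0.95], 1, 2));
}
```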

shikhar avatar Apr 17 '23 19:04 shikhar

Any updates? Is this feature a WIP or planned?

bladehliu avatar Feb 23 '24 01:02 bladehliu

Not planned at all.

fulmicoton avatar Feb 23 '24 10:02 fulmicoton