[DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors?
Description
Learned Sparse Vectors are claimed to combine the benefits of sparse (i.e. lexical) and dense (i.e. vector) representations.
From https://en.wikipedia.org/wiki/Learned_sparse_retrieval:
Learned sparse retrieval or sparse neural search is an approach to text search which uses a sparse vector representation of queries and documents.[1] It borrows techniques both from lexical bag-of-words and vector embedding algorithms, and is claimed to perform better than either alone. The best-known sparse neural search systems are SPLADE[2] and its successor SPLADE v2
From https://zilliz.com/learn/enhancing-information-retrieval-learned-sparse-embeddings:
Learned sparse embeddings denote sparse vector representations of data crafted through sophisticated machine learning models such as SPLADE and BGE-M3. Unlike traditional sparse vectors, which rely solely on statistical methods like BM25, learned sparse embeddings enrich the sparse representation with contextual information while retaining keyword search capabilities. They can discern the significance of adjacent or correlated tokens, even if not explicitly present in the text, resulting in a "learned" sparse representation adept at capturing relevant keywords and classes.
A famous model for such sparse representations of documents is SPLADE (https://github.com/naver/splade)
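To make the representation concrete, a learned sparse embedding is just a mapping from vocabulary terms to learned weights. Here is a tiny made-up example (the terms and weights are illustrative, not actual SPLADE output):

```java
import java.util.Map;

// Minimal illustration with hypothetical weights (not real SPLADE output): a learned
// sparse embedding maps vocabulary terms to learned weights. Unlike a BM25-style sparse
// vector, it can also assign non-zero weights to expansion terms that never appear in
// the original text, here "how much does a car battery weigh".
public class LearnedSparseVectorExample {
  public static void main(String[] args) {
    Map<String, Float> embedding = Map.of(
        "car", 2.1f,
        "battery", 1.9f,
        "weigh", 1.4f,
        "weight", 1.1f,   // expansion term: morphological variant
        "vehicle", 0.6f,  // expansion term: related concept
        "heavy", 0.3f);   // expansion term: weakly related concept
    embedding.forEach((term, weight) -> System.out.println(term + " -> " + weight));
  }
}
```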
Paper
I came across Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations (https://arxiv.org/pdf/2404.18812), which shows promising benchmarks for KNN search over learned sparse vectors:
Experimental results show average per-query latency in microsecond territory on various sparse embeddings of Ms Marco. Impressively, Seismic outperforms the graph-based winning solutions of the BigANN Challenge by a factor of at least 3.4 at 95% accuracy on Splade and 12 on Efficient Splade, with the margin widening substantially as accuracy increases. Other baselines, including state-of-the-art inverted index-based algorithms, are consistently one to two orders of magnitude slower than Seismic
Figure 3 in the linked paper summarizes the design of the algorithm:
The design of Seismic. Inverted lists are independently partitioned into geometrically-cohesive blocks. Each block is a set of document identifiers with a summary vector. The inner product of a query with the summary approximates the inner product attainable with the documents in that block. The forward index stores the complete vectors (including values).
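For illustration, here is a rough standalone sketch of the search loop that caption describes; the names (Block, forwardIndex) and the heap bookkeeping are mine, not taken from the paper's code or from Lucene:

```java
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Standalone sketch of the block-skipping idea from the figure caption; names and data
// layout are illustrative, not taken from the Seismic code or from Lucene.
public class SeismicStyleSearchSketch {

  /** A block of an inverted list: doc IDs plus a summary vector over the same term space. */
  record Block(int[] docIds, Map<Integer, Float> summary) {}

  /** Sparse inner product between two term-id -> weight maps. */
  static float dot(Map<Integer, Float> a, Map<Integer, Float> b) {
    float sum = 0f;
    for (Map.Entry<Integer, Float> e : a.entrySet()) {
      Float v = b.get(e.getKey());
      if (v != null) sum += e.getValue() * v;
    }
    return sum;
  }

  /** Approximate top-k maximum inner product search over blocked inverted lists. */
  static PriorityQueue<float[]> search(
      Map<Integer, Float> query,                        // sparse query: term id -> weight
      List<Block> candidateBlocks,                      // blocks from the query terms' inverted lists
      Map<Integer, Map<Integer, Float>> forwardIndex,   // doc id -> complete sparse vector
      int k) {
    // Min-heap of [score, docId]; peek() is the current k-th best score.
    PriorityQueue<float[]> topK = new PriorityQueue<>((x, y) -> Float.compare(x[0], y[0]));
    for (Block block : candidateBlocks) {
      // The query-summary inner product approximates the best score attainable within
      // the block, so uncompetitive blocks are skipped wholesale.
      float estimate = dot(query, block.summary());
      if (topK.size() == k && estimate <= topK.peek()[0]) {
        continue;
      }
      for (int docId : block.docIds()) {
        float score = dot(query, forwardIndex.get(docId)); // exact score from the forward index
        if (topK.size() < k) {
          topK.offer(new float[] {score, docId});
        } else if (score > topK.peek()[0]) {
          topK.poll();
          topK.offer(new float[] {score, docId});
        }
      }
    }
    return topK;
  }
}
```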
Learned Sparse Vectors seem to be naturally compatible with inverted indexes, and many aspects of the algorithm are already implemented in Lucene. Could we use this for faster KNN search when sparse vectors are used?
What I wonder is: how can Lucene help with this? I feel like we have all the primitives available to enable SPLADE-style search and retrieval, but maybe there is something missing? IIRC there needs to be a per-term score, but I think we do have the ability to store custom term frequencies and to override the similarity in order to do the appropriate combination of these term scores, so I think those are the ingredients needed for this. Maybe it's a case of trying it and seeing if there are some features it would be helpful to embed in the Lucene layer, or if indeed we can build this "on top"?
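To make that concrete, here is a minimal sketch of the "on top" route using FeatureField, which already stores a per-term weight in the term frequency (with reduced precision). I'm assuming FeatureField.newLinearQuery is available (it is in recent Lucene releases; newSaturationQuery would be the fallback otherwise), and the "splade" field name and all weights are made up:

```java
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SpladeOnFeatureFieldSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // One FeatureField per non-zero dimension of the learned sparse vector; the
      // weight is stored in the term frequency (with reduced precision).
      Map<String, Float> docVector = Map.of("car", 2.1f, "battery", 1.9f, "vehicle", 0.6f);
      Document doc = new Document();
      docVector.forEach((term, weight) -> doc.add(new FeatureField("splade", term, weight)));
      writer.addDocument(doc);
    }

    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Query-side sparse vector; each matching term contributes queryWeight * docWeight,
      // so the disjunction scores an (approximate) dot product.
      Map<String, Float> queryVector = Map.of("car", 1.2f, "battery", 0.8f);
      BooleanQuery.Builder builder = new BooleanQuery.Builder();
      queryVector.forEach((term, weight) ->
          builder.add(FeatureField.newLinearQuery("splade", term, weight), BooleanClause.Occur.SHOULD));
      TopDocs topDocs = searcher.search(builder.build(), 10);
      System.out.println("hits: " + topDocs.totalHits);
    }
  }
}
```

The disjunction of linear feature queries scores each hit by the sum over matching terms of queryWeight * indexedWeight, i.e. an approximate dot product between the query and document sparse vectors, and it should benefit from the existing block-max WAND pruning.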
There might be a better format than just terms, but I would assume the bipartite graph partitioning (BP) reordering stuff would help here.
Additionally, I would expect the biggest gains to come at query time. Looking at the newer research out of the SPLADE folks, they are optimizing query time by adjusting the scoring, not really the storage.
Maybe a better query would be a good first step.
I found this recent paper by well-known people in the IR efficiency space quite interesting: https://arxiv.org/pdf/2405.01117. It builds on inverted indexes and simple/intuitive ideas:
- BP reordering, which Ben alluded to and which Lucene already supports; it naturally clusters documents with similar terms together,
- Block-max WAND, which Lucene supports,
- Anytime ranking on document-ordered indexes (https://arxiv.org/pdf/2104.08976), i.e. ranking ranges of doc IDs that have the best impact scores first in order to optimize pruning. Something Lucene doesn't support at the moment but that looks doable and generally useful.
- Unsafe top-k search via termination conditions and skipping blocks that are barely more competitive than the current top-k-th hit (see the small sketch after this list).
- Query term pruning, which sounds like a good idea in general for learned sparse retrieval when the model generates many terms.
- Scoring high-frequency / low-scoring terms via a forward index instead of an inverted index.
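Regarding the unsafe top-k point, the skipping condition is simple enough to show in isolation. This is a standalone illustration of the idea, not a Lucene API:

```java
// Standalone illustration (not a Lucene API) of the "unsafe" block-skipping condition:
// a block is skipped unless its maximum possible score beats the current k-th best
// score by a relative margin epsilon. With epsilon = 0 this degenerates to the safe
// block-max WAND condition; larger epsilon trades a little recall for speed.
public class UnsafeSkippingSketch {
  static boolean skipBlock(float blockMaxScore, float kthBestScore, float epsilon) {
    return blockMaxScore <= kthBestScore * (1 + epsilon);
  }

  public static void main(String[] args) {
    System.out.println(skipBlock(10.5f, 10.0f, 0.0f)); // false: safe skipping would still score this block
    System.out.println(skipBlock(10.5f, 10.0f, 0.1f)); // true: barely competitive, unsafely skipped
  }
}
```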
I have recently been interested in this direction and plan on spending a non-trivial amount of time on this over the next few weeks. Assuming we haven't started dev on this, I am assigning it to myself.
What I have been hacking on is impact-sorted prioritisation of docIDs during the ranking phase (especially for custom rankers), so that would probably be the first thing that comes out of this ticket.
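Roughly, the ordering step looks like the following sketch (names and data layout are mine, not existing Lucene APIs):

```java
import java.util.Comparator;
import java.util.List;

// Rough sketch of impact-sorted prioritisation: given per-range maximum impact scores,
// visit the ranges with the highest potential contribution first so that the top-k
// threshold rises quickly and later ranges can be pruned or abandoned under a budget.
// Names and data layout are illustrative, not existing Lucene APIs.
public class ImpactSortedRangesSketch {

  record DocIdRange(int minDocId, int maxDocId, float maxImpact) {}

  static List<DocIdRange> prioritize(List<DocIdRange> ranges) {
    return ranges.stream()
        .sorted(Comparator.comparingDouble(DocIdRange::maxImpact).reversed())
        .toList();
  }

  public static void main(String[] args) {
    List<DocIdRange> ranges = List.of(
        new DocIdRange(0, 999, 1.2f),
        new DocIdRange(1000, 1999, 3.4f));
    // Prints the high-impact range [1000, 1999] first.
    prioritize(ranges).forEach(System.out::println);
  }
}
```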
@atris may I know if there's any progress on this?
Yes, been playing with some stuff. Should be able to get something related up for review soon
Hi @atris, thanks for your contribution.
I found this paper (Bridging Dense and Sparse Maximum Inner Product Search) pretty interesting as it caters to the skip list structure within Lucene: https://arxiv.org/pdf/2309.09013. We are considering implementing this algorithm. If possible, can we have a call to discuss?