sqlite-vec icon indicating copy to clipboard operation
sqlite-vec copied to clipboard

Tracking issue: ANN (Approximate Nearest Neighbors) Index

Open asg017 opened this issue 1 year ago • 16 comments

sqlite-vec as of v0.1.0 will be brute-force search only, which slows down on large datasets (>1M w/ large dimensions). I want to include some form of approximate nearest neighbors search before v1, which trades accuracy/resource usage for speed.

This issue is a general "tracking issue" for how ANN will be implemented in sqlite-vec. The open questions I have:

Which ANN index should we use?

We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.

The main options I see:

  • IVF: Like Faiss IndexIVF, pre-compute centroids of a training dataset, search nearest N centroids, etc.
  • HNSW: Should "scale" more, way more params to tune. Would be pretty complicated to implement
  • DiskANN: Kinda dig the simplicitly of this, especially LM-DiskANN

Unsure which one will turn out best, will need to reseach more. It's possible we add support for all these options.

How should one "declare" an index?

SQLite doesn't have custom indexes, so I think the best way would be to include index info in the CREATE VIRTUAL TABLE constructor. Like:

create virtual table vec_movies(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);

or:

create virtual table vec_movies(
  synopsis_embeddings float[768] index=hnsw(...)
);

syntax heavily depends what ANN index we pick. Also how would training work?

How would they work with metadata filtering?

How do we allow bruteforce + ANN on the same table?

How do we pick between KNN/ANN in a SQL query?

asg017 avatar Jun 21 '24 05:06 asg017

Would ustream work?

https://github.com/unum-cloud/usearch

They even have some sqlite stuff already

https://github.com/unum-cloud/usearch/blob/main/sqlite/README.md

neilkumar avatar Jun 25 '24 21:06 neilkumar

usearch is great! But they don't offer many "hooks" to their storage engine, which would be required for sqlite-vec. We'd want to store the index inside SQLite tables, and balance query time + random lookup times. Also the usearch SQLite functions are just scalar functions, nothing that accesses the HNSW index

Also I want to keep sqlite-vec as lightweight as possible, there's no outside dependencies and is a single .c/.h file. So I don't wanna complicate things with a C++ dependency

asg017 avatar Jun 25 '24 21:06 asg017

+1 for LM-DiskANN

irowberryFS avatar Aug 29 '24 15:08 irowberryFS

yoo bros, lets do this index okay :))

di-sukharev avatar Sep 02 '24 12:09 di-sukharev

@irowberryFS Interestingly, libSQL uses LM-DiskANN for the vector index: https://turso.tech/blog/approximate-nearest-neighbor-search-with-diskann-in-libsql

Boscop avatar Sep 23 '24 23:09 Boscop

For those of you who want this, I'd love to hear your thoughts on these questions:

  • When does the vec0 brute-force approach start to break down for you? How many vectors are you storing and how many dimensions, and how long does a KNN search take?
  • Have you tried scalar/binary quantization to reduce search times? Is the quality reduction good-enough for your use-case?
  • Would "fast inserts" be required in your applications? Some ANN indexes (ex IVF) require a "training process" before inserting real vectors that can take several seconds or minutes, and other ANN indexes can insert in "real-time" (ex DiskANN) but are individually slow as they require a graph traversal. Would your setup prefer one approach over the other?

asg017 avatar Oct 08 '24 18:10 asg017

+1 for LM-DiskANN

imotai avatar Oct 09 '24 10:10 imotai

When does the vec0 brute-force approach start to break down for you? How many vectors are you storing and how many dimensions, and how long does a KNN search take?

Short answer

Brute force scales up to 8000 500-word pages or 800 10-page articles on commodity hardware.

Longer answer

Assuming retrieval should take no longer than 100ms, the brute force approach can handle about 250k embeddings on commodity hardware:

import numpy as np

N = 250_000  # Number of embeddings
d = 1024  # The 'median' embedding dimension [1]

embeddings = np.random.randn(N, d).astype(np.float32)
query = np.random.randn(d).astype(np.float32)

%timeit embeddings @ query  # Core of the brute force approach

# Output on a Google Colab instance:
# 85 ms ± 9.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And assuming you store one embedding per sentence (which would be standard in multi-vector retrieval), that's the equivalent of about 8000 500-word pages, or 800 10-page articles, or 40 200-page books.

Have you tried scalar/binary quantization to reduce search times?

I see binary embeddings as an optimisation, not a substitute for floating point embeddings. This optimisation could help to scale brute force with another factor of 2 without sacrificing retrieval quality [2], but definitely not an order of magnitude. For that, we'd need an ANN index.

Would "fast inserts" be required in your applications?

I'd be happy to sacrifice some insert speed in favour of sub-100ms near-100% recall retrieval for 1–10M+ embedding datasets.

[1] https://huggingface.co/spaces/mteb/leaderboard [2] https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors

lsorber avatar Oct 09 '24 20:10 lsorber

I'd like to throw my hat into the ring for LM-DiskANN.

Brute force search works great in LanceDB because it uses a contiguous columnar format. In SQLite there is an upper bound to performance because the data is inherently fragmented. By using DiskANN, you can actually leverage this fragmentation and use the B-tree to do the lookups for nearest neighbors. It seems uniquely suited to SQLite's storage engine.

Further, If you store vectors and edges in separate columns, you could use SQLite to find nodes without edges computed and then integrate them into the graph incrementally. In my understanding, this was one of the "flaws" of sqlite-vss: the FAISS index is one big blob and SQLite is not designed to store one big blob, hence the option for faiss_ondisk.

You could even use libSQL's implementation as a starting point, you'd just need to un-libSQL-ify it.

conradev avatar Oct 31 '24 00:10 conradev

In addition to Turso using DiskANN, it looks like Microsoft is already using this architecture in production in their Recall product: https://x.com/simonw/status/1798383577421525276

They even store vectors and edges in separate columns, like I suggested!

conradev avatar Oct 31 '24 01:10 conradev

@conradev agreed, I think DiskANN will eventually be the way to go!

I'm working on metadata filtering right now (#124 ), aiming to get it out in the next 2-3 weeks. After that's done, I'll be focusing on bring ANN algorithms to sqlite-vec.

Here's my rough plan for this:

  1. Start with an IVF+kmeans index, because (after training) inserts and KNN queries are pretty quick, and should help scale sqlite-vec to at least ~1 million vectors
  2. Add a proper DiskANN index, which won't require kmeans training

I'm wary of DiskANN because, in my experience, writes can be pretty slow due to the internal pruning process. I've tried libsql's implementations and building the initial index was pretty slow, even compared to vec0 virtual tables (which are slow compared to regular tables). Also DiskANN implementations can be pretty complex, while IVF will likely be straightforward.

I can imagine a future world where sqlite-vec has other ANN index attempts (HNSW, SPANN, etc), both those rely pretty heavily on in-memory data structures and like won't map well to SQLite shadow tables. And they're super complex!

In terms of timeline, metadata filters will come out mid-November, so I expect to work on ANN indexes November=December. Hopefully I can get some beta builds out during then, but I won't expect fully-supported ANN inside sqlite-vec until December/January.

In the meantime - if anyone else wants to share benchmark numbers of any papers/implementations of DiskANN or other ANN indexes, feel free to share in this thread!

asg017 avatar Oct 31 '24 20:10 asg017

Recent discussion on SPANN https://news.ycombinator.com/item?id=42028873

jlarmstrongiv avatar Nov 03 '24 14:11 jlarmstrongiv

This is the go to repo for ANN benchmarks: https://github.com/erikbern/ann-benchmarks

SeanPedersen avatar Nov 27 '24 02:11 SeanPedersen

+1 for LM-DiskANN

andreasmhahn avatar Dec 17 '24 10:12 andreasmhahn

In terms of timeline, metadata filters will come out mid-November, so I expect to work on ANN indexes November=December. Hopefully I can get some beta builds out during then, but I won't expect fully-supported ANN inside sqlite-vec until December/January.

I'd like to ask when the ANN feature will be released, I have over 2M rows of data right now, and I'm experiencing more noticeable performance issues when recalling vectors with 1024 dimensions, I'd like to try out the new index type as soon as possible if I can

warrenwyf avatar Mar 09 '25 15:03 warrenwyf

Might be of interest. https://github.com/timescale/pgvectorscale

jrandolf avatar Jul 15 '25 02:07 jrandolf