
Feature Request: 1-bit Embedding search

Open samueldashadrach opened this issue 11 months ago • 9 comments

What?

DragonflyDB supports vector embedding search via the FT.SEARCH command. However, it only supports search over vectors whose weights are float32 (4 bytes per weight).
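For context, the current float32 flow looks roughly like this (a minimal redis-py sketch following RediSearch conventions; the index name, key, and dimension are made up, and exact option support should be checked against the Dragonfly docs):

```python
import numpy as np
import redis

r = redis.Redis()

# Index hashes under doc: with a float32 vector field (FLAT = brute-force scan).
r.execute_command(
    "FT.CREATE", "idx", "ON", "HASH", "PREFIX", "1", "doc:",
    "SCHEMA", "vec", "VECTOR", "FLAT", "6",
    "TYPE", "FLOAT32", "DIM", "64", "DISTANCE_METRIC", "L2",
)

# Each vector is stored as raw float32 bytes: 4 bytes per weight.
r.hset("doc:1", mapping={"vec": np.random.rand(64).astype(np.float32).tobytes()})

# KNN query for the 10 nearest neighbours of a query vector.
q = np.random.rand(64).astype(np.float32).tobytes()
reply = r.execute_command(
    "FT.SEARCH", "idx", "*=>[KNN 10 @vec $q]",
    "PARAMS", "2", "q", q, "DIALECT", "2",
)
```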

Please prioritise supporting embedding search using 1-bit weights. It would also be nice to support intermediate levels like 2-bit and 4-bit embeddings, but 1-bit is what I'm most interested in.

You don't need to provide any logic for generating or quantising the embeddings; search is what matters most.

It is okay if your implementation's search time is 50 milliseconds worse than state-of-the-art. It will still be useful if it is as easy to use as the rest of DragonflyDB. Ease of use is an important factor that some libraries fail on.

Why?

In practice, quantising embeddings drastically reduces RAM requirements without significant loss in accuracy, so many devs are quantising their embeddings instead.

  • Accuracy loss is not significant; there is empirical data you can google on this, and I've observed the same in my own dataset.
  • Hosting 32-bit embeddings versus 1-bit embeddings can mean the difference between paying $10,000/month and $300/month in hosting costs. Devs working on independent projects or in small startups will obviously pick the latter.
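For a concrete sense of scale, here is the back-of-the-envelope arithmetic (the corpus size and dimension are illustrative, not from the thread):

```python
n_vectors, dims = 300_000_000, 1536

float32_bytes = n_vectors * dims * 4   # 4 bytes per weight
one_bit_bytes = n_vectors * dims // 8  # 1 bit per weight

print(f"{float32_bytes / 2**30:.0f} GiB")  # ~1717 GiB of raw vector data
print(f"{one_bit_bytes / 2**30:.0f} GiB")  # ~54 GiB: a 32x reduction
```

That raw 32x size ratio is roughly the same order as the hosting-cost gap above.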

This is one of the highest-revenue-generating use cases of the AI revolution. Perplexity recently raised $500 million at an $8 billion valuation just to build more embedding search. Pinecone is the most popular hosted solution for embedding search; it was last valued at $750 million.

You can use Perplexity or Claude or ChatGPT yourself to confirm that RAG and embedding search add significant real value to users' lives; they are not just buzzwords.

Whichever library is able to support devs for this use case is very likely to increase in popularity.

Alternatives

I have been using Pinecone for smaller datasets, as it is the fastest way to host embeddings today IMO. However, it is expensive for large datasets, and it does not support 1-bit embeddings yet.

The FAISS library by Facebook is currently one of the easier-to-use open-source libraries for anyone who wants 1-bit embedding search.

ANN Benchmarks tries to benchmark existing embedding search libraries. Not all of these are easy to use, though, and not all of them support quantisation.

samueldashadrach avatar Jan 27 '25 02:01 samueldashadrach

Hi @samueldashadrach, thanks for your request. I will look into it and get back to you later on whether we will add this or not.

BagritsevichStepan avatar Jan 28 '25 10:01 BagritsevichStepan

Hi @BagritsevichStepan

I tried implementing this myself using BITOP XOR operations on DragonflyDB, using popcount client-side on the CPU, and using Lua scripting. All of them ran into the same problem: DragonflyDB expects every single key name to be passed over the network, which means too many round trips.

Is there a general way to perform some simple operation (addition, XOR, etc.) on a huge number of keys without being slowed down by the network? This feels like a limitation of Redis/DragonflyDB rather than a limitation of the hardware.

samueldashadrach avatar Mar 31 '25 07:03 samueldashadrach

Can you write here the algorithm that you are trying to implement?

romange avatar Mar 31 '25 07:03 romange

@romange Thanks for replying

Given ~300M 64-bit values (and their keys) and a 64-bit query value, find the 10 keys whose values have the lowest Hamming distance to the query value.
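For reference, the brute-force version of this is just XOR plus popcount with a top-k selection; a minimal client-side sketch, assuming the values already sit in memory (Python 3.10+ for int.bit_count):

```python
import heapq

def top_k_hamming(query: int, items: dict[str, int], k: int = 10) -> list[tuple[int, str]]:
    """Return the k (distance, key) pairs with the lowest Hamming distance to query."""
    # XOR leaves a 1 bit wherever the two values differ; popcount counts those bits.
    return heapq.nsmallest(k, ((value ^ query).bit_count(), key)
                           for key, value in items.items())
```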

Approaches I tried:

  • Put all 64-bit values in a Dragonfly set and store a separate hash that maps values back to keys. On receiving a query: perform SSCAN, then BITOP XOR each value with the query value into a temporary key and BITCOUNT the result, collecting the counts (say in ALL_RESULTS). At the end of every SSCAN batch, check which BITCOUNT results fall below the threshold d and save those keys locally.
  • Same as above, but with the BITOP XOR and BITCOUNT operations inside a Lua script. Unfortunately, I realised this means I need to pass all 300M values from client to server anyway, so it didn't help much.
  • Same as above, but do the XOR and popcount inside a local script (a Perl script), using DragonflyDB only for storing the values; after SSCAN, just GET. (A sketch of this variant follows the list.)
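Roughly, that third variant looks like this in redis-py (a sketch: the set name `values` and hash name `value_to_key` are made up, and values are assumed to be stored as decimal strings):

```python
import heapq
import redis

r = redis.Redis()

def search(query: int, k: int = 10, batch: int = 10_000) -> list[tuple[int, str]]:
    """Stream values with SSCAN and do the XOR + popcount client-side."""
    best: list[tuple[int, int]] = []        # (-distance, value): a size-k max-heap
    cursor = 0
    while True:
        cursor, values = r.sscan("values", cursor=cursor, count=batch)
        for raw in values:
            value = int(raw)
            dist = (value ^ query).bit_count()  # Hamming distance (Python 3.10+)
            heapq.heappush(best, (-dist, value))
            if len(best) > k:
                heapq.heappop(best)             # evict the current worst candidate
        if cursor == 0:
            break
    top = sorted((-neg, value) for neg, value in best)
    # Map the surviving values back to their keys via the value -> key hash.
    keys = r.hmget("value_to_key", [str(value) for _, value in top])
    return [(dist, key.decode()) for (dist, _), key in zip(top, keys)]
```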

I am aware there are faster algos out there for lowest-Hamming-distance search. The general problem I want to solve is how to iterate through a large number of keys and perform some simple operation on them, as tomorrow I may want to implement a different algo.

samueldashadrach avatar Mar 31 '25 11:03 samueldashadrach

But with option (1) you have only 2 keys, right? I.e. the SET with 300M items and ALL_RESULTS.

romange avatar Mar 31 '25 11:03 romange

You can avoid predeclaring keys if you put

--!df flags=allow-undeclared-keys,disable-atomicity

at the beginning of the script: https://www.dragonflydb.io/docs/managing-dragonfly/scripting#script-flags
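A hypothetical sketch of how approach (2) could be restructured with those flags, so the SSCAN / BITOP XOR / BITCOUNT loop runs entirely server-side. It assumes each 64-bit value is also stored as an 8-byte string at `val:<member>`; all key names and the threshold are made up:

```python
import struct
import redis

r = redis.Redis()

HAMMING_SCAN = """--!df flags=allow-undeclared-keys,disable-atomicity
local query_key = KEYS[1]
local threshold = tonumber(ARGV[1])
local matches = {}
local cursor = "0"
repeat
  local reply = redis.call("SSCAN", "values", cursor, "COUNT", 10000)
  cursor = reply[1]
  for _, member in ipairs(reply[2]) do
    -- XOR the stored 8-byte value against the query, then popcount,
    -- all server-side: no per-key round trip back to the client.
    redis.call("BITOP", "XOR", "xor_tmp", query_key, "val:" .. member)
    if redis.call("BITCOUNT", "xor_tmp") <= threshold then
      matches[#matches + 1] = member
    end
  end
until cursor == "0"
return matches
"""

r.set("query", struct.pack("<Q", 0x0123456789ABCDEF))  # 8-byte query value
matches = r.eval(HAMMING_SCAN, 1, "query", 20)         # threshold d = 20
```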

romange avatar Mar 31 '25 11:03 romange

@romange Oh! Thanks. I was missing this. Will try with this and let you know if it is fast enough.

samueldashadrach avatar Mar 31 '25 12:03 samueldashadrach

Probably not fast for search; 300M is 300M. Did you quantise the vectors on your side? So instead of, say, 768 bytes you have a bunch of already-quantised 64-bit vectors?

romange avatar Mar 31 '25 12:03 romange

@romange I'm trying to implement an algo called a distance-sensitive Bloom filter. It is not a state-of-the-art algo for embedding search, but it is conceptually simple; Google's ScaNN and Microsoft's DiskANN are closer to state-of-the-art. I'm generally new to embedding search, so I'm interested in implementing different algos to get practical experience.

Big ANN Benchmarks (https://github.com/harsha-simhadri/big-ann-benchmarks) has benchmarks; that table will probably look different 6 months from now, so I'm not very locked in on using any particular algo.

The common idea in most of these algos is to split the entire vector dataset into many parts. Then, given a query, quickly identify which parts are most likely to contain matches, and spend the bulk of the time and memory searching only those parts.


dsbf: Generate H hyperplanes at the start. For each vector, store one bit per hyperplane indicating which side of the hyperplane it lies on, to get an H-bit signature. (The hyperplanes have effectively divided the original space into 2^H regions.) Given the query's signature, find stored signatures with low Hamming distance. This has some false positive rate, so you have to rerank the returned results.

I was trying to do this with H=64.
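A minimal sketch of that signature step (random hyperplanes through the origin; the vector dimension here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H, dims = 64, 1536                        # 64 hyperplanes over 1536-dim vectors
hyperplanes = rng.standard_normal((H, dims))

def signature(vec: np.ndarray) -> int:
    """Pack one 'which side of each hyperplane' bit into a 64-bit integer."""
    sides = (hyperplanes @ vec) > 0       # sign of the dot product per hyperplane
    sig = 0
    for i, bit in enumerate(sides):
        if bit:
            sig |= 1 << i
    return sig

# Candidate filter: keep signatures within Hamming distance d of the query's,
# i.e. (signature(a) ^ signature(b)).bit_count() <= d, then rerank exactly.
```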

google scann: Split vectors into subvectors (say 1536 dims -> 8 subvectors of 192 dims each), then run k-means clustering in each subvector space (8 of them here) to get k centroids per space. For each vector, store the index of its nearest centroid and its distance from that centroid. Given the query's subvectors, find the nearest centroids and then the vectors nearest to those centroids. Here too there is a false positive rate, so reranking is needed.
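The underlying technique here is classic product quantisation (ScaNN adds its own anisotropic loss on top). A rough numpy/scikit-learn sketch of the encoding step, with all sizes illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

dims, n_sub, k = 1536, 8, 256
sub = dims // n_sub                       # 192 dims per subvector

def train_codebooks(vectors: np.ndarray) -> list[KMeans]:
    """Fit one k-means codebook per subvector space."""
    return [KMeans(n_clusters=k, n_init=1).fit(vectors[:, i * sub:(i + 1) * sub])
            for i in range(n_sub)]

def encode(vec: np.ndarray, codebooks: list[KMeans]) -> np.ndarray:
    """Each vector compresses to n_sub centroid indices (8 bytes when k=256)."""
    return np.array(
        [cb.predict(vec[i * sub:(i + 1) * sub].reshape(1, -1))[0]
         for i, cb in enumerate(codebooks)],
        dtype=np.uint8,
    )
```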

hnsw: Stores vectors in a layered graph with two kinds of edges: long-range edges to far neighbours in the upper layers, and short-range edges to near neighbours in the lower layers. Search uses the long-range edges first for a coarse pass to find roughly which region has good matches, then the short-range edges for a fine-grained search.
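For comparison, the hnswlib library packages this whole scheme behind a small API (an illustrative snippet, not part of the thread):

```python
import hnswlib
import numpy as np

dims, n = 1536, 10_000
data = np.random.rand(n, dims).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dims)
index.init_index(max_elements=n, ef_construction=200, M=16)  # M = edges per node
index.add_items(data, np.arange(n))

index.set_ef(50)                           # search-time accuracy/speed knob
labels, distances = index.knn_query(data[:1], k=10)
```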

samueldashadrach avatar Mar 31 '25 12:03 samueldashadrach