feat: Vector search
Unless we have an open-source LLM with a 1M+ token context length, we need vector search for the assistant API: https://github.com/mudler/LocalAI/issues/1273#issuecomment-1971736182
Even with a very large context length, it is still far cheaper to use vector search with embeddings, and it can all easily be done on CPU.
## Implementation
I see three main options for adding vector search:
1. Simple in-memory brute-force search. We regenerate the embeddings instead of saving them to storage.
2. Add one or more vector databases as a backend.
3. Connect to an external database.
The first is easy to implement and has no upkeep because we flush everything on restart. If we want to change the chunking size or any other hyperparameter, it costs us no more than a restart. There is plenty of prior art in Go (a brute-force sketch follows the list below):
- https://github.com/marekgalovic/anndb
- https://github.com/aws-samples/gofast-hnsw/?tab=readme-ov-file#brute-search-performance
- Milvus, Weaviate, Gorse
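To make option 1 concrete, here is a minimal sketch of an in-memory brute-force search, assuming cosine similarity over float32 embeddings; all names here are illustrative, not a proposal for the final code:

```go
package store

import (
	"math"
	"sort"
)

// Entry pairs an embedding with its payload. Everything lives in memory
// and is regenerated on restart, so there is no persistence to maintain.
type Entry struct {
	Vector  []float32
	Payload []byte
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// Find brute-forces the topK entries most similar to query.
// It sorts the slice in place; fine for a sketch, wasteful at scale.
func Find(entries []Entry, query []float32, topK int) []Entry {
	sort.Slice(entries, func(i, j int) bool {
		return cosine(entries[i].Vector, query) > cosine(entries[j].Vector, query)
	})
	if topK > len(entries) {
		topK = len(entries)
	}
	return entries[:topK]
}
```

This is O(n·d) per query, which should be fine for the modest document counts the assistant API is likely to see at first.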
Even implementing HNSW or Annoy would not be difficult. The main problems I see are the classic database issues, so I am in favor of doing 1 or 3, with no in-between. That said, saving embeddings to a flat file could be OK, just not in the first iteration.
I did make an experiment using BadgerDB, but talked myself out of it: https://github.com/richiejp/badger-cybertron-vector/blob/main/main.go. The problem is that it complicates comparing the vectors, and then we also have to maintain state between restarts.
## API
Obviously we will follow the OpenAI API as in #1273, but I think it would also make sense to have some API for simple search without an LLM, just so people can do fuzzy search with LocalAI instead of reaching for another tool. Suggestions for how this API should look are welcome.
Personally, my thought is that we should aim for something like 2 in order to get both 1 and 3. I think we should first define the interface we require from a vector search system, and then allow the user to select their vector search backend via configuration. I'll definitely need to do some research to see if what I'm proposing even makes sense, but I assume that no matter the vector search backend, the interface we need to interact with should be fairly constant.
I'm assuming that in many production cases, people will want to use an external vector search database, as they will definitely have better performance than anything we make :D
However, for the sake of our tests and quick development cycles, I like the idea of a really quick "in memory" backend - the fewer external dependencies in that case the better.
Notably, I don't think these should be exactly the same as our gRPC generation backends; this might be better accomplished with a simple Go interface.
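To sketch what such an interface could look like, here is a hypothetical Go definition; the names and method set are illustrative only, with the Find method standing in for the TopK query every backend would need to support:

```go
package store

import "context"

// VectorStore is a hypothetical interface that an in-memory backend,
// a flat-file backend, or an adapter for an external database
// (Milvus, Weaviate, ...) could all implement.
type VectorStore interface {
	// Set upserts embedding/payload pairs.
	Set(ctx context.Context, keys [][]float32, values [][]byte) error
	// Delete removes the entries stored under the given embeddings.
	Delete(ctx context.Context, keys [][]float32) error
	// Get returns the payloads stored under exact-match embeddings.
	Get(ctx context.Context, keys [][]float32) ([][]byte, error)
	// Find returns the topK entries nearest to key, with similarity scores.
	Find(ctx context.Context, key []float32, topK int) (keys [][]float32, values [][]byte, similarities []float32, err error)
}
```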
I went ahead and added to the gRPC backend before seeing your post (I'll create a WIP PR shortly). Possibly it's too much of a break from the existing backends and is overloading the interface. However, my feeling is also that a simple interface can cover most use cases, as you say. I created an interface that is similar to a basic key-value store and is column-oriented, like most of the vector databases I have seen.
It won't cover hybrid (i.e. non-vector) searches or creating indexes. I can see uses for those, but for now I hope it is enough to just split the entries into groups.
```proto
service Backend {
  ...
  rpc StoresSet(StoresSetOptions) returns (Result) {}
  rpc StoresDelete(StoresDeleteOptions) returns (Result) {}
  rpc StoresGet(StoresGetOptions) returns (StoresGetResult) {}
  rpc StoresFind(StoresFindOptions) returns (StoresFindResult) {}
}

message StoresKey {
  // TODO: Add shard/database/file ID to separate unrelated embeddings
  repeated float Floats = 1;
}

message StoresValue {
  bytes Bytes = 1;
}

message StoresSetOptions {
  repeated StoresKey Keys = 1;
  repeated StoresValue Values = 2;
}

message StoresDeleteOptions {
  repeated StoresKey Keys = 1;
}

message StoresGetOptions {
  repeated StoresKey Keys = 1;
}

message StoresGetResult {
  repeated StoresKey Keys = 1;
  repeated StoresValue Values = 2;
}

message StoresFindOptions {
  StoresKey Key = 1;
  int32 TopK = 2;
}

message StoresFindResult {
  repeated StoresKey Keys = 1;
  repeated StoresValue Values = 2;
  repeated float Similarities = 3;
}
```
The PR now implements an internal gRPC API with vector search. The next step, in my current thinking, is to create an HTTP API that mirrors the gRPC one. Then some e2e testing can be done with an external script or with the HTTP Go tests.
I added an HTTP API which mirrors the gRPC API, along with some very basic tests for it.
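To illustrate the mirroring, a client call against such an HTTP endpoint could look like the sketch below; the /stores/find route and the JSON field names are assumptions on my part, not the final API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// findRequest is a hypothetical JSON mirror of StoresFindOptions.
type findRequest struct {
	Key  []float32 `json:"key"`
	TopK int       `json:"topk"`
}

// findResponse is a hypothetical JSON mirror of StoresFindResult.
type findResponse struct {
	Keys         [][]float32 `json:"keys"`
	Values       []string    `json:"values"`
	Similarities []float32   `json:"similarities"`
}

func main() {
	body, _ := json.Marshal(findRequest{Key: []float32{0.1, 0.2, 0.3}, TopK: 5})
	// "/stores/find" is an assumed route mirroring the StoresFind RPC.
	resp, err := http.Post("http://localhost:8080/stores/find",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out findResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	for i, v := range out.Values {
		fmt.Printf("%.3f %s\n", out.Similarities[i], v)
	}
}
```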
Ah, and now I see ColBERT: https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/
Probably, at the very least, an ID field is needed so that the embedding vector itself is not being used as an ID.