chatgpt-retrieval-plugin
Support OpenSearch k-NN as a vector datastore
OpenSearch supports approximate vector search powered by the Lucene, nmslib, and faiss engines, as well as brute-force vector search using Painless scripting functions. As OpenSearch is a popular search engine, it would be good to have it available as one of the supported vector databases.
Hi team, I would like to contribute this integration.
Please have a look at the issues with Lucene discussed in #52.
Hi @sebastian-montero, I have replied at https://github.com/openai/chatgpt-retrieval-plugin/issues/52#issuecomment-1487400627.
Pasting the same response here.
OpenSearch supports various other engines apart from Lucene to do k-NN search. Ref: https://opensearch.org/docs/latest/search-plugins/knn/approximate-knn/
The Approximate k-NN search methods leveraged by OpenSearch use approximate nearest neighbor (ANN) algorithms from the [nmslib](https://github.com/nmslib/nmslib), [faiss](https://github.com/facebookresearch/faiss), and [Lucene](https://lucene.apache.org/) libraries to power k-NN search.
nmslib and faiss do not have this limitation of 1024 dimensions. With these engines, OpenSearch supports 16k dimensions. Code ref: https://github.com/opensearch-project/k-NN/blob/f6d3d40f5a29a4c54672e7fe7b76def71760c4de/src/main/java/org/opensearch/knn/index/util/KNNEngine.java#L36
Apart from this, the way OpenSearch and Elasticsearch build indexing and search requests is very different. We should not combine them into a single datastore.
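To make the engine choice concrete, here is a minimal sketch of an index-creation body for an OpenSearch k-NN index, written as a plain Python dict. The helper name `build_knn_index_body` and the field name `embedding` are hypothetical, and the per-engine dimension limits are just the figures quoted in this thread (1024 for Lucene, roughly 16k for nmslib and faiss), so they should be checked against the OpenSearch version in use.

```python
# Dimension limits as quoted in this thread; assumptions, verify for your version.
MAX_DIMENSIONS = {"lucene": 1024, "nmslib": 16000, "faiss": 16000}


def build_knn_index_body(dimension, engine="nmslib", field="embedding"):
    """Return an index body with a knn_vector field (hypothetical helper)."""
    if engine not in MAX_DIMENSIONS:
        raise ValueError(f"unknown engine: {engine}")
    if dimension > MAX_DIMENSIONS[engine]:
        raise ValueError(
            f"{engine} supports at most {MAX_DIMENSIONS[engine]} dimensions, "
            f"got {dimension}"
        )
    return {
        # Mark the index as a k-NN index.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": engine,
                    },
                }
            }
        },
    }


# 1536 is the vector size of OpenAI's text-embedding-ada-002 model.
body = build_knn_index_body(1536, engine="nmslib")
```

With `engine="lucene"` the same call raises a `ValueError`, which mirrors the 1024-dimension limitation mentioned above.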
I see OpenSearch from 2.4 supports metadata pre-filtering with the Lucene engine.
Does it also support metadata pre-filtering with any of the other supported engines that do not have the same vector size limitations?
This would have a useful application for the chatgpt-retrieval-plugin: for example, pre-filtering based on the user's provided organisation id or region.
@jordanparker6 nmslib and faiss in OpenSearch do support filtering, but only as post-filters, so that is how customers can filter on metadata with these engines. The only downside of this approach is that post-filters can sometimes return no data when the matching results are very sparse. This can be mitigated by increasing the value of k. As per our benchmarks, faiss has 10-15 ms latency and nmslib has < 10 ms latency on a million-scale dataset. Increasing k from, say, 10 to 50 when using post-filters should suffice and should not noticeably increase latency.
Plus, with the law of large numbers and OpenSearch dividing the dataset across multiple shards, this should not be a prominent issue.
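As a rough illustration of the post-filter approach, the sketch below builds a search body that asks for more neighbours than it finally returns and then applies a metadata `post_filter`. The helper name, the `embedding` field, the `organisation_id` metadata field, and the over-fetch factor of 5 (10 results, k = 50) are assumptions for illustration only.

```python
def build_post_filter_query(vector, filters, size=10, overfetch=5, field="embedding"):
    """Build a k-NN search body with a post_filter (hypothetical helper).

    k is inflated to size * overfetch so that enough neighbours are likely
    to survive the filter, which runs after the k-NN search.
    """
    return {
        "size": size,
        "query": {
            "knn": {field: {"vector": list(vector), "k": size * overfetch}}
        },
        "post_filter": {
            "bool": {
                "filter": [
                    {"term": {name: value}} for name, value in filters.items()
                ]
            }
        },
    }


query = build_post_filter_query([0.1, 0.2, 0.3], {"organisation_id": "acme"})
```

If the filter is very selective, the result list can still come back shorter than `size`; raising `overfetch` is the knob described above.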
FYI: The Lucene engine supports pre-filtering, but it has the limitation of 1024 dimensions. Once the Lucene PR for increasing the dimensions is merged, this limitation will also be removed.
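For comparison, a filtered query against the Lucene engine places the filter inside the `knn` clause, so it is applied as part of the vector search rather than afterwards. This is only a sketch: the helper name and the `embedding` and `region` field names are hypothetical, and it assumes OpenSearch 2.4 or later with an index whose vector field uses the Lucene engine.

```python
def build_lucene_filtered_query(vector, filters, k=10, field="embedding"):
    """Build a k-NN search body with an in-query filter (hypothetical helper)."""
    return {
        "size": k,
        "query": {
            "knn": {
                field: {
                    "vector": list(vector),
                    "k": k,
                    # The filter sits inside the knn clause, unlike post_filter.
                    "filter": {
                        "bool": {
                            "must": [
                                {"term": {name: value}}
                                for name, value in filters.items()
                            ]
                        }
                    },
                }
            }
        },
    }


filtered = build_lucene_filtered_query([0.1, 0.2, 0.3], {"region": "eu"})
```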
Also, to provide more context: we have seen in our benchmarks that Lucene performs well when the dataset is in the few millions, but the moment we go beyond 10 million documents, its performance degrades. This is where nmslib and faiss scale well. Reference blog on testing faiss at billion scale: https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/
@jordanparker6 @sebastian-montero Could you please help get this reviewed? We would love to see this integration for the OpenSearch community.