langchain
Improvements for ElasticsearchVectorStore
- Added `setup_index` to the `VectorStore` interface for creating and setting up the data schema
- Added `DataSchemaBuilder` as a base interface for all vector stores
- Added a dedicated `data_schema_builder` for Elasticsearch that creates the index mapping
- Added `ElasticConf`, a context passed to `ElasticVectorStore` to supply the credentials to the object (not tested properly yet)
- Added `similarity_search_by_id` and `similarity_search_by_vector`, which separate the logic in `similar_search` more cleanly into methods
- Added `VectorStoreFilter`, a functionality to filter the metadata before ANN indexing
  - incorporated a query filter instead of the default Elasticsearch filter
  - implemented the concrete class `ElasticFilter`

close #834
I still haven't added code for unit tests. I will add it once the mentioned changes to the `VectorStore` interface are approved.
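To make the proposed interface change easier to discuss, here is a rough sketch of what the three new base-class methods could look like. The signatures, the `add` helper, and the `InMemoryVectorStore` class are assumptions made for illustration, not the code in this PR. In particular, `similarity_search_by_id` is read here as "find the documents most similar to the stored document with that ID", which may differ from the actual implementation.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Sequence, Tuple


class VectorStore(ABC):
    """Hypothetical base interface; method signatures are assumptions."""

    @abstractmethod
    def setup_index(self, index_name: str, dimension: int) -> None:
        """Create the index (or collection) and its data schema."""

    @abstractmethod
    def similarity_search_by_vector(self, vector: Sequence[float], k: int = 4) -> List[str]:
        """Return the IDs of the k documents closest to the given vector."""

    @abstractmethod
    def similarity_search_by_id(self, doc_id: str, k: int = 4) -> List[str]:
        """Return the IDs of the k documents closest to a stored document."""


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; defined as 0.0 when either vector has zero length.
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryVectorStore(VectorStore):
    """Toy implementation used only to exercise the interface."""

    def __init__(self) -> None:
        self.dimension: Optional[int] = None
        self.docs: Dict[str, Tuple[List[float], str]] = {}

    def setup_index(self, index_name: str, dimension: int) -> None:
        self.index_name = index_name
        self.dimension = dimension
        self.docs = {}

    def add(self, doc_id: str, vector: Sequence[float], text: str) -> None:
        if self.dimension is None:
            raise RuntimeError("call setup_index() before adding documents")
        if len(vector) != self.dimension:
            raise ValueError("vector has the wrong dimension")
        self.docs[doc_id] = (list(vector), text)

    def similarity_search_by_vector(self, vector: Sequence[float], k: int = 4) -> List[str]:
        ranked = sorted(
            self.docs.items(),
            key=lambda item: _cosine(vector, item[1][0]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in ranked[:k]]

    def similarity_search_by_id(self, doc_id: str, k: int = 4) -> List[str]:
        # Look up the stored vector, search with it, and drop the document itself.
        vector, _ = self.docs[doc_id]
        hits = self.similarity_search_by_vector(vector, k=k + 1)
        return [hit for hit in hits if hit != doc_id][:k]


store = InMemoryVectorStore()
store.setup_index("docs", dimension=2)
store.add("a", [1.0, 0.0], "first")
store.add("b", [0.9, 0.1], "second")
store.add("c", [0.0, 1.0], "third")
nearest_to_a = store.similarity_search_by_id("a", k=1)
```

Returning document IDs keeps the sketch short; a real store would presumably return document objects instead.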
Hi,

To answer the first question, regarding how generic the operations in the base `VectorStore` are: I went through the documentation of all subclasses of `VectorStore`.

In this PR I have added `setup_index`, `similarity_search_by_id`, and `similarity_search_by_vector` to the base class, so the new functionalities are creating an index or collection that represents a set of vectors, querying based on document ID, and querying based on the vector. So if all the subclass vector stores are able to do that, then we can add it to the base class.
And here's what I found for each of the different datastores:

Weaviate
- supports indexes, I believe the terminology is `Class object`
- supports querying based on ID
- supports querying based on vector

Qdrant
- supports indexes, I believe the terminology is `Collection`
- supports querying based on ID
- supports querying based on vector

Pinecone
- supports indexes
- supports querying based on ID; Pinecone doesn't have an ID for each document, but it can be stored in the metadata and generated by langchain if no doc_id is provided
- supports querying based on vector

Milvus
- supports indexes, I believe the terminology is `Collection`
- supports querying based on ID; again, Milvus doesn't have an ID for each document, but it can be stored in the metadata and generated by langchain if no doc_id is provided
- supports querying based on vector, using the data field at query time
As for Faiss, I'm not sure; I haven't checked it thoroughly. There is a concept of an index, but I don't think you can actually filter the data or search by metadata.

Please correct me on the above, but it seems most of the vector stores already support the implemented functionality.

My bad for the title though, it should have been something more generic. But I only implemented `ElasticVectorStore`; the new methods are not implemented for the other datastores.
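The fallback described above for stores without a native per-document ID (keep the ID in the metadata and generate one when no doc_id is provided) could look roughly like this. The helper name `with_doc_id` and the `doc_id` metadata key are assumptions for illustration only.

```python
import uuid
from typing import Any, Dict, Optional


def with_doc_id(
    metadata: Optional[Dict[str, Any]] = None,
    doc_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of the metadata that always carries a `doc_id` entry."""
    enriched = dict(metadata or {})  # copy, so the caller's dict is not mutated
    # Use the caller's ID if given, otherwise generate a random UUID string.
    enriched["doc_id"] = doc_id if doc_id is not None else str(uuid.uuid4())
    return enriched


explicit = with_doc_id({"source": "a.txt"}, doc_id="42")
generated = with_doc_id({"source": "b.txt"})
```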
@yasha-dev1 I like the idea of filtering within vector stores! I can help with the Qdrant implementation. I'm wondering how we could build some more complex queries (even thinking about simple AND/OR statements).

I'd say the `VectorStoreFilter` should be common for all the engines, so from the user's perspective there is no need to choose a different class if they decide to switch to a different provider. I imagine the `VectorStoreFilter` working as some sort of tree-like structure that keeps all the criteria, which then get parsed into the engine-specific structure, either by a particular `VectorStore` or, even better, by some sort of engine parser, but definitely not in the `VectorStoreFilter` directly.
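A minimal sketch of that idea, assuming a filter tree made of plain data nodes and a separate translator function per engine. The node names `Eq`, `And`, `Or` and the `to_predicate` translator are hypothetical. The "engine" here is just an in-memory metadata check, standing in for an engine-specific parser.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class Eq:
    """Leaf node: the metadata key must equal the value."""
    key: str
    value: Any


@dataclass(frozen=True)
class And:
    """All child criteria must hold."""
    children: Tuple["FilterNode", ...]


@dataclass(frozen=True)
class Or:
    """At least one child criterion must hold."""
    children: Tuple["FilterNode", ...]


FilterNode = Union[Eq, And, Or]


def to_predicate(node: FilterNode) -> Callable[[Dict[str, Any]], bool]:
    """Translate the engine-agnostic tree into an in-memory predicate.

    The nodes carry no engine logic; each engine would get its own
    translator with the same recursive shape.
    """
    if isinstance(node, Eq):
        return lambda metadata: metadata.get(node.key) == node.value
    if isinstance(node, And):
        parts = [to_predicate(child) for child in node.children]
        return lambda metadata: all(part(metadata) for part in parts)
    if isinstance(node, Or):
        parts = [to_predicate(child) for child in node.children]
        return lambda metadata: any(part(metadata) for part in parts)
    raise TypeError(f"unsupported filter node: {node!r}")


criteria = And((Eq("lang", "en"), Or((Eq("year", 2022), Eq("year", 2023)))))
matches = to_predicate(criteria)
```

Switching providers would then mean swapping the translator while the user-facing filter tree stays the same.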
@kacperlukawski Yes, `VectorStoreFilter` should be a general interface for all vector stores. The best solution that comes to my mind is to implement an ES-like interface. For now only the `And` operator is supported, but we can work on making it more general with an `OR` operator, something like the ES bool query.

Collaborations are more than welcome; I can add you as a collaborator.
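For reference, an ES-style bool query for the two operators might be built along these lines. The helper names and the `metadata.` field prefix are assumptions, and the sketch assumes the metadata fields are mapped as `keyword` so that `term` queries match exactly; it is not the `ElasticFilter` from this PR.

```python
from typing import Any, Dict, List


def _term_clauses(conditions: Dict[str, Any]) -> List[Dict[str, Any]]:
    # One exact-match `term` clause per metadata key (field prefix is an assumption).
    return [{"term": {f"metadata.{key}": value}} for key, value in conditions.items()]


def and_filter(conditions: Dict[str, Any]) -> Dict[str, Any]:
    """All conditions must match: clauses go in the bool query's `filter` list."""
    return {"bool": {"filter": _term_clauses(conditions)}}


def or_filter(conditions: Dict[str, Any]) -> Dict[str, Any]:
    """Any condition may match: clauses go in `should`, with at least one required."""
    return {
        "bool": {
            "should": _term_clauses(conditions),
            "minimum_should_match": 1,
        }
    }


and_query = and_filter({"lang": "en", "source": "wiki"})
or_query = or_filter({"lang": "en"})
```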
I too would like this functionality, is there any way I can help @yasha-dev1?