langchain Improvements for ElasticsearchVectorStore

Added setup_index to the VectorStore interface for creating and setting up the data schema
Added DataSchemaBuilder as a base interface for all vector stores
Added an Elasticsearch-specific data_schema_builder that creates the index mapping in Elasticsearch
Added ElasticConf, a context object passed to ElasticVectorStore to carry the credentials (not properly tested yet)
Added similarity_search_by_id and similarity_search_by_vector, which split the logic of similarity_search into separate, more precise methods
Added VectorStoreFilter, which filters on metadata before ANN indexing
Incorporated a query filter instead of the default Elasticsearch filter
Implemented the concrete class ElasticFilter
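
As a rough illustration of the interface changes listed above, here is a minimal sketch. Only the names setup_index, similarity_search_by_id, similarity_search_by_vector, VectorStore and DataSchemaBuilder come from the description; the signatures, the Document stand-in, the ElasticDataSchemaBuilder class and the field names in the mapping are assumptions and may differ from the actual PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a stored document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class DataSchemaBuilder(ABC):
    """Base interface: each store describes its own schema."""

    @abstractmethod
    def build(self) -> Dict[str, Any]:
        ...


class ElasticDataSchemaBuilder(DataSchemaBuilder):
    """Assumed shape of an Elasticsearch index mapping builder."""

    def __init__(self, dims: int) -> None:
        self.dims = dims

    def build(self) -> Dict[str, Any]:
        # Field names ("vector", "text", "metadata") are illustrative only.
        return {
            "mappings": {
                "properties": {
                    "vector": {"type": "dense_vector", "dims": self.dims},
                    "text": {"type": "text"},
                    "metadata": {"type": "object"},
                }
            }
        }


class VectorStore(ABC):
    """Sketch of the base class with the three proposed methods."""

    @abstractmethod
    def setup_index(self, schema_builder: DataSchemaBuilder) -> None:
        ...

    @abstractmethod
    def similarity_search_by_id(self, doc_id: str, k: int = 4) -> List[Document]:
        ...

    @abstractmethod
    def similarity_search_by_vector(
        self, embedding: List[float], k: int = 4
    ) -> List[Document]:
        ...
```

A concrete store would implement the three abstract methods and pass the dictionary returned by build() to its client when creating the index.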

close #834

Feb 06 '23 07:02 yasha-dev1

I haven't added unit tests yet. I will add them once the mentioned changes to the VectorStore interface are approved.

Feb 06 '23 07:02 yasha-dev1

Hi. To answer the first question, about how generic the operations in the base VectorStore are: I went through the documentation of all subclasses of VectorStore. In this PR I added setup_index, similarity_search_by_id, and similarity_search_by_vector to the base class, so the new functionalities are creating an index or collection that represents a set of vectors, querying by document ID, and querying by vector. If all the subclass vector stores can do that, then we can add these methods to the base class.

Here's what I found for each datastore:

Weaviate

supports indexes; I believe the term there is Class object
supports querying based on ID
supports querying based on vector

Qdrant

supports indexes; I believe the term there is Collection
supports querying based on ID
supports querying based on vector

Pinecone

supports indexes
supports querying based on ID; Pinecone doesn't have an ID for each document, but one can be stored in the metadata and generated by langchain if no doc_id is provided
supports querying based on vector

Milvus

supports indexes; I believe the term there is Collection
supports querying based on ID; again, Milvus doesn't have an ID for each document, but one can be stored in the metadata and generated by langchain if no doc_id is provided
supports querying based on vector, using the data field at query time
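
The fallback suggested for Pinecone and Milvus above (store an ID in the metadata and generate one when no doc_id is provided) could look roughly like this. The helper name with_doc_id and the metadata key "doc_id" are assumptions made for illustration, not part of the PR.

```python
import uuid
from typing import Any, Dict, Optional


def with_doc_id(
    metadata: Optional[Dict[str, Any]] = None,
    doc_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of the metadata that always carries a "doc_id" entry.

    If the caller supplies no doc_id, a random UUID4 string is generated.
    """
    enriched = dict(metadata or {})  # copy, so the caller's dict is untouched
    enriched["doc_id"] = doc_id if doc_id is not None else str(uuid.uuid4())
    return enriched
```

A lookup by ID would then become a metadata filter on "doc_id" in stores where that is the only option.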

As for Faiss, I'm not sure, since I haven't checked it thoroughly. There is a concept of an index, but I don't think you can actually filter the data or search by metadata. Please correct me on the above, but it seems most of the vector stores already support the implemented functionality. My bad for the title, though: it should have been something more generic. However, I only implemented the new methods for ElasticVectorStore; they are not implemented for the other datastores.

Feb 07 '23 08:02 yasha-dev1

@yasha-dev1 I like the idea of filtering within vector stores! I can help with Qdrant implementation. I'm wondering how we could build some more complex queries (even thinking about simple AND/OR statements).

I'd say the VectorStoreFilter should be common to all the engines, so from the user's perspective there is no need to choose a different class if they decide to switch to a different provider. I imagine the VectorStoreFilter as some sort of tree-like structure that keeps all the criteria, which then get parsed into the engine-specific structure. That parsing could be done by a particular VectorStore or, even better, by some sort of engine parser, but definitely not in the VectorStoreFilter directly.
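
To make the tree idea concrete, here is a minimal sketch of an engine-agnostic filter tree with a separate translator into an Elasticsearch bool query. The node classes Eq, And and Or and the function to_elasticsearch are hypothetical names; only the term query and the bool query clauses (filter, should, minimum_should_match) are standard Elasticsearch query DSL.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple, Union


@dataclass(frozen=True)
class Eq:
    """Leaf node: metadata key must equal value."""
    key: str
    value: Any


@dataclass(frozen=True)
class And:
    """All child clauses must match."""
    clauses: Tuple["FilterNode", ...]


@dataclass(frozen=True)
class Or:
    """At least one child clause must match."""
    clauses: Tuple["FilterNode", ...]


FilterNode = Union[Eq, And, Or]


def to_elasticsearch(node: FilterNode) -> Dict[str, Any]:
    """Translate the generic tree into Elasticsearch query DSL.

    The translation lives outside the filter classes, so another engine
    only needs its own translator function.
    """
    if isinstance(node, Eq):
        return {"term": {node.key: node.value}}
    if isinstance(node, And):
        return {"bool": {"filter": [to_elasticsearch(c) for c in node.clauses]}}
    if isinstance(node, Or):
        return {
            "bool": {
                "should": [to_elasticsearch(c) for c in node.clauses],
                "minimum_should_match": 1,
            }
        }
    raise TypeError(f"Unsupported filter node: {node!r}")


# Usage: source is "wiki" AND (lang is "en" OR lang is "de")
query = to_elasticsearch(
    And((
        Eq("metadata.source", "wiki"),
        Or((Eq("metadata.lang", "en"), Eq("metadata.lang", "de"))),
    ))
)
```

Keeping the nodes as plain immutable data means a Qdrant or Pinecone translator could walk the same tree without the filter classes knowing about any engine.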

Feb 17 '23 11:02 kacperlukawski

@kacperlukawski Yes, VectorStoreFilter should be a general interface for all vector stores. The best solution that comes to mind is to implement an ES-like interface. For now only the AND operator is supported, but we can work on making it more general with an OR operator, something like the ES bool query. Collaboration is more than welcome; I can add you as a collaborator.

Feb 17 '23 12:02 yasha-dev1

I too would like this functionality, is there any way I can help @yasha-dev1?

Mar 17 '23 14:03 SirBernardPhilip
