
Improvements for ElasticsearchVectorStore

Open yasha-dev1 opened this issue 2 years ago • 4 comments

  • Added setup_index to the VectorStore interface for creating and setting up the data schema
  • Added DataSchemaBuilder as a base interface for all vector stores
  • Added a dedicated data_schema_builder for Elasticsearch that creates the index mapping
  • Added ElasticConf, a context object passed to ElasticVectorStore to supply credentials (not properly tested yet)
  • Added similarity_search_by_id and similarity_search_by_vector, which split the logic in similar_search into more precise methods
  • Added VectorStoreFilter, which filters on metadata before ANN indexing
  • Incorporated a query filter instead of the default Elasticsearch filter
  • Implemented the concrete class ElasticFilter
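
To illustrate the schema-builder idea in the list above, here is a minimal sketch of what building an Elasticsearch index mapping could look like. The function name `build_index_mapping` and the field names `vector`, `text`, and `metadata` are assumptions for illustration, not the names used in this PR; only the `dense_vector` field type with its `dims` parameter comes from Elasticsearch itself.

```python
from typing import Any, Dict


def build_index_mapping(dims: int, vector_field: str = "vector") -> Dict[str, Any]:
    """Return an index body with a dense_vector field plus text and metadata.

    Hypothetical helper: the field names are illustrative assumptions.
    """
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "mappings": {
            "properties": {
                # Embedding of the document; dims must match the embedding model.
                vector_field: {"type": "dense_vector", "dims": dims},
                # Raw document text.
                "text": {"type": "text"},
                # Free-form metadata that a filter could later match against.
                "metadata": {"type": "object"},
            }
        }
    }


mapping = build_index_mapping(dims=1536)
```

A setup_index implementation could send a body like this in its index-creation request, so that the schema is decided once rather than inferred on first insert.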

close #834

yasha-dev1 • Feb 06 '23 07:02

I haven't added unit tests yet. I will add them once the proposed changes to the VectorStore interface are approved.

yasha-dev1 • Feb 06 '23 07:02

Hi. To answer the first question, about how generic the operations in the base VectorStore are: I went through the documentation of all subclasses of VectorStore. In this PR I added setup_index, similarity_search_by_id, and similarity_search_by_vector to the base class, so the new functionality is: creating an index or collection that represents a set of vectors, querying by document ID, and querying by vector. If all the subclass vector stores can do that, we can add these methods to the base class.

Here is what I found for each datastore:
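
As a rough sketch of the three operations named above, the base class could look something like the following. The signatures, the class names `VectorStoreSketch` and `InMemoryStore`, and the `add` helper are all assumptions made for illustration; the actual signatures in the PR may differ. The in-memory subclass exists only to show that the three methods compose.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class VectorStoreSketch(ABC):
    """Hypothetical base interface with the three proposed operations."""

    @abstractmethod
    def setup_index(self, dims: int) -> None:
        """Create the index or collection that will hold the vectors."""

    @abstractmethod
    def similarity_search_by_vector(self, vector: List[float], k: int = 4) -> List[str]:
        """Return the ids of the k stored vectors closest to `vector`."""

    @abstractmethod
    def similarity_search_by_id(self, doc_id: str, k: int = 4) -> List[str]:
        """Return the ids of the k documents closest to the stored document."""


def _cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryStore(VectorStoreSketch):
    """Toy brute-force implementation, for illustration only."""

    def __init__(self) -> None:
        self._dims: Optional[int] = None
        self._vectors: Dict[str, List[float]] = {}

    def setup_index(self, dims: int) -> None:
        self._dims = dims
        self._vectors = {}

    def add(self, doc_id: str, vector: List[float]) -> None:
        if self._dims is None:
            raise RuntimeError("call setup_index first")
        if len(vector) != self._dims:
            raise ValueError("vector has the wrong dimensionality")
        self._vectors[doc_id] = vector

    def similarity_search_by_vector(self, vector: List[float], k: int = 4) -> List[str]:
        ranked = sorted(
            self._vectors,
            key=lambda d: _cosine(self._vectors[d], vector),
            reverse=True,
        )
        return ranked[:k]

    def similarity_search_by_id(self, doc_id: str, k: int = 4) -> List[str]:
        # Look up the stored vector, search with it, and drop the document itself.
        hits = self.similarity_search_by_vector(self._vectors[doc_id], k + 1)
        return [h for h in hits if h != doc_id][:k]


store = InMemoryStore()
store.setup_index(dims=2)
store.add("a", [1.0, 0.0])
store.add("b", [0.9, 0.1])
store.add("c", [0.0, 1.0])
```

In this sketch, searching by ID is just searching by the stored vector, which is why a store that supports the latter and can fetch a vector by ID gets the former almost for free.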

Weaviate

Qdrant

Pinecone

Milvus

  • Supports indexes; I believe the Milvus term is Collection
  • Supports querying by ID. Again, Milvus doesn't have an ID for each document, but one can be stored in the metadata and generated by langchain if no doc_id is provided
  • Supports querying by vector, using the data field at query time
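
The ID fallback described in the second bullet could be sketched as follows. The helper name `with_doc_id` and the metadata key `doc_id` are assumptions for illustration; the point is only that a caller-supplied ID wins and a UUID is generated otherwise.

```python
import uuid
from typing import Any, Dict, Optional


def with_doc_id(
    metadata: Optional[Dict[str, Any]] = None,
    doc_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of `metadata` that is guaranteed to carry a doc_id.

    Hypothetical helper: an explicit doc_id takes priority, then an id already
    present in the metadata, and finally a freshly generated UUID.
    """
    enriched = dict(metadata or {})  # copy so the caller's dict is not mutated
    enriched["doc_id"] = doc_id or enriched.get("doc_id") or str(uuid.uuid4())
    return enriched


original = {"source": "wiki"}
explicit = with_doc_id(original, doc_id="doc-1")
generated = with_doc_id(original)
```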

As for Faiss, I'm not sure; I haven't checked it thoroughly. There is a concept of an index, but I don't think you can filter the data or search by metadata. Please correct me if any of the above is wrong, but it seems most of the vector stores already support the implemented functionality. My bad for the title, though: it should have been something more generic. I only implemented ElasticVectorStore; the new methods are not implemented for the other datastores.

yasha-dev1 • Feb 07 '23 08:02

@yasha-dev1 I like the idea of filtering within vector stores! I can help with the Qdrant implementation. I'm wondering how we could build more complex queries (even just simple AND/OR statements).

I'd say the VectorStoreFilter should be common to all the engines, so that from the user's perspective there is no need to choose a different class when switching to a different provider. I imagine the VectorStoreFilter as some sort of tree-like structure that keeps all the criteria, which then get parsed into the engine-specific structure, either by a particular VectorStore or, even better, by some sort of engine parser, but definitely not in the VectorStoreFilter directly.
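
A minimal sketch of that idea, assuming a tiny filter tree with only equality, AND, and OR nodes. The class names `Eq`, `And`, `Or` and the translator functions are hypothetical, and the metadata field paths are made up. The Elasticsearch output uses the bool query with term clauses; the Qdrant output assumes the must/should JSON shape of its filter conditions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple, Union


@dataclass(frozen=True)
class Eq:
    field: str
    value: Any


@dataclass(frozen=True)
class And:
    clauses: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    clauses: Tuple["Node", ...]


Node = Union[Eq, And, Or]


def to_elasticsearch(node: Node) -> Dict[str, Any]:
    """Translate the engine-agnostic tree into an Elasticsearch bool query."""
    if isinstance(node, Eq):
        return {"term": {node.field: node.value}}
    if isinstance(node, And):
        return {"bool": {"must": [to_elasticsearch(c) for c in node.clauses]}}
    if isinstance(node, Or):
        return {
            "bool": {
                "should": [to_elasticsearch(c) for c in node.clauses],
                "minimum_should_match": 1,
            }
        }
    raise TypeError(f"unsupported filter node: {node!r}")


def _qdrant_condition(node: Node) -> Dict[str, Any]:
    if isinstance(node, Eq):
        return {"key": node.field, "match": {"value": node.value}}
    if isinstance(node, And):
        return {"must": [_qdrant_condition(c) for c in node.clauses]}
    if isinstance(node, Or):
        return {"should": [_qdrant_condition(c) for c in node.clauses]}
    raise TypeError(f"unsupported filter node: {node!r}")


def to_qdrant(node: Node) -> Dict[str, Any]:
    """Translate the same tree into a Qdrant-style filter (assumed JSON shape)."""
    condition = _qdrant_condition(node)
    # A bare equality must be wrapped so the result is always a filter object.
    return {"must": [condition]} if isinstance(node, Eq) else condition


tree = And((
    Eq("metadata.author", "alice"),
    Or((Eq("metadata.year", 2022), Eq("metadata.year", 2023))),
))
es_filter = to_elasticsearch(tree)
qdrant_filter = to_qdrant(tree)
```

Keeping the translators outside the node classes matches the point above: the filter tree carries only criteria, and each engine owns its own parser.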

kacperlukawski • Feb 17 '23 11:02

@kacperlukawski Yes, VectorStoreFilter should be a general interface for all vector stores. The best solution that comes to mind is to implement an ES-like interface. For now only the And operator is supported, but we can work on making it more general with an OR operator, something like the ES bool query. Collaboration is more than welcome; I can add you as a collaborator.
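
For reference, an AND-only filter combined with vector scoring could produce an Elasticsearch request body along these lines. This is a sketch under assumptions, not the code in the PR: the function name `build_filtered_knn_query`, the field name `vector`, and the choice of a script_score query with cosineSimilarity are all illustrative.

```python
from typing import Any, Dict, List


def build_filtered_knn_query(
    query_vector: List[float],
    must_match: Dict[str, Any],
    vector_field: str = "vector",
    k: int = 4,
) -> Dict[str, Any]:
    """Build a script_score search body whose inner query ANDs all term filters.

    Hypothetical helper: every key/value pair in `must_match` must hold for a
    document to be scored at all.
    """
    filters = [{"term": {field: value}} for field, value in must_match.items()]
    inner = {"bool": {"filter": filters}} if filters else {"match_all": {}}
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": inner,
                "script": {
                    # +1.0 keeps the score non-negative, as Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


body = build_filtered_knn_query([0.1, 0.2], {"metadata.source": "wiki"}, k=2)
```

Because the filter sits in the inner query, only matching documents are scored, which is the sense in which the metadata is filtered before the similarity ranking.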

yasha-dev1 • Feb 17 '23 12:02

I would like this functionality too. Is there any way I can help, @yasha-dev1?

SirBernardPhilip • Mar 17 '23 14:03