Implementation for Matching Engine Vectorstore

Open tomaspiaggio opened this issue 2 years ago • 6 comments

We just finished the implementation for the vector store using the GCP Matching Engine.

We'll be contributing the implementation.

Related to #2892

If you have any questions or suggestions please contact me (@tomaspiaggio) or @scafati98.

tomaspiaggio avatar Apr 18 '23 18:04 tomaspiaggio

I just pushed new updates addressing the comments. However, we were trying to add google-cloud-storage and google-cloud-aiplatform to the pyproject.toml, but we're having dependency conflicts with black. Do you have any suggestions here? @dev2049

tomaspiaggio avatar Apr 19 '23 15:04 tomaspiaggio

black is only a linting dependency, not a package dependency, so shouldn't cause issues. think you may have accidentally added it to list of actual dependencies

dev2049 avatar Apr 19 '23 16:04 dev2049

@hwchase17 I thought that was addressed with the from_components function. Could you say specifically what you need? I'm also not sure what you mean by arguments being passed around. Would you please comment on that as well so I can fix it? Thank you!

tomaspiaggio avatar Apr 20 '23 13:04 tomaspiaggio

think he means to make __init__ look something like what i mentioned here https://github.com/hwchase17/langchain/pull/3104/files#r1170476602

dev2049 avatar Apr 20 '23 17:04 dev2049

@dev2049 I already added the from_components function and I agree it is a better approach. The methods called in the constructor are validations for the gcs_bucket_name and that the client libraries are installed. I'm sorry if I'm not understanding what you mean.

tomaspiaggio avatar Apr 20 '23 17:04 tomaspiaggio
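
For readers following the __init__ vs. from_components discussion, here is a minimal sketch of the pattern being described: the constructor only stores ready-made components and runs cheap validation, while a from_components classmethod checks that the client libraries are importable and builds the components. This is not the code from the PR. The class name MatchingEngineSketch, the helpers require_modules and normalize_bucket_name, the client_factory parameter, and the bucket-name rule are all hypothetical and exist only for illustration.

```python
import importlib.util
from typing import Any, Callable, Sequence


def require_modules(names: Sequence[str]) -> None:
    """Raise ImportError naming every module in `names` that cannot be found."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # A missing parent package (e.g. "google") raises instead of returning None.
            spec = None
        if spec is None:
            missing.append(name)
    if missing:
        raise ImportError("Missing required modules: " + ", ".join(missing))


def normalize_bucket_name(gcs_bucket_name: str) -> str:
    """Strip an optional gs:// prefix and reject empty names or object paths."""
    name = gcs_bucket_name
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    if not name or "/" in name:
        raise ValueError(f"Not a bare bucket name: {gcs_bucket_name!r}")
    return name


class MatchingEngineSketch:
    """Hypothetical stand-in for illustration, not the class from the PR."""

    def __init__(self, index: Any, endpoint: Any, gcs_bucket_name: str) -> None:
        # The constructor stores ready-made components and validates cheap inputs only.
        self.index = index
        self.endpoint = endpoint
        self.gcs_bucket_name = normalize_bucket_name(gcs_bucket_name)

    @classmethod
    def from_components(
        cls,
        index_id: str,
        endpoint_id: str,
        gcs_bucket_name: str,
        client_factory: Callable[[str, str], Any],
        required_modules: Sequence[str] = (
            "google.cloud.storage",
            "google.cloud.aiplatform",
        ),
    ) -> "MatchingEngineSketch":
        # Check that the client libraries are importable before building anything.
        require_modules(required_modules)
        # client_factory is an assumed hook that turns an ID into a client object.
        index = client_factory("index", index_id)
        endpoint = client_factory("endpoint", endpoint_id)
        return cls(index, endpoint, gcs_bucket_name)
```

Keeping the expensive setup in the classmethod means tests can construct the object directly with fakes, which is one common reason reviewers ask for this split.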

i just meant you should update __init__ params, which it looks like you did in https://github.com/hwchase17/langchain/pull/3104/commits/2f946f548d502958b97143719e1b36da6f01b05a 🙏 !

dev2049 avatar Apr 21 '23 01:04 dev2049

Great @dev2049 !! So do you need me to do anything else for the merge?

tomaspiaggio avatar Apr 22 '23 01:04 tomaspiaggio

@hwchase17 any chance to get this into release anytime soon?

meal avatar May 07 '23 13:05 meal

@hwchase17 Same question here: Would be nice to see this released

eugenemiretsky avatar May 21 '23 18:05 eugenemiretsky

One concern is that the docs are stored in and retrieved from GCS, which is slow (and somewhat defeats the purpose of using a Vector DB).

eugenemiretsky avatar May 21 '23 18:05 eugenemiretsky

@tomaspiaggio should you create a PR from your branch to master?

eugenemiretsky avatar May 24 '23 11:05 eugenemiretsky

@hwchase17 Any updates on this one? Would be a cool feature!

olaf-hoops avatar May 25 '23 19:05 olaf-hoops

Will this be merged to master? @hwchase17

tomaspiaggio avatar May 30 '23 15:05 tomaspiaggio

Keen to get this merged into master @hwchase17

HarrisonKhannah avatar May 31 '23 03:05 HarrisonKhannah

Once the Matching Engine index is deployed, what is the best retriever in langchain to get the query results? @tomaspiaggio

ramssai avatar Jun 01 '23 10:06 ramssai

Have been using the Vector Search (Matching Engine) with langchain for a couple of days now and I've been hitting my head against a wall to solve a problem.

I notice that when embeddings are sent to Vector Search they get stored and a file is also created and stored within a separate GCS bucket that is referenced when queried.

I am looking for a way to remove the embeddings from Vector Search, but it seems I can only do that with gcloud commands, and for those I need to know the datapoint_ids.

What would be the best way to store the datapoint_ids that are related to the documents that are being embedded?

ktibbs9417 avatar Nov 30 '23 23:11 ktibbs9417
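
One possible approach to the question above, sketched under stated assumptions: derive each datapoint ID deterministically from the document source and chunk position, and keep a small local registry that maps each source to its IDs so they can be looked up at deletion time. The sketch assumes you control the IDs at upsert time, which may not hold for the langchain integration as it stands. It does not call Vector Search at all; datapoint_id_for and DatapointRegistry are hypothetical names, and a JSON file is only a stand-in for whatever store you already use.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def datapoint_id_for(source: str, chunk_index: int) -> str:
    """Derive a stable ID from the document source and the chunk position."""
    digest = hashlib.sha256(f"{source}:{chunk_index}".encode("utf-8")).hexdigest()
    return digest[:32]


class DatapointRegistry:
    """Hypothetical JSON-backed map from document source to datapoint IDs."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self._ids: Dict[str, List[str]] = {}
        if self.path.exists():
            self._ids = json.loads(self.path.read_text(encoding="utf-8"))

    def _save(self) -> None:
        self.path.write_text(json.dumps(self._ids, indent=2), encoding="utf-8")

    def record(self, source: str, n_chunks: int) -> List[str]:
        """Compute and persist the IDs for all chunks of one document."""
        ids = [datapoint_id_for(source, i) for i in range(n_chunks)]
        self._ids[source] = ids
        self._save()
        return ids

    def ids_for(self, source: str) -> List[str]:
        """Return the IDs recorded for a document, or an empty list."""
        return list(self._ids.get(source, []))

    def forget(self, source: str) -> List[str]:
        """Drop a document from the registry and return the IDs to delete."""
        ids = self._ids.pop(source, [])
        self._save()
        return ids
```

The list returned by forget is what you would hand to whichever removal mechanism you use (for example the gcloud commands mentioned above). Because the IDs are deterministic, they can also be recomputed from the source and chunk count if the registry file is lost.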