langchain icon indicating copy to clipboard operation
langchain copied to clipboard

[Feature] Redis Vectorestore - similarity_search filter by metadata

Open benjaminmay1 opened this issue 1 year ago • 5 comments

Chroma or Pinecone Vector databases allow filtering documents by metadata with the filter parameter in the similarity_search function but the similarity_search does not have this parameter. Would it be possible to enable similarity_search for Redis Vector store?

benjaminmay1 avatar May 02 '23 11:05 benjaminmay1

This is something we are exploring. It's 100% possible with redis to do all kinds of metadata filtering, but because the RediSearch module expects the developer to define a schema explicitly, we're trying to figure out the best way to do that. How would you expect the dev UX to feel from your perspective?

tylerhutcherson avatar May 03 '23 13:05 tylerhutcherson

It would be nice to have the option to filter for documents from a list of document names but also for more complex metadata filtering. Also having a similar interface for metadata filtering for different Vectorstores to make Vectorstores easily exchangeable would be interesting. Thank you.

benjaminmay1 avatar May 03 '23 14:05 benjaminmay1

@hwchase17 @agola11 this is probably a good time to get input from the different vector store providers and try to standardize the filtering interface. We all have different approaches, some more complex/sophisticated than others. Some are "schemaless", while some require defining a schema. Some have special query syntax, and some just have a dictionary as input. But having a set of classes/methods that the different stores can wrap would be helpful!

tylerhutcherson avatar May 05 '23 14:05 tylerhutcherson

I'm happy to help here as well. I just added "make PR for redis vectorstore pre-filtering" on my to do list and then found this issue. I was planning to focus specifically on redis, b/c that's my selfish need.

But if the plan is to first come up with a more generic set of classes/methods before implementing flavor-specific methods, I'm happy to help divide and conquer in terms of research/looking for common denominators.

@tylerhutcherson

RediSearch module expects the developer to define a schema explicitly

You're referring to the schema defined at the index level, correct?

For instance in langchain.vectorstores.redis:

        if not _check_index_exists(self.client, self.index_name):
            # Constants
            distance_metric = (
                "COSINE"  # distance metric for the vectors (ex. COSINE, IP, L2)
            )
            schema = (
                TextField(name=self.content_key),
                TextField(name=self.metadata_key),
                VectorField(
                    self.vector_key,
                    "FLAT",
                    {
                        "TYPE": "FLOAT32",
                        "DIM": dim,
                        "DISTANCE_METRIC": distance_metric,
                    },
                ),
            )
            prefix = _redis_prefix(self.index_name)

            # Create Redis Index
            self.client.ft(self.index_name).create_index(
                fields=schema,
                definition=IndexDefinition(prefix=[prefix], index_type=IndexType.HASH),
            )

Off the top of my head, both FAISS and Pinecone allow for/expect the metadata be a single JSON object with key-value pairs where the allowed types vary by vectorstore. However, it looks like the Redis approach is a flatter schema, and only Text, Numeric, and Tag fields are supported. Is that correct?

Per @benjaminmay1's comment

Also having a similar interface for metadata filtering for different Vectorstores to make Vectorstores easily exchangeable would be interesting

As far as interchangeability is concerned, not all vectorstores implement the same methods, so it's not always possible to swap them out. Sometimes this is because of a missing/incomplete implementation in langchain, other times there may be a fundamental feature missing in the VS itself. Achieving maximal exchangeability by minimizing cases of the former is desirable over the long-term. However, short of that, simply having more explicit types for which vectorstores can be slotted into certain classes would be a great start rather than using a generic VectorStore type as a lazy catch all.

AFAICT, even implementing a common interface for metadata filtering for all vectorstore still wouldn't make them universally interchangeable across the library. So the above type recommendation should be thought of a as separate issue.

EandrewJones avatar May 08 '23 03:05 EandrewJones

Hello guys, is there any recent update on this topic?

federicoBetti avatar Jun 14 '23 09:06 federicoBetti

+1 for this, it is one of the issue that blocking us with using Redis as vectorstore. For now we decided to use Pinecone but we considering switching to Redis once this is implemented as we already use Redis for other use cases (tech stack unification) :)

sebarys avatar Jul 04 '23 15:07 sebarys

@EandrewJones redis allows for nested schema using the Redis JSON module, but this implementation in LC uses hashes to store underlying data in Redis (flat structure, and also more performance minded).

Let's make moves on this, I'm willing to carve out some time to help here.

  1. Sounds like a universal approach is not feasible and also maybe not that helpful in the end. At least trying to solve for that now sounds like a big blocker.

  2. Schema definition and management seem like the biggest challenges. We need a way to define the fields to index in Redis upon index creation, which is a separate step from writing the data itself.

I think hiding as much complexity from the user as possible is the most desirable route, especially since users are leveraging LC here for speed and simplicity of LLM dev.

tylerhutcherson avatar Jul 04 '23 18:07 tylerhutcherson

@tylerhutcherson

Apologies for the hold up -- been swamped between jobs. We ultimately went with pinecone because we couldn't wait for Redis to be compatible with our needs.

Nonetheless, I think I can scrounge some time over the next few weeks to help move this forward.

I have an idea for a simple API interface to allow people to define the schema metadata fields at index creation. I'll share more when my brain isn't fried. TL;DR we can allow them to define metadata fields as a dict:

{'<field_name>': '<redis_data_type>'}

So it feels similar to other vector DB schema approaches, then within the index creation we can iterate over/unpack that dict into the schema tuple redis expects.

Semi-related question: Besides performance, is there a strong reason to use hashes over redis JSON? Is it a serious performance hit to use redis JSON? The reason I ask is because the flexibility of redis JSON means it could make the LC redis vectorDB class hew towards "a universal approach". I'm not proposing this as a refactor for now, but it could be worth mulling over?

EandrewJones avatar Jul 13 '23 03:07 EandrewJones

interesting to note: langchain supports this for js/supabase, not python yet:

https://js.langchain.com/docs/modules/indexes/vector_stores/integrations/supabase

earonesty avatar Jul 14 '23 17:07 earonesty

Looks like a handful already have the ability to use metafiltering in js: Chroma, HNSWLib, MyScale, Prisma, Redis, SingleStore, Tigris, Typesense, Vectara

jlancaster7 avatar Jul 17 '23 12:07 jlancaster7

If someone has not already started to work on this, I can give it a shot this week. Kindly let know

nareshr8 avatar Jul 28 '23 05:07 nareshr8

So I think my idea will work for adding more than one metadata field to the schema, but I'd like to minimize breaking changes.

Right now, the class supports a single metadata field that's of type text. Let's assume people are using this in production. Any change to make this generic (support more than TextField) and plural (let's the user define N metadata fields), would be inherently breaking.

Currently, it looks as though the class does support a metadata dict but it serializes to a string:

...
 for i, text in enumerate(texts):
            # Use provided values by default or fallback
            key = keys_or_ids[i] if keys_or_ids else _redis_key(prefix)
            metadata = metadatas[i] if metadatas else {}
            embedding = embeddings[i] if embeddings else self.embedding_function(text)
            pipeline.hset(
                key,
                mapping={
                    self.content_key: text,
                    self.vector_key: np.array(embedding, dtype=np.float32).tobytes(),
                    self.metadata_key: json.dumps(metadata),
                },
            )
            ids.append(key)

            # Write batch
            if i % batch_size == 0:
                pipeline.execute()
...

The core issue we need to solve here is that we cannot filter on individual fields within that serialized dict. The only solution I see involved flattening the metadata into their own fields so we can filter on them in the query. I'm a little fuzzy here though. Would be great to have @tylerhutcherson's insight on the correct and most efficient way to add the filter into the query.

EandrewJones avatar Jul 29 '23 04:07 EandrewJones

I am also thinking of the same. Flattening the metadata. Add all metadata fields to the index so that it can be filtered. I have a dirty implementation of that in my local. It worked for filtering as well.

nareshr8 avatar Jul 29 '23 04:07 nareshr8

honestly there should not be fixed column definitions or structures, imo. instead the user should be able to trivially override the similarity matching and insertion functions, so that any backend of any shape can be used

but for now, a metadata filter is fine

On Sat, Jul 29, 2023, 12:50 AM Naresh @.***> wrote:

I am also thinking of the same. Flattening the metadata. Add all metadata fields to the index so that it can be filtered. I have a dirty implementation of that in my local. It worked for filtering as well.

— Reply to this email directly, view it on GitHub https://github.com/langchain-ai/langchain/issues/3967#issuecomment-1656555237, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMUOPW4Z6TY3JIM2GLIDXSSJCFANCNFSM6AAAAAAXS4PHEE . You are receiving this because you commented.Message ID: @.***>

earonesty avatar Jul 29 '23 06:07 earonesty

@nareshr8 If you push your "dirty" implementation, I'm happy to help clean it up.

@earonesty While I agree, I'm not sure any redis SDK supports that degree of flexibility. Tyler mentioned that redisJSON + redisSearch can support nested metadata fields, but it still expects you to define an index schema on the elements within that JSON object if you want to filter on them:

image

Do you have some insights about how this could be achieved?

EandrewJones avatar Jul 29 '23 16:07 EandrewJones

Sure. Will do it. Its night now. Will do it tonight or early tomorrow.

nareshr8 avatar Jul 29 '23 16:07 nareshr8

NP. Much appreciated.

Best

Evan Jones Website: www.ea-jones.com

On Sat, Jul 29, 2023 at 12:19 PM Naresh @.***> wrote:

Sure. Will do it. Its night now. Will do it tonight or early tomorrow.

— Reply to this email directly, view it on GitHub https://github.com/langchain-ai/langchain/issues/3967#issuecomment-1656767999, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJ2T6AJNKMHINZGNKHL4QPLXSUZYHANCNFSM6AAAAAAXS4PHEE . You are receiving this because you were mentioned.Message ID: @.***>

EandrewJones avatar Jul 29 '23 16:07 EandrewJones

Created a PR: https://github.com/langchain-ai/langchain/pull/8464

I have done one round of testing. It looked fine.

nareshr8 avatar Jul 29 '23 18:07 nareshr8

Also I saw that we are filtering the documents based on score_threshold in python. Its possible to do in redis itself. Should we adding the threshold to be part of redis query itself.

nareshr8 avatar Jul 29 '23 18:07 nareshr8

Also I saw that we are filtering the documents based on score_threshold in python. Its possible to do in redis itself. Should we adding the threshold to be part of redis query itself.

Yes - Range Queries are supported in Redis. However, the initial interface uses the same underlying query approaching using KNN/ANN. But this could be adjusted in the future

tylerhutcherson avatar Jul 30 '23 21:07 tylerhutcherson

@nareshr8 If you push your "dirty" implementation, I'm happy to help clean it up.

@earonesty While I agree, I'm not sure any redis SDK supports that degree of flexibility. Tyler mentioned that redisJSON + redisSearch can support nested metadata fields, but it still expects you to define an index schema on the elements within that JSON object if you want to filter on them:

image

Do you have some insights about how this could be achieved? Hey all -- a few thoughts:

  1. The thoughts above above are valid - Redis requires the user-defined schema. There's a few reasons for this. Redis is not just a vector database, and in fact many folks use it in tandem with an online feature serving layer -- including attributes/fields that do NOT need to be indexed. Indexing fields that don't need to be wastes precious RAM. So while allowing the user to decide what is best for their application, this does add more frustration to UX. But hopefully we can make the LangChain interface somewhat delightful.

  2. My proposition is we create a simple and standard way to define a "Schema" using a python dictionary object that is passed upon index creation. To enable robustness -- we could also persist this schema object in Redis itself, such that when we call from_existing we can read this from the db on load. Example:

{
    "field_name_1": {
        "type": "Text",
        "description": "This is a text field",
        "default": "N/A"
    },
    "field_name_2": {
        "type": "Tag",
        "description": "This is a tag field",
    },
    "field_name_3": {
        "type": "Numeric",
        "description": "This is a numeric field",
        "default": 0
    },
    "field_name_4": {
        "type": "Vector",
        "description": "This is a vector field",
    },
    # ...
}

tylerhutcherson avatar Jul 30 '23 21:07 tylerhutcherson

Tyler's suggestion hews closely to what I was thinking -- the store schema in DB is a nice touch.

Naresh, I took a look at your implementation. It's pretty close to the above, we just need to add more flexibility so that the metadata keys aren't assumed to always be Text. As Tyler shows, we also ought to support Numeric and Tag. I'll try to carve out some time tomorrow to comment directly on the code (if I even have those permissions) or push some changes.

Best

Evan Jones Website: www.ea-jones.com

On Sun, Jul 30, 2023 at 5:25 PM Tyler Hutcherson @.***> wrote:

@nareshr8 https://github.com/nareshr8 If you push your "dirty" implementation, I'm happy to help clean it up.

@earonesty https://github.com/earonesty While I agree, I'm not sure any redis SDK supports that degree of flexibility. Tyler mentioned that redisJSON + redisSearch can support nested metadata fields, but it still expects you to define an index schema on the elements within that JSON object if you want to filter on them:

[image: image] https://user-images.githubusercontent.com/41238273/256993629-788a3f7b-310b-4859-b602-8a8028758929.png

Do you have some insights about how this could be achieved? Hey all -- a few thoughts:

The thoughts above above are valid - Redis requires the user-defined schema. There's a few reasons for this. Redis is not just a vector database, and in fact many folks use it in tandem with an online feature serving layer -- including attributes/fields that do NOT need to be indexed. Indexing fields that don't need to be wastes precious RAM. So while allowing the user to decide what is best for their application, this does add more frustration to UX. But hopefully we can make the LangChain interface somewhat delightful. 2.

My proposition is we create a simple and standard way to define a "Schema" using a python dictionary object that is passed upon index creation. To enable robustness -- we could also persist this schema object in Redis itself, such that when we call from_existing we can read this from the db on load. Example:

{ "field_name_1": { "type": "Text", "description": "This is a text field", "default": "N/A" }, "field_name_2": { "type": "Tag", "description": "This is a tag field", }, "field_name_3": { "type": "Numeric", "description": "This is a numeric field", "default": 0 }, "field_name_4": { "type": "Vector", "description": "This is a vector field", }, # ... }

— Reply to this email directly, view it on GitHub https://github.com/langchain-ai/langchain/issues/3967#issuecomment-1657270097, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJ2T6AKOTIV3ZW2QN7VQDCLXS3GODANCNFSM6AAAAAAXS4PHEE . You are receiving this because you were mentioned.Message ID: @.***>

EandrewJones avatar Jul 31 '23 03:07 EandrewJones

Ill also work on this. Guessing you can comment there as well. Let know otherwise.

nareshr8 avatar Jul 31 '23 04:07 nareshr8

@nareshr8 If you push your "dirty" implementation, I'm happy to help clean it up.

@earonesty While I agree, I'm not sure any redis SDK supports that degree of flexibility. Tyler mentioned that redisJSON + redisSearch can support nested metadata fields, but it still expects you to define an index schema on the elements within that JSON object if you want to filter on them:

image

Do you have some insights about how this could be achieved?

i think you can achieve it by deriving from the Redis class and implementing your own matching functionality? But you have to make sure your init is the same, because langchain likes to instance stuff for you

earonesty avatar Jul 31 '23 12:07 earonesty

I have implemented the changes to accommodate TEXT, TAG and NUMERIC in the above mentioned PR. Let know your suggestions

nareshr8 avatar Jul 31 '23 13:07 nareshr8

Added in #8612

Spartee avatar Sep 12 '23 23:09 Spartee

@baskaryan this can be closed.

Spartee avatar Oct 17 '23 22:10 Spartee