Adding lists to the metadata

Open Everminds opened this issue 2 years ago • 29 comments

Hi, we need to save lists in the metadata. For example, we are saving a Slack message and want the metadata to hold all the users mentioned in the message. We also want search to be able to filter on this field by checking whether a value is in the list (e.g. find all Slack messages that mention a specific user). It would be great to have support for this. Thanks!

Everminds avatar Mar 23 '23 12:03 Everminds

@Everminds hello! Is this information that you store outside of chroma as well? If so, I have another idea for a solution here.

jeffchuber avatar Mar 24 '23 05:03 jeffchuber

We can save it outside though it would be less convenient

mangate avatar Mar 24 '23 05:03 mangate

@jeffchuber any updates on this one?

Everminds avatar Apr 02 '23 09:04 Everminds

I would vote for this. It would be very useful if lists were supported directly, so that we don't need a third tool to retrieve all the vectors and compare them again.

It will be helpful for scenarios like we get a doc describing a thing but with different versions, models, etc.

8rV1n avatar May 19 '23 06:05 8rV1n

@8rV1n you want to be able to pass an allowlist of ids to query, right?

that is underway :) https://github.com/chroma-core/chroma/pull/384

jeffchuber avatar May 19 '23 06:05 jeffchuber

Thanks @jeffchuber , I guess not just IDs, widening it to metadata would be great!

To clarify it:

  • Currently, metadata values can only be set through a Dict; I would expect list values to be supported as well.
  • With list-typed metadata values, operators such as $contains and $range should be available for metadata (combining AND/OR could also cover the range case).

I understand this would mean a lot of effort, but see below for how it helps:

An example scenario: say I have a web page that is updated frequently, e.g. weekly. The IDs could be randomly generated UUIDs, but each version has a label giving its week number. If lists were supported, we could narrow down the range with a filter such as weeks 20-50.

Similarly, you may change the "web page" to "products" of an online shopping site, we normally filter things with many options like price, category, shipping preference, seller, etc. We want to get a similar result by the product detail(content), and we also want to filter it using things we are familiar with so that we can make it more efficient.
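To make the two requested filters concrete, here is a minimal pure-Python sketch of the intended semantics. It is not Chroma code: the helper names `contains` and `in_range` and the sample `pages` data are assumptions made up for illustration.

```python
from typing import Dict, List, Union

Scalar = Union[str, int, float]
MetadataValue = Union[Scalar, List[Scalar]]


def contains(metadata: Dict[str, MetadataValue], key: str, wanted: Scalar) -> bool:
    """True if metadata[key] is a list that includes `wanted` (hypothetical $contains)."""
    value = metadata.get(key)
    return isinstance(value, list) and wanted in value


def in_range(metadata: Dict[str, MetadataValue], key: str, low: float, high: float) -> bool:
    """True if metadata[key] is a number within [low, high] (hypothetical $range)."""
    value = metadata.get(key)
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and low <= value <= high


# Hypothetical sample metadata: one entry per weekly version of a page.
pages = [
    {"week": 12, "labels": ["draft"]},
    {"week": 25, "labels": ["published", "featured"]},
    {"week": 48, "labels": ["published"]},
]

# "Weeks 20-50 that carry the 'published' label".
recent_published = [
    p for p in pages
    if in_range(p, "week", 20, 50) and contains(p, "labels", "published")
]
```

The range case only needs scalar comparisons; it is the `contains` case that requires a list-valued metadata field.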

8rV1n avatar May 19 '23 08:05 8rV1n

@8rV1n chroma has this :) though we currently do a bad job communicating it

https://github.com/chroma-core/chroma/blob/a5637002e4599e8b9e78db8e7be0cdb380942673/chromadb/test/test_api.py#L1050

look inside that test folder and you will see examples of all of these. The where filter in get will work with query as well

jeffchuber avatar May 19 '23 17:05 jeffchuber

Thanks @jeffchuber!

Any idea for using metadata like this? (adding, and querying)

collection.add(
    documents=["Alice meets rabbits...", "doc2", "doc3", ...],
    metadatas=[{"charactor_roles": ['Alice', 'rabbits']}, {"charactor_roles": ['Steve Jobs', 'Tim Cook']}, {"charactor_roles": []}, ...],
    ids=["id1", "id2", "id3", ...]
)

It seems I can do this for the metadata when creating the collection:

client.create_collection(
    "my_collection", 
    metadata={"foo": ["bar", "bar2"]}
)

8rV1n avatar May 20 '23 03:05 8rV1n

+1, this would be incredibly useful: we wouldn't need a secondary datastore just to be able to attach lists to documents

pbarker avatar Jun 22 '23 19:06 pbarker

Happy to take a stab at this.

If I'm understanding correctly, this would mean adding List as an allowed value in Metadata

-Metadata = Mapping[str, Union[str, int, float]]
+Metadata = Mapping[str, Union[str, int, float, List[Union[str, int, float]]]]

So that lists can be added as a value in metadata:

collection.add(ids=['test'], documents=['test'], metadatas=[{ 'list': [1, 2, 3] }])
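A widened type also needs a matching validator. The following is only a sketch of what such a check could look like, assuming lists must be flat and hold a single scalar type; `validate_metadata` here is a hypothetical standalone function, not the validator in chromadb.

```python
from typing import List, Mapping, Union

Scalar = Union[str, int, float]
Metadata = Mapping[str, Union[Scalar, List[Scalar]]]


def validate_metadata(metadata: Metadata) -> Metadata:
    """Accept scalar values and flat, single-typed lists of scalars."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise ValueError(f"Expected metadata key to be a str, got {key!r}")
        items = value if isinstance(value, list) else [value]
        for item in items:
            # bool is a subclass of int, so it is rejected explicitly here.
            if isinstance(item, bool) or not isinstance(item, (str, int, float)):
                raise ValueError(
                    f"Expected str, int or float for key {key!r}, got {item!r}"
                )
        # Assumption: a list may not mix types, e.g. [1, "a"] is rejected.
        if isinstance(value, list) and len({type(item) for item in value}) > 1:
            raise ValueError(f"List for key {key!r} mixes value types")
    return metadata


validated = validate_metadata({"list": [1, 2, 3]})
```

Rejecting mixed-type lists is a design choice in this sketch; it keeps each list mappable to one typed storage column.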

The biggest source of uphill work, I think, would be adding support for Lists to the Where filter operators

EDIT: Should we re-use existing operators and make them work for lists? e.g.

collection.get(where={ "list": {  "$eq": 2 } })

or create new operators for lists? e.g.

collection.get(where={"list": { "$contains": 2 } })

Russell-Pollari avatar Jun 30 '23 17:06 Russell-Pollari

@Russell-Pollari yes that is correct!

Where operator support is definitely the biggest lift here.

I think $in and $notin (or the better named version of those) is probably the minimal case...

jeffchuber avatar Jun 30 '23 20:06 jeffchuber

@jeffchuber

IMO $in and $nin imply that I should supply an array to filter against. They would be useful operators for all types.

I think it would be better UX to have $eq and $ne also work with lists (effectively as $contains or $notContains when appropriate)
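As a rough illustration of that behaviour, the sketch below treats `$eq` as a containment check when the stored value is a list and as plain equality otherwise, with `$in`/`$nin` taking an array of candidates. The function names are hypothetical and this is pure Python, not the Chroma filter code.

```python
from typing import List, Union

Scalar = Union[str, int, float]
Stored = Union[Scalar, List[Scalar]]


def eq(stored: Stored, value: Scalar) -> bool:
    """$eq: containment for list fields, equality for scalar fields."""
    if isinstance(stored, list):
        return value in stored
    return stored == value


def ne(stored: Stored, value: Scalar) -> bool:
    """$ne: the negation of $eq, so a list matches only if it lacks `value`."""
    return not eq(stored, value)


def in_(stored: Stored, candidates: List[Scalar]) -> bool:
    """$in: true if any candidate matches under $eq semantics."""
    return any(eq(stored, candidate) for candidate in candidates)


def nin(stored: Stored, candidates: List[Scalar]) -> bool:
    """$nin: true if no candidate matches."""
    return not in_(stored, candidates)
```

With this split, `$eq` and `$ne` cover the single-value lookups on list fields, while `$in` and `$nin` remain useful for scalar fields too.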

But I'm definitely pattern matching to MongoDB's query operators here. This is how they do it:

I managed to get a working prototype for filtering arrays with $eq for duckdb:

            # Shortcut for $eq
            if type(value) == str:
                result.append(
                    f""" (
                        json_extract_string(metadata, '$.{key}') = '{value}'
                        OR
                        json_contains(json_extract(metadata, '$.{key}'), '\"{value}\"')
                    )
                    """
                )
            if type(value) == int:
                result.append(
                    f""" (
                        CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
                        THEN
                        list_has(CAST(json_extract(metadata, '$.{key}') AS INT[]), {value})
                        ELSE
                        CAST(json_extract(metadata, '$.{key}') AS INT) = {value}
                        END
                    )
                    """
                )
            if type(value) == float:
                result.append(
                    f""" (
                        CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
                        THEN
                        list_has(CAST(json_extract(metadata, '$.{key}') AS DOUBLE[]), {value})
                        ELSE
                        CAST(json_extract(metadata, '$.{key}') AS DOUBLE) = {value}
                        END
                    )
                    """
                )

Russell-Pollari avatar Jun 30 '23 20:06 Russell-Pollari

@Russell-Pollari indexing against how mongo does it is definitely a good idea!

@HammadB what do you think?

jeffchuber avatar Jun 30 '23 21:06 jeffchuber

Threw up a PR, let me know what you think!

If my solution works for y'all, happy to also update the JS client and the docs

Russell-Pollari avatar Jul 03 '23 13:07 Russell-Pollari

@Russell-Pollari thanks! will take a look today :)

jeffchuber avatar Jul 04 '23 13:07 jeffchuber

Hey, I'm also interested in using this functionality, I have documents with a bunch of possible tags as metadata, for example

Document(page_content='lorem impsum ...',
metadata={
'id': '5f874c6591bc3f9a540c3722',
'title': 'hello world',
'tags': 'tag1, tag2, tag3, etc'
}
)

If I could use the $contains operator I could filter for specific tags. Right now I'm trying turning all the tags into binary values, but I think that's breaking chroma somehow
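For reference, the flag workaround described above could look roughly like this. It is a sketch under assumptions: `tags_to_flags` and `where_all_tags` are hypothetical helpers, integer flags are used because scalar values are what the metadata type accepts, and the generated filter reuses the `$and`/`$eq` syntax that appears elsewhere in this thread.

```python
from typing import Any, Dict, List


def tags_to_flags(tags: str, prefix: str = "tag_") -> Dict[str, int]:
    """Turn 'tag1, tag2' into {'tag_tag1': 1, 'tag_tag2': 1}."""
    names = [t.strip() for t in tags.split(",")]
    return {f"{prefix}{name}": 1 for name in names if name}


def where_all_tags(tags: List[str], prefix: str = "tag_") -> Dict[str, Any]:
    """Build a where filter that requires every given tag flag to be set."""
    if not tags:
        raise ValueError("at least one tag is required")
    clauses = [{f"{prefix}{tag}": {"$eq": 1}} for tag in tags]
    # A single condition is returned bare rather than wrapped in $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


flags = tags_to_flags("tag1, tag2, tag3")
where = where_all_tags(["tag1", "tag3"])
```

The cost of this layout is one metadata key per distinct tag, so documents with many tags carry correspondingly large metadata.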

tyatabe avatar Jul 06 '23 09:07 tyatabe

but I think that's breaking chroma somehow

:( can you share more about what is breaking? this should work. are they true/false or 1/0?

jeffchuber avatar Jul 06 '23 15:07 jeffchuber

Hey, I wasn't sure it could handle booleans or ints, so I ended up turning them into strings '0'/'1'. The error I got was from clickhouse (I'm using it with a chroma server). I think it was related to the size of the query being too big, as I also have a cloud server where I got a 413 error. I ended up looping over the documents and that solved the issue, so I'm guessing that having so many metadata fields makes the documents too big to be handled by clickhouse? (not really sure how it all works though)

tyatabe avatar Jul 07 '23 10:07 tyatabe

@tyatabe gotcha. there was a max_query_size issue people had run into with clickhouse. We are removing clickhouse now and that should fix up this sort of sharp edge.

jeffchuber avatar Jul 07 '23 14:07 jeffchuber

Exploring the new SQLite implementation.

My naive approach would look something like this, with separate tables for int, str and float:

     def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
         """Insert or update each metadata row for a single embedding record"""
-        t = Table("embedding_metadata")
+        t, str_list, int_list, float_list = Tables(
+            "embedding_metadata",
+            "embedding_metadata_string",
+            "embedding_metadata_int",
+            "embedding_metadata_float",
+        )
         q = (
             self._db.querybuilder()
             .into(t)
             .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
         )
         for key, value in metadata.items():
+            if isinstance(value, list):
+                if isinstance(value[0], str):
+                    for val in value:
+                        q_str = (
+                            self._db.querybuilder()
+                            .into(str_list)
+                            .columns(str_list.metadata_id, str_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
+                if isinstance(value[0], int):
+                    for val in value:
+                        q_int = (
+                            self._db.querybuilder()
+                            .into(int_list)
+                            .columns(int_list.metadata_id, int_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
+                if isinstance(value[0], float):
+                    for val in value:
+                        q_float = (
+                            self._db.querybuilder()
+                            .into(float_list)
+                            .columns(float_list.metadata_id, float_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
             if isinstance(value, str):
                ...
                 q = q.insert(
                     ParameterValue(id),

Does this make sense? @jeffchuber @HammadB

Russell-Pollari avatar Jul 12 '23 01:07 Russell-Pollari

Update: got a hacky prototype for list[int]. Should be straightforward to generalize to other types

(branched off of https://github.com/chroma-core/chroma/pull/781 for my working dir)

Migration for new table:

CREATE TABLE embedding_metadata_ints (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    int_value INTEGER NOT NULL
);

Inserting metadata with a list (chromadb/segment/impl/metadata/sqlite.py):

    def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
        """Insert or update each metadata row for a single embedding record"""
        (
            t,
            int_t,
        ) = Tables(
            "embedding_metadata",
            "embedding_metadata_ints",
        )
        q = (
            self._db.querybuilder()
            .into(t)
            .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
        )
        for key, value in metadata.items():
            if isinstance(value, list):
                q = q.insert(
                    ParameterValue(id),
                    ParameterValue(key),
                    None,
                    None,
                    None,
                )
                if isinstance(value[0], int):
                    q_int = (
                        self._db.querybuilder()
                        .into(int_t)
                        .columns(int_t.id, int_t.key, int_t.int_value)
                    )
                    for val in value:
                        q_int = q_int.insert(
                            ParameterValue(id), ParameterValue(key), ParameterValue(val)
                        )
                    sql, params = get_sql(q_int)
                    sql = sql.replace("INSERT", "INSERT OR REPLACE")
                    if sql:
                        cur.execute(sql, params)

            if isinstance(value, str):
             ...

Querying for list of ints (SqliteMetadataSegment.get_metadata)

    def get_metadata
....
        embeddings_t, metadata_t, fulltext_t, int_t = Tables(
            "embeddings",
            "embedding_metadata",
            "embedding_fulltext",
            "embedding_metadata_ints",
        )

        q = (
            (
                self._db.querybuilder()
                .from_(embeddings_t)
                .left_join(metadata_t)
                .on(embeddings_t.id == metadata_t.id)
                .outer_join(int_t)
                .on((metadata_t.key == int_t.key) & (metadata_t.id == int_t.id))
            )
            .select(
                embeddings_t.id,
                embeddings_t.embedding_id,
                embeddings_t.seq_id,
                metadata_t.key,
                metadata_t.string_value,
                metadata_t.int_value,
                metadata_t.float_value,
                int_t.int_value,
            )

Constructing the metadata object with a list of ints:

    def _record(self, rows: Sequence[Tuple[Any, ...]]) -> MetadataEmbeddingRecord:
        """Given a list of DB rows with the same ID, construct a
        MetadataEmbeddingRecord"""
        _, embedding_id, seq_id = rows[0][:3]
        metadata = {}
        for row in rows:
            key, string_value, int_value, float_value, int_elem = row[3:]
            if string_value is not None:
                metadata[key] = string_value
            elif int_value is not None:
                metadata[key] = int_value
            elif float_value is not None:
                metadata[key] = float_value
            elif int_elem is not None:
                int_list = metadata.get(key, [])
                int_list.append(int_elem)
                metadata[key] = int_list

Also requires updating the relevant types/validators to allow for lists
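The row-grouping step can be tried in isolation. The sketch below is a simplified standalone version, assuming each joined row has already been cut down to `(key, string_value, int_value, float_value, int_elem)`; `build_metadata` is a hypothetical name and the sample rows are made up.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

# (key, string_value, int_value, float_value, int_elem), as in row[3:] above.
Row = Tuple[str, Optional[str], Optional[int], Optional[float], Optional[int]]


def build_metadata(rows: Sequence[Row]) -> Dict[str, Any]:
    """Fold joined rows into one metadata dict, collecting list elements."""
    metadata: Dict[str, Any] = {}
    for key, string_value, int_value, float_value, int_elem in rows:
        if string_value is not None:
            metadata[key] = string_value
        elif int_value is not None:
            metadata[key] = int_value
        elif float_value is not None:
            metadata[key] = float_value
        elif int_elem is not None:
            # The join yields one row per list element for the same key.
            metadata.setdefault(key, []).append(int_elem)
    return metadata


rows = [
    ("title", "hello", None, None, None),
    ("weeks", None, None, None, 20),
    ("weeks", None, None, None, 21),
    ("weeks", None, None, None, 22),
]
record_metadata = build_metadata(rows)
```

One consequence visible in this sketch: a key whose list is empty produces a row with every value column set to NULL, so it is silently dropped unless empty lists are marked some other way.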

Russell-Pollari avatar Jul 13 '23 15:07 Russell-Pollari

Converging on a solution

Initially, I created tables for each allowed list type (int, str, float). It was working but was getting messy.

Ended up using another table with the same schema as embedding_metadata, which let me reuse a lot of existing functions

CREATE TABLE embedding_metadata_lists (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    string_value TEXT,
    float_value REAL,
    int_value INTEGER
);

    @override
    def get_metadata(
        self,
        where: Optional[Where] = None,
        where_document: Optional[WhereDocument] = None,
        ids: Optional[Sequence[str]] = None,
        limit: Optional[int] = None,
        offset: Optional[int] = None,
    ) -> Sequence[MetadataEmbeddingRecord]:
        """Query for embedding metadata."""

        embeddings_t, metadata_t, fulltext_t, metadata_list_t = Tables(
            "embeddings",
            "embedding_metadata",
            "embedding_fulltext",
            "embedding_metadata_lists",
        )

        q = (
            (
                self._db.querybuilder()
                .from_(embeddings_t)
                .left_join(metadata_t)
                .on(embeddings_t.id == metadata_t.id)
                .left_outer_join(metadata_list_t)
                .on(
                    (metadata_t.key == metadata_list_t.key)
                    & (embeddings_t.id == metadata_list_t.id)
                )
            )
            .select(
                embeddings_t.id,
                embeddings_t.embedding_id,
                embeddings_t.seq_id,
                metadata_t.key,
                metadata_t.string_value,
                metadata_t.int_value,
                metadata_t.float_value,
                metadata_list_t.string_value,
                metadata_list_t.int_value,
                metadata_list_t.float_value,
            )
            ...
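To show how a single side table supports list filtering, here is a self-contained sketch using Python's built-in sqlite3 module. The schema is a simplified copy of `embedding_metadata_lists` (the reference to `embedding_metadata(key)` is dropped), and the sample data and the `ids_with_tag` helper are assumptions for illustration, not the query Chroma generates.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE embeddings (id INTEGER PRIMARY KEY, embedding_id TEXT);
    CREATE TABLE embedding_metadata_lists (
        id INTEGER REFERENCES embeddings(id),
        key TEXT,
        string_value TEXT,
        float_value REAL,
        int_value INTEGER
    );
    """
)

conn.executemany(
    "INSERT INTO embeddings (id, embedding_id) VALUES (?, ?)",
    [(1, "doc1"), (2, "doc2"), (3, "doc3")],
)

# One row per list element, all stored under the same key.
sample_tags = {1: ["a", "b", "c"], 2: ["a", "b"], 3: ["a"]}
conn.executemany(
    "INSERT INTO embedding_metadata_lists (id, key, string_value) VALUES (?, ?, ?)",
    [(doc_id, "tags", tag) for doc_id, tags in sample_tags.items() for tag in tags],
)


def ids_with_tag(tag: str) -> List[str]:
    """Return embedding ids whose 'tags' list contains `tag`."""
    rows = conn.execute(
        """
        SELECT e.embedding_id
        FROM embeddings AS e
        WHERE EXISTS (
            SELECT 1 FROM embedding_metadata_lists AS l
            WHERE l.id = e.id AND l.key = ? AND l.string_value = ?
        )
        ORDER BY e.id
        """,
        ("tags", tag),
    ).fetchall()
    return [row[0] for row in rows]
```

Using `EXISTS` rather than a join keeps one result row per embedding even when several list elements would match.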

If this approach makes sense, can you assign this issue to me, @jeffchuber? I just about have a shippable PR with tests (old and new) passing.

Russell-Pollari avatar Jul 14 '23 01:07 Russell-Pollari

Hi @Russell-Pollari, can you explain how those changes will impact the usage of Chroma from a user's point of view?

My use case is the following: each item in the database is tagged under a dedicated key (in my case "tags"), and I would like to pre-filter the query results based on those tags as well. Let's say we have 3 documents:

  • the first has tags = [iot, business, machine]
  • the second has tags = [iot, business, support]
  • the third has tags = [iot]

I would like to pre-filter the data getting only the items that for example have "iot" and "business" as tags.

Using the already present syntax (using-logical-operators) it could be something like this:

where={
    "$and": [
        {
            "tags": {
                "$contains": "iot"
            }
        },
        {
            "tags": {
                "$contains": "business"
            }
        }
    ]
}

The same applies to the $or operator.
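The expected result of such a filter can be checked with a small pure-Python evaluator. This is only a sketch of the requested semantics: `matches` is a hypothetical function, `$contains` on metadata is the operator being proposed here rather than an existing one, and the three documents are the ones from the example.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Evaluate a where filter limited to $and, $or and list $contains."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            stored = metadata.get(key)
            wanted = condition["$contains"]
            if not (isinstance(stored, list) and wanted in stored):
                return False
    return True


docs = {
    "first": {"tags": ["iot", "business", "machine"]},
    "second": {"tags": ["iot", "business", "support"]},
    "third": {"tags": ["iot"]},
}

both = {"$and": [{"tags": {"$contains": "iot"}}, {"tags": {"$contains": "business"}}]}
either = {"$or": [{"tags": {"$contains": "machine"}}, {"tags": {"$contains": "support"}}]}

selected_and = [name for name, md in docs.items() if matches(md, both)]
selected_or = [name for name, md in docs.items() if matches(md, either)]
```

Under these semantics the `$and` filter keeps the first and second documents and drops the third, which is the pre-filtering behaviour described above.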

Buckler89 avatar Jul 15 '23 18:07 Buckler89

@Buckler89 That's the intended use case for this feature: supporting lists in embedding metadata, and allowing users to filter based on those lists. I have a working local branch implementing this.

I'll likely push a PR this week once the Chroma team merges their big SQLite refactor.

Russell-Pollari avatar Jul 17 '23 12:07 Russell-Pollari

needs to integrate fairly tightly with the need to create custom indices...

jeffchuber avatar Sep 13 '23 21:09 jeffchuber

Dear all, this issue came back in python 0.4.20. @jeffchuber

collection.add(
    documents=[x["metadata"]["summary"] for x in data],
    embeddings=embeds_2.embeddings,
    metadatas=[x['metadata'] for x in data],
     ids=[x['uid'] for x in data]
)

where data is a list of objects, each shaped like this:

{
    "uid": string,
    "field1": string,
    "field2": string[],
    "metadata": {
        "field1": string[],
        "field2": number[],
        "field4": string,
    }
},

The error is:

ValueError                                Traceback (most recent call last)
Cell In[107], line 1
----> 1 collection.add(
      2     documents=[x["metadata"]["summary"] for x in data],
      3     embeddings=embeds_2.embeddings,
      4     metadatas=[x['metadata'] for x in data],
      5      ids=[x['uid'] for x in data]
      6 )

File d:\dev2.0\deep-processing\.venv\Lib\site-packages\chromadb\api\models\Collection.py:146, in Collection.add(self, ids, embeddings, metadatas, documents, images, uris)
    104 def add(
    105     self,
    106     ids: OneOrMany[ID],
   (...)
    116     uris: Optional[OneOrMany[URI]] = None,
    117 ) -> None:
    118     """Add embeddings to the data store.
    119     Args:
    120         ids: The ids of the embeddings you wish to add
   (...)
    136 
    137     """
    139     (
    140         ids,
...
    277             f"Expected metadata value to be a str, int, float or bool, got {value} which is a {type(value)}"
    278         )
    279 return metadata

ValueError: Expected metadata value to be a str, int, float or bool, got ['901123200'] which is a <class 'list'>
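Until list values are accepted, one way to get past this validation error is to serialise lists before calling `collection.add`. The sketch below is an assumption-laden workaround, not an official recommendation: `flatten_metadata` is a hypothetical helper that JSON-encodes list values and leaves the scalar types named in the error message untouched.

```python
import json
from typing import Any, Dict, Union

Scalar = Union[str, int, float, bool]


def flatten_metadata(metadata: Dict[str, Any]) -> Dict[str, Scalar]:
    """Replace list values with JSON strings so only scalars remain."""
    flat: Dict[str, Scalar] = {}
    for key, value in metadata.items():
        if isinstance(value, (str, int, float, bool)):
            flat[key] = value
        elif isinstance(value, (list, tuple)):
            flat[key] = json.dumps(list(value))
        else:
            raise ValueError(f"Unsupported metadata value for {key!r}: {value!r}")
    return flat


flat = flatten_metadata(
    {"field1": ["901123200"], "field2": [1, 2], "field4": "text"}
)
```

It would be applied as `metadatas=[flatten_metadata(x['metadata']) for x in data]`, with `json.loads` restoring the lists after retrieval. The encoded string cannot be filtered by element, so this only covers storage, not the list filtering this issue asks for.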

PeterTF656 avatar Dec 20 '23 07:12 PeterTF656

Is this still on the roadmap? I'm trying to add a collection of "keywords" for each article I am storing, and this seems like it would be needed for that (I could also be architecting this wrong myself...)

ivanol55 avatar Feb 22 '24 09:02 ivanol55