
[Bug]: Metadata/`where` edge cases

Open hesreallyhim opened this issue 6 months ago • 9 comments

What happened?

Related: #4346

Description

In addition to the observation in #4346 that a document ID can be an empty string, I found some more edge cases in metadata keys and `where` filters that we may wish to disallow:

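import chromadb

# Assumed setup (not part of the original report): an in-memory client and a
# scratch collection, so the snippet below runs on its own.
client = chromadb.Client()
collection = client.get_or_create_collection("edge_cases")

collection.upsert(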
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.7, 4.3, 3.2],
        [4.7, 4.8, 3.2],
    ],
    metadatas=[
        {"uri": "img1.png", "style": "style1"},
        {"": "img2.png"},
        {"": "", "$nin": "uhoh"},
        {"uri": "img4.png", "computed": "style" + "1"},
        {"uri": "img5.png", "uhoh": "$contains"},
        {"uri": "$contains", "$nin": "$nin", "bool": True, "num": 21},
    ],
    documents=["doc9", "doc2", "doc3", "doc4", "doc5", "doc6"],
    ids=["id1", "id2", "id3", "id4", "id5", "id6"],
)

result1 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"style": {"$eq": "style1"}},
)
print("result1", result1)

result2 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"$nin": "uhoh"},
)
print("result2", result2)

result3 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"uhoh": "$contains"},
)
print("result3", result3)

result4 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"$nin": "$nin"},
)
print("result4", result4)

result5 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"bool": 5 == 5},
)
print("result5", result5)

result6 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"bool": True or False},
)
print("result6", result6)

result7 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    # chain of falsey and then match True
    where={"bool": False or 0 or True or 6},
)
print("result7", result7)

result8 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    # chain with a non-matching truthy, cuts off the real match
    where={"bool": False or "truthy" or True or 6},
)
print("result8", result8)

result9 = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=15,
    where={"num": {"$in": list(range(25))}},
)
print("result9", result9)

# result10 = collection.query(
#         query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
#         n_results=15,
#         where={"num": {"$in": list(range(10000000000000000000))}} # DoS attack(?)
# )

# print("result10", result10)
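
A note on result5 through result8: the `where` values there are ordinary Python expressions, so Python reduces them to a single value before the client ever sees the filter. The short sketch below uses no Chroma calls at all (the `where_values` name is just for illustration) and prints what each expression evaluates to:

```python
# Each right-hand side is plain Python, evaluated before any filter is built.
where_values = {
    "result5": 5 == 5,                           # comparison -> True
    "result6": True or False,                    # `or` returns first truthy operand
    "result7": False or 0 or True or 6,          # falsy operands are skipped
    "result8": False or "truthy" or True or 6,   # stops at the string
}

for name, value in where_values.items():
    print(name, repr(value))
# result5 True
# result6 True
# result7 True
# result8 'truthy'
```

So result5, result6 and result7 all send `{"bool": True}`, while result8 sends `{"bool": "truthy"}`, which is why it cannot match the stored `True`.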

View in this colab notebook:

https://colab.research.google.com/drive/1BKGRLM9CmuGHHFW0hBorlLN6g-Coz2U1#scrollTo=64dWyeEdKAX9

The last (commented-out) query, with its enormous `$in` list, might be a potential DoS vector for Chroma Cloud, though I'm not sure.
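
To make the "disallow" idea concrete, here is a rough sketch of the kind of client-side check that could reject these inputs. Everything in it is hypothetical: `check_metadata_keys`, `in_lists_within_limit` and the `MAX_IN_LIST` cap are names and values invented for illustration, not part of Chroma, and the right rules would be up to the maintainers.

```python
from typing import Any, Mapping

MAX_IN_LIST = 1_000  # hypothetical cap, not a Chroma setting


def check_metadata_keys(metadata: Mapping[str, Any]) -> list[str]:
    """Return the problems found in one metadata dict (empty list if none)."""
    problems = []
    for key in metadata:
        if key == "":
            problems.append("empty metadata key")
        elif key.startswith("$"):
            problems.append(f"metadata key {key!r} looks like an operator")
    return problems


def in_lists_within_limit(where: Any, limit: int = MAX_IN_LIST) -> bool:
    """True if every $in / $nin list nested in a where-filter has <= limit items."""
    if isinstance(where, Mapping):
        for key, value in where.items():
            if key in ("$in", "$nin") and isinstance(value, (list, tuple)):
                if len(value) > limit:
                    return False
            elif not in_lists_within_limit(value, limit):
                return False
    elif isinstance(where, (list, tuple)):
        return all(in_lists_within_limit(item, limit) for item in where)
    return True


print(check_metadata_keys({"": "", "$nin": "uhoh"}))
print(in_lists_within_limit({"num": {"$in": list(range(25))}}))
```

With the metadata from the report, the third record would be flagged twice (empty key and `$nin` key), and the result9 filter would pass the size check.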

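I think this also means that ID filtering (a new feature) will effectively already allow for comparison "operations" ("$gte", etc.): you can build the list of matching values with a list comprehension on the client side and pass it to a list-based filter, which basically simulates the same functionality.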
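
As a minimal sketch of that idea, using the `num` metadata field from the repro rather than the ID filter (whose exact API I am not assuming here): the helper `gte_as_in` is a made-up name, and it only builds the filter dict, it does not call Chroma.

```python
def gte_as_in(field, lower_bound, candidates):
    """Build an `$in` filter that behaves like `$gte` over a known candidate set."""
    return {field: {"$in": [c for c in candidates if c >= lower_bound]}}


where = gte_as_in("num", 20, range(25))
print(where)
# {'num': {'$in': [20, 21, 22, 23, 24]}}
```

The resulting dict could be passed as `where=` to `collection.query(...)`. It only works when the candidate values can be enumerated, which is also what makes the list size a concern.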

Versions

Chroma 1.0.7, Python 3.11.12

Relevant log output


hesreallyhim, Apr 25 '25 21:04