chroma icon indicating copy to clipboard operation
chroma copied to clipboard

[ENH][TST] more advanced metadata string operations filtering with $like and $nlike operators

Open CrosswaveOmega opened this issue 1 year ago • 23 comments

More advanced string filtering for Metadata with $like and $nlike operators.

I needed to be able to filter results based on if some metadata fields contained a certain substring, however chroma's string filtering for metadata was too limited.

Therefore, I modified the library to allow for usable $like and $nlike operators that can be used with any Collection's get,query,and delete methods.

Description of changes

Added $like and $nlike operands for where kwarg filtering

  • Improvements & Bug fixes
    • ...
  • New functionality
    • The ability to include the "$like" and "$nlike" operators with the where kwarg.
    • Typing for these two new operators has been added to chromadb/types.py
    • These operands are meant to be used with "%" and "_" wildcard operators.
    • operands throw an error if you attempt to use them with a non string value.
    • added new tests for evaluating the "$like" operator and "$nlike" operators.

Example usage

collection=chromaclient.get_collection('example_collection')
results=collection.get(where={"string_value": {"$like": "%example%"}})

Test plan

How are these changes tested?

  • [x] Tests pass locally with pytest for python, yarn test for js

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

Mild documentation changes will be required.

CrosswaveOmega avatar Sep 29 '23 16:09 CrosswaveOmega

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • [ ] Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • [ ] Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • [ ] If appropriate, are there adequate property based tests?
  • [ ] If appropriate, are there adequate unit tests?
  • [ ] Should any logging, debugging, tracing information be added or removed?
  • [ ] Are error messages user-friendly?
  • [ ] Have all documentation changes needed been made?
  • [ ] Have all non-obvious changes been commented?

System Compatibility

  • [ ] Are there any potential impacts on other parts of the system or backward compatibility?
  • [ ] Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • [ ] Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)

github-actions[bot] avatar Sep 29 '23 16:09 github-actions[bot]

@CrosswaveOmega, this is awesome. Are you familiar with the hypothesis framework? The reason for asking is that, ideally, we want property tests that verify various (statistical) properties of querying with filters. Have a look here:

https://github.com/chroma-core/chroma/pull/1081/files#diff-403dd784c71c55196b6483bc3977c92f24a327a20b3d73429c1ba9a99ef9ef43

  • chromadb/test/property/strategies.py
  • chromadb/test/property/test_filtering.py

tazarov avatar Sep 29 '23 17:09 tazarov

@tazarov I am not familiar with the hypothesis framework, though I can learn it if needed.

What statistical properties would need to be verified in the case of the like operator?

CrosswaveOmega avatar Sep 29 '23 18:09 CrosswaveOmega

I think it may be nice to match the $contains syntax of the where_document clause. And @tazarov had been working on supporting $not.

My main concern here is that without indexes this can be quite slow.

HammadB avatar Sep 29 '23 18:09 HammadB

Initially, the plan was to simply match the $contains syntax from the where_document clause, until I noticed that one of the tests for validation was for ensuring that $contains couldn't be used in this context. Which is why I chose to use $like and $nlike instead, since I could use wildcard characters that way.

I actually haven't noticed any index related speed issues yet while testing, though I'm pretty sure that's because my sqlite version has automatic indexing on by default. If need be, a special metadata index creation method for specific keys could be implemented.

CrosswaveOmega avatar Sep 29 '23 20:09 CrosswaveOmega

I reviewed the hypothesis framework and implemented a variant of the metadata filtering tests for $like/$nlike.

Because $like is filtering based on a substring, additional metadata keys are added to the known_metadata_keys, and are populated with random combinations of the known_document_keywords (as to not add yet another set of known keywords to the collection test object.)

CrosswaveOmega avatar Oct 05 '23 19:10 CrosswaveOmega

ensuring that this PR remains compatible with the changes in main.

CrosswaveOmega avatar Oct 29 '23 20:10 CrosswaveOmega

Really looking forward to being able to use $like $nlike filtering. Hope this gets merged soon!

gururise avatar Nov 27 '23 18:11 gururise

@CrosswaveOmega just checking in if you plan to continue here :)

jeffchuber avatar Dec 19 '23 23:12 jeffchuber

Implemented the nitpicks brought up by @tazarov after consideration.

@jeffchuber I do plan on continuing here.

CrosswaveOmega avatar Jan 11 '24 21:01 CrosswaveOmega

I'm not sure what else should be done for this pull request.

Is there anything else in particular I should focus on before a final review & merge?

CrosswaveOmega avatar Jan 26 '24 16:01 CrosswaveOmega

If there is anything I can do to help reviewing/merging this please PR let me know.

pevogam avatar Jan 31 '24 01:01 pevogam

Waiting for this to be merged

mim201820 avatar Feb 02 '24 10:02 mim201820

Any chance of getting this merged? The lack of a '$like' filtering operator in default Chroma is really holding us back. @jeffchuber

gururise avatar Feb 14 '24 14:02 gururise

I've updated the tests based on the feedback by @tazarov, including the additional case sensitive $like test.

CrosswaveOmega avatar Mar 04 '24 19:03 CrosswaveOmega

Hi, any idea when this great feature could be merged? I think we all need it so muuuch :D

matined avatar Mar 21 '24 13:03 matined

When will this awesome feature be merged^_^

gyula-coder avatar Apr 28 '24 12:04 gyula-coder

Hi all, is there still interest in proceeding with this pull request? Is it normal for a pull request to reach a year or so for the Chroma project? (I am just trying to determine the interest in merging this to know whether to keep waiting)

pevogam avatar Jun 17 '24 15:06 pevogam

@pevogam I'm still interested in proceeding, though from what I understand it can't be merged yet because like/nlike doesn't work with distributed chroma, and would need to be implemented in the rust backend for distributed chroma.

CrosswaveOmega avatar Jun 17 '24 16:06 CrosswaveOmega

This is sorely needed feature in Chroma. I hope it gets merged asap.

gururise avatar Jun 17 '24 16:06 gururise

Hi all, is there still interest in proceeding with this pull request? Is it normal for a pull request to reach a year or so for the Chroma project? (I am just trying to determine the interest in merging this to know whether to keep waiting)

Just switch to a different vector store. E.g. milvus

svenpieper avatar Jun 17 '24 16:06 svenpieper

@pevogam I'm still interested in proceeding, though from what I understand it can't be merged yet because like/nlike doesn't work with distributed chroma, and would need to be implemented in the rust backend for distributed chroma.

IIUC a feature available in some Chroma versions is still better than a feature that is not available in any Chroma version. But I might misunderstand what you mean or what the maintainers need - partial feature is still net positive as long as it doesn't introduce a regression anywhere else.

pevogam avatar Jun 18 '24 15:06 pevogam

Waiting for this to be merged

simplast avatar Jul 19 '24 08:07 simplast

@pevogam I'm still interested in proceeding, though from what I understand it can't be merged yet because like/nlike doesn't work with distributed chroma, and would need to be implemented in the rust backend for distributed chroma.

@CrosswaveOmega, thanks for the work. Is there any additional issue related to distributed chroma feature? Maybe we can find out what needs to be done and continue implementing/merging this feature. Any information would be appreciated regarding related issues or tasks

sevgicansalih avatar Aug 22 '24 19:08 sevgicansalih