chroma
chroma copied to clipboard
[ENH][TST] more advanced metadata string operations filtering with $like and $nlike operators
More advanced string filtering for Metadata with $like and $nlike operators.
I needed to be able to filter results based on if some metadata fields contained a certain substring, however chroma's string filtering for metadata was too limited.
Therefore, I modified the library to allow for usable $like
and $nlike
operators
that can be used with any Collection's get
,query
,and delete
methods.
Description of changes
Added $like and $nlike operands for where
kwarg filtering
- Improvements & Bug fixes
- ...
- New functionality
- The ability to include the "$like" and "$nlike" operators with the
where
kwarg. - Typing for these two new operators has been added to
chromadb/types.py
- These operands are meant to be used with "%" and "_" wildcard operators.
- operands throw an error if you attempt to use them with a non string value.
- added new tests for evaluating the "$like" operator and "$nlike" operators.
- The ability to include the "$like" and "$nlike" operators with the
Example usage
collection=chromaclient.get_collection('example_collection')
results=collection.get(where={"string_value": {"$like": "%example%"}})
Test plan
How are these changes tested?
- [x] Tests pass locally with
pytest
for python,yarn test
for js
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
Mild documentation changes will be required.
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving
Testing, Bugs, Errors, Logs, Documentation
- [ ] Can you think of any use case in which the code does not behave as intended? Have they been tested?
- [ ] Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
- [ ] If appropriate, are there adequate property based tests?
- [ ] If appropriate, are there adequate unit tests?
- [ ] Should any logging, debugging, tracing information be added or removed?
- [ ] Are error messages user-friendly?
- [ ] Have all documentation changes needed been made?
- [ ] Have all non-obvious changes been commented?
System Compatibility
- [ ] Are there any potential impacts on other parts of the system or backward compatibility?
- [ ] Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?
Quality
- [ ] Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)
@CrosswaveOmega, this is awesome. Are you familiar with the hypothesis framework? The reason for asking is that, ideally, we want property tests that verify various (statistical) properties of querying with filters. Have a look here:
https://github.com/chroma-core/chroma/pull/1081/files#diff-403dd784c71c55196b6483bc3977c92f24a327a20b3d73429c1ba9a99ef9ef43
- chromadb/test/property/strategies.py
- chromadb/test/property/test_filtering.py
@tazarov I am not familiar with the hypothesis framework, though I can learn it if needed.
What statistical properties would need to be verified in the case of the like operator?
I think it may be nice to match the $contains syntax of the where_document clause. And @tazarov had been working on supporting $not.
My main concern here is that without indexes this can be quite slow.
Initially, the plan was to simply match the $contains syntax from the where_document clause, until I noticed that one of the tests for validation was for ensuring that $contains couldn't be used in this context. Which is why I chose to use $like and $nlike instead, since I could use wildcard characters that way.
I actually haven't noticed any index related speed issues yet while testing, though I'm pretty sure that's because my sqlite version has automatic indexing on by default. If need be, a special metadata index creation method for specific keys could be implemented.
I reviewed the hypothesis framework and implemented a variant of the metadata filtering tests for $like/$nlike.
Because $like is filtering based on a substring, additional metadata keys are added to the known_metadata_keys, and are populated with random combinations of the known_document_keywords (as to not add yet another set of known keywords to the collection test object.)
ensuring that this PR remains compatible with the changes in main.
Really looking forward to being able to use $like $nlike filtering. Hope this gets merged soon!
@CrosswaveOmega just checking in if you plan to continue here :)
Implemented the nitpicks brought up by @tazarov after consideration.
@jeffchuber I do plan on continuing here.
I'm not sure what else should be done for this pull request.
Is there anything else in particular I should focus on before a final review & merge?
If there is anything I can do to help reviewing/merging this please PR let me know.
Waiting for this to be merged
Any chance of getting this merged? The lack of a '$like' filtering operator in default Chroma is really holding us back. @jeffchuber
I've updated the tests based on the feedback by @tazarov, including the additional case sensitive $like
test.
Hi, any idea when this great feature could be merged? I think we all need it so muuuch :D
When will this awesome feature be merged^_^
Hi all, is there still interest in proceeding with this pull request? Is it normal for a pull request to reach a year or so for the Chroma project? (I am just trying to determine the interest in merging this to know whether to keep waiting)
@pevogam I'm still interested in proceeding, though from what I understand it can't be merged yet because like/nlike doesn't work with distributed chroma, and would need to be implemented in the rust backend for distributed chroma.
This is sorely needed feature in Chroma. I hope it gets merged asap.
Hi all, is there still interest in proceeding with this pull request? Is it normal for a pull request to reach a year or so for the Chroma project? (I am just trying to determine the interest in merging this to know whether to keep waiting)
Just switch to a different vector store. E.g. milvus
@pevogam I'm still interested in proceeding, though from what I understand it can't be merged yet because like/nlike doesn't work with distributed chroma, and would need to be implemented in the rust backend for distributed chroma.
IIUC a feature available in some Chroma versions is still better than a feature that is not available in any Chroma version. But I might misunderstand what you mean or what the maintainers need - partial feature is still net positive as long as it doesn't introduce a regression anywhere else.
Waiting for this to be merged
@pevogam I'm still interested in proceeding, though from what I understand it can't be merged yet because like/nlike doesn't work with distributed chroma, and would need to be implemented in the rust backend for distributed chroma.
@CrosswaveOmega, thanks for the work. Is there any additional issue related to distributed chroma feature? Maybe we can find out what needs to be done and continue implementing/merging this feature. Any information would be appreciated regarding related issues or tasks