StructuredQuery: "and/or" Operation should never have just one argument
This PR adds a validation step for StructuredQuery instances whose and/or Operations have only a single argument.
Context
I have some metadata attributes on my Chroma docs, and I create a SelfQueryRetriever with this information:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
# Note: `llm` and `context_chroma` (the Chroma vector store) are defined elsewhere.
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="File path to the source document",
        type="string",
    ),
    AttributeInfo(
        name="data_scope",
        description="Type/scope of linguistic data in document",
        type="string",
    ),
    AttributeInfo(
        name="verse_ref",
        description="Complete BOK CH:VS reference for verse (in USFM format)",
        type="string",
    ),
    AttributeInfo(
        name="book",
        description="Book name",
        type="string",
    ),
    AttributeInfo(
        name="chapter",
        description="Chapter number",
        type="integer",
    ),
    AttributeInfo(
        name="verse",
        description="Verse number",
        type="integer",
    ),
]
document_content_description = "Linguistic data about a bible verse"
retriever = SelfQueryRetriever.from_llm(
    llm, context_chroma, document_content_description, metadata_field_info, verbose=True
)
Problem encountered
When I try to retrieve documents, the parser may return a StructuredQuery whose filter is an Operation (e.g., 'and', 'or') wrapping only a single argument.
Input:
print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))
Output (with some extra print statements):
these were the inputs: {'query': 'jesus speaks to peter in the book of matthew'}
this was the query: query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew')]) limit=None
And then we encounter an error when we try to actually query the Chroma database:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[28], line 1
----> 1 print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))
2 print(retriever.get_relevant_documents('jesus speaks to peter in Luke 9:20'))
File /opt/homebrew/lib/python3.10/site-packages/langchain/retrievers/self_query/base.py:104, in SelfQueryRetriever.get_relevant_documents(self, query)
101 new_kwargs["k"] = structured_query.limit
103 search_kwargs = {**self.search_kwargs, **new_kwargs}
--> 104 docs = self.vectorstore.search(new_query, self.search_type, **search_kwargs)
105 return docs
File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/base.py:82, in VectorStore.search(self, query, search_type, **kwargs)
80 """Return docs most similar to query using specified search type."""
81 if search_type == "similarity":
---> 82 return self.similarity_search(query, **kwargs)
83 elif search_type == "mmr":
84 return self.max_marginal_relevance_search(query, **kwargs)
File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:182, in Chroma.similarity_search(self, query, k, filter, **kwargs)
165 def similarity_search(
166 self,
167 query: str,
(...)
170 **kwargs: Any,
171 ) -> List[Document]:
172 """Run similarity search with Chroma.
173
174 Args:
(...)
180 List[Document]: List of documents most similar to the query text.
181 """
--> 182 docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)
183 return [doc for doc, _ in docs_and_scores]
File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:229, in Chroma.similarity_search_with_score(self, query, k, filter, **kwargs)
227 else:
228 query_embedding = self._embedding_function.embed_query(query)
--> 229 results = self.__query_collection(
230 query_embeddings=[query_embedding], n_results=k, where=filter
231 )
233 return _results_to_docs_and_scores(results)
File /opt/homebrew/lib/python3.10/site-packages/langchain/utils.py:52, in xor_args.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
46 invalid_group_names = [", ".join(arg_groups[i]) for i in invalid_groups]
47 raise ValueError(
48 "Exactly one argument in each of the following"
49 " groups must be defined:"
50 f" {', '.join(invalid_group_names)}"
51 )
---> 52 return func(*args, **kwargs)
File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:121, in Chroma.__query_collection(self, query_texts, query_embeddings, n_results, where, **kwargs)
119 for i in range(n_results, 0, -1):
120 try:
--> 121 return self._collection.query(
122 query_texts=query_texts,
123 query_embeddings=query_embeddings,
124 n_results=i,
125 where=where,
126 **kwargs,
127 )
128 except chromadb.errors.NotEnoughElementsException:
129 logger.error(
130 f"Chroma collection {self._collection.name} "
131 f"contains fewer than {i} elements."
132 )
File /opt/homebrew/lib/python3.10/site-packages/chromadb/api/models/Collection.py:188, in Collection.query(self, query_embeddings, query_texts, n_results, where, where_document, include)
161 def query(
162 self,
163 query_embeddings: Optional[OneOrMany[Embedding]] = None,
(...)
168 include: Include = ["metadatas", "documents", "distances"],
169 ) -> QueryResult:
170 """Get the n_results nearest neighbor embeddings for provided query_embeddings or query_texts.
171
172 Args:
(...)
186
187 """
--> 188 where = validate_where(where) if where else None
189 where_document = (
190 validate_where_document(where_document) if where_document else None
191 )
192 query_embeddings = (
193 validate_embeddings(maybe_cast_one_to_many(query_embeddings))
194 if query_embeddings is not None
195 else None
196 )
File /opt/homebrew/lib/python3.10/site-packages/chromadb/api/types.py:148, in validate_where(where)
144 raise ValueError(
145 f"Expected where value for $and or $or to be a list of where expressions, got {value}"
146 )
147 if len(value) <= 1:
--> 148 raise ValueError(
149 f"Expected where value for $and or $or to be a list with at least two where expressions, got {value}"
150 )
151 for where_expression in value:
152 validate_where(where_expression)
ValueError: Expected where value for $and or $or to be a list with at least two where expressions, got [{'book': {'$eq': 'Matthew'}}]
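The failing check can be reproduced without a database. The sketch below is a simplified stand-in for the length check visible in the traceback, not chromadb's actual validate_where, and the name check_where is hypothetical. It also assumes the filter reached Chroma as a dict keyed by $and, which is what the error message suggests.

```python
from typing import Any, Dict


def check_where(where: Dict[str, Any]) -> None:
    """Hypothetical, simplified stand-in for the $and/$or check shown in the traceback."""
    for key, value in where.items():
        if key in ("$and", "$or"):
            if not isinstance(value, list):
                raise ValueError(
                    "Expected where value for $and or $or to be a list of "
                    f"where expressions, got {value}"
                )
            # This is the condition that rejects a single-argument and/or.
            if len(value) <= 1:
                raise ValueError(
                    "Expected where value for $and or $or to be a list with "
                    f"at least two where expressions, got {value}"
                )
            for where_expression in value:
                check_where(where_expression)


# Assumed shape of the filter produced from the single-argument Operation.
wrapped = {"$and": [{"book": {"$eq": "Matthew"}}]}
# The same comparison passed directly, without the $and wrapper.
unwrapped = {"book": {"$eq": "Matthew"}}

try:
    check_where(wrapped)
    wrapped_ok = True
except ValueError:
    wrapped_ok = False

check_where(unwrapped)  # the bare comparison passes the check
```

Under these assumptions, the wrapped form is rejected while the bare comparison passes, which is why dropping the wrapper is enough to avoid the error.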
Solution implemented
When there should be one argument (and thus no Operation wrapper)
With my code modifications, this input:
# Query that should have only one argument:
print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))
generates this output:
these were the inputs: {'query': 'jesus speaks to peter in the book of matthew'}
this was the query: query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew')]) limit=None
Only one argument provided to the Operation. Passing argument directly instead of wrapping in Operation.
query='jesus speaks to peter' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew') limit=None
[Document(page_content="Social-Situational Context:\n\nThis word is part of the passage 'The Denial of Peter'\n - This passage is a Forewarning/Private Discussion situation, which can be described in typical terms as follows: [...] - Interpersonal activity focus pertains to the social interaction between participants, focusing on their roles, relationships, and attitudes.", metadata={'source': '/Users/ryderwishart/genesis/itemized_prose_contexts/MAT 26:75.txt_Social-Situational.txt', 'data_scope': 'Social-Situational', 'verse_ref': 'MAT 26:75', 'book': 'Matthew', 'chapter': '26', 'verse': '75'})]
When there should be multiple arguments (and thus there should be an Operation wrapper)
This input:
# Query that should have multiple arguments:
print(retriever.get_relevant_documents('jesus speaks to peter in Luke 9:20'))
generates this output:
these were the inputs: {'query': 'jesus speaks to peter in Luke 9:20'}
this was the query: query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Luke'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='chapter', value=9), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='verse', value=20)]) limit=None
[Document(page_content="Social-Situational Context:\n\nThis word is part of the passage 'Peter's Confession and Christ's Answer'\n - This passage [...] ός:\n\nDomain label: Whom or What Spoken or Written About\nCultural information for εἰμί:\n\nDomain label: State', metadata={'source': '/Users/ryderwishart/genesis/itemized_prose_contexts/LUK 9:20.txt_Cultural-encyclopedic.txt', 'data_scope': 'Cultural-encyclopedic', 'verse_ref': 'LUK 9:20', 'book': 'Luke', 'chapter': '9', 'verse': '20'})]
Conclusion
In short, the function correctly drops the Operation wrapper if there is only one argument passed to it.
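As a rough illustration of that behavior, the sketch below uses small stand-in dataclasses rather than langchain's real Comparison and Operation classes, and the helper name simplify_filter is hypothetical. It is not the code from this PR, only a minimal version of the same idea.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    """Stand-in for a single attribute comparison, e.g. book == 'Matthew'."""
    comparator: str
    attribute: str
    value: Any


@dataclass
class Operation:
    """Stand-in for a logical operation ('and' / 'or') over filter arguments."""
    operator: str
    arguments: List["FilterNode"]


FilterNode = Union[Comparison, Operation]


def simplify_filter(node: FilterNode) -> FilterNode:
    """Replace any and/or Operation that has exactly one argument with that argument."""
    if isinstance(node, Comparison):
        return node
    # Simplify nested operations first so inner single-argument wrappers are dropped too.
    simplified = [simplify_filter(arg) for arg in node.arguments]
    if node.operator in ("and", "or") and len(simplified) == 1:
        return simplified[0]
    return Operation(operator=node.operator, arguments=simplified)


# Mirrors the two queries above: one argument vs. three arguments.
single = Operation("and", [Comparison("eq", "book", "Matthew")])
multi = Operation(
    "and",
    [
        Comparison("eq", "book", "Luke"),
        Comparison("eq", "chapter", 9),
        Comparison("eq", "verse", 20),
    ],
)

print(simplify_filter(single))  # a bare Comparison
print(simplify_filter(multi))   # still an Operation with three arguments
```

The recursion is a design choice of this sketch: it also unwraps single-argument operations nested inside larger ones, which the PR description does not explicitly cover.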
Contribution guidelines
I ran the formatting and linting. I don't think any of these linting errors are a result of my code:
% make lint
poetry run mypy .
langchain/evaluation/loading.py:5: **error:** Incompatible import of **"load_dataset"** (imported name has type **"Callable[[str, Optional[str], Optional[str], Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None], Union[str, Split, None], Optional[str], Optional[Features], Optional[DownloadConfig], Optional[GenerateMode], bool, Optional[bool], bool, Union[str, Version, None], Union[bool, str, None], Union[str, TaskTemplate, None], bool, Any, KwArg(Any)], Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]]"**, local name has type **"Callable[[str], List[Dict[Any, Any]]]"**)  [assignment]
langchain/evaluation/loading.py:8: **error:** No overload variant of **"__getitem__"** of **"list"** matches argument type **"str"**  [call-overload]
langchain/evaluation/loading.py:8: note: Possible overload variants:
langchain/evaluation/loading.py:8: note:     def __getitem__(self, SupportsIndex, /) -> Dict[Any, Any]
langchain/evaluation/loading.py:8: note:     def __getitem__(self, slice, /) -> List[Dict[Any, Any]]
langchain/vectorstores/mongodb_atlas.py:185: **error:** Argument 1 to **"aggregate"** of **"Collection"** has incompatible type **"List[object]"**; expected **"Sequence[Mapping[str, Any]]"**  [arg-type]
langchain/document_loaders/hugging_face_dataset.py:81: **error:** Item **"Dataset"** of **"Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]"** has no attribute **"keys"**  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:81: **error:** Item **"IterableDataset"** of **"Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]"** has no attribute **"keys"**  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:82: **error:** Value of type **"Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]"** is not indexable [index]
**Found 6 errors in 3 files (checked 1086 source files)**
make: *** [lint] Error 1
Who can review?
Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:
@dev2049 (authored original code) @hwchase17 (co-authored original code)
not sure i understand the problem. AND/OR(element1) is technically a valid (though inelegant) logical statement, no?
Depends on the logic being implemented. Chroma clearly assumes that "and" means a combination of statements and "or" means an alternation. However they justify it, it throws an error:
ValueError: Expected where value for $and or $or to be a list with at least two where expressions, got [{'book': {'$eq': 'Matthew'}}]
The arguments aren't equivalent to "some" or "any" (again, speaking just for Chroma, though potentially other vectorstores).
ah i see, missed that chroma was throwing actual errors. thanks for explaining!
merging in @dev2049 fix - thanks for flagging and discussion @ryderwishart !