StructuredQuery: "and/or" Operation should never have just one argument

This PR adds a validation step for StructuredQuery instances whose filter is an and/or Operation with only a single argument.

Context

I have some metadata attributes on my Chroma docs, and I create a SelfQueryRetriever with this information:

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info=[
    AttributeInfo(
        name="source",
        description="File path to the source document", 
        type="string", 
    ),
    AttributeInfo(
        name="data_scope",
        description="Type/scope of linguistic data in document",
        type="string", 
    ),
    AttributeInfo(
        name="verse_ref",
        description="Complete BOK CH:VS reference for verse (in USFM format)",
        type="string", 
    ),
    AttributeInfo(
        name="book",
        description="Book name",
        type="string", 
    ),
    AttributeInfo(
        name="chapter",
        description="Chapter number",
        type="integer", 
    ),
    AttributeInfo(
        name="verse",
        description="Verse number",
        type="integer", 
    ),
]
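document_content_description = "Linguistic data about a bible verse"
# `llm` (presumably an OpenAI instance) and `context_chroma` (the Chroma
# vector store holding the documents) are defined elsewhere.
retriever = SelfQueryRetriever.from_llm(
    llm,
    context_chroma,
    document_content_description,
    metadata_field_info,
    verbose=True,
)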

Problem encountered

When I try to retrieve documents, the parser may return a StructuredQuery whose filter is an Operation (e.g., 'and', 'or') wrapping only one argument.

Input:

print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))

Output (with some extra print statements):

these were the inputs: {'query': 'jesus speaks to peter in the book of matthew'} 

this was the query: query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew')]) limit=None

And then we encounter an error when we try to actually query the Chroma database:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))
      2 print(retriever.get_relevant_documents('jesus speaks to peter in Luke 9:20'))

File /opt/homebrew/lib/python3.10/site-packages/langchain/retrievers/self_query/base.py:104, in SelfQueryRetriever.get_relevant_documents(self, query)
    101     new_kwargs["k"] = structured_query.limit
    103 search_kwargs = {**self.search_kwargs, **new_kwargs}
--> 104 docs = self.vectorstore.search(new_query, self.search_type, **search_kwargs)
    105 return docs

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/base.py:82, in VectorStore.search(self, query, search_type, **kwargs)
     80 """Return docs most similar to query using specified search type."""
     81 if search_type == "similarity":
---> 82     return self.similarity_search(query, **kwargs)
     83 elif search_type == "mmr":
     84     return self.max_marginal_relevance_search(query, **kwargs)

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:182, in Chroma.similarity_search(self, query, k, filter, **kwargs)
    165 def similarity_search(
    166     self,
    167     query: str,
   (...)
    170     **kwargs: Any,
    171 ) -> List[Document]:
    172     """Run similarity search with Chroma.
    173 
    174     Args:
   (...)
    180         List[Document]: List of documents most similar to the query text.
    181     """
--> 182     docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)
    183     return [doc for doc, _ in docs_and_scores]

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:229, in Chroma.similarity_search_with_score(self, query, k, filter, **kwargs)
    227 else:
    228     query_embedding = self._embedding_function.embed_query(query)
--> 229     results = self.__query_collection(
    230         query_embeddings=[query_embedding], n_results=k, where=filter
    231     )
    233 return _results_to_docs_and_scores(results)

File /opt/homebrew/lib/python3.10/site-packages/langchain/utils.py:52, in xor_args.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     46     invalid_group_names = [", ".join(arg_groups[i]) for i in invalid_groups]
     47     raise ValueError(
     48         "Exactly one argument in each of the following"
     49         " groups must be defined:"
     50         f" {', '.join(invalid_group_names)}"
     51     )
---> 52 return func(*args, **kwargs)

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:121, in Chroma.__query_collection(self, query_texts, query_embeddings, n_results, where, **kwargs)
    119 for i in range(n_results, 0, -1):
    120     try:
--> 121         return self._collection.query(
    122             query_texts=query_texts,
    123             query_embeddings=query_embeddings,
    124             n_results=i,
    125             where=where,
    126             **kwargs,
    127         )
    128     except chromadb.errors.NotEnoughElementsException:
    129         logger.error(
    130             f"Chroma collection {self._collection.name} "
    131             f"contains fewer than {i} elements."
    132         )

File /opt/homebrew/lib/python3.10/site-packages/chromadb/api/models/Collection.py:188, in Collection.query(self, query_embeddings, query_texts, n_results, where, where_document, include)
    161 def query(
    162     self,
    163     query_embeddings: Optional[OneOrMany[Embedding]] = None,
   (...)
    168     include: Include = ["metadatas", "documents", "distances"],
    169 ) -> QueryResult:
    170     """Get the n_results nearest neighbor embeddings for provided query_embeddings or query_texts.
    171 
    172     Args:
   (...)
    186 
    187     """
--> 188     where = validate_where(where) if where else None
    189     where_document = (
    190         validate_where_document(where_document) if where_document else None
    191     )
    192     query_embeddings = (
    193         validate_embeddings(maybe_cast_one_to_many(query_embeddings))
    194         if query_embeddings is not None
    195         else None
    196     )

File /opt/homebrew/lib/python3.10/site-packages/chromadb/api/types.py:148, in validate_where(where)
    144     raise ValueError(
    145         f"Expected where value for $and or $or to be a list of where expressions, got {value}"
    146     )
    147 if len(value) <= 1:
--> 148     raise ValueError(
    149         f"Expected where value for $and or $or to be a list with at least two where expressions, got {value}"
    150     )
    151 for where_expression in value:
    152     validate_where(where_expression)

ValueError: Expected where value for $and or $or to be a list with at least two where expressions, got [{'book': {'$eq': 'Matthew'}}]
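
The failure can be reproduced without a database. The sketch below is a simplified stand-in for the length check visible in the traceback, not Chroma's actual `validate_where`; the function name `check_where` is hypothetical and only the `$and`/`$or` rule is modelled.

```python
from typing import Any, Dict


def check_where(where: Dict[str, Any]) -> None:
    """Simplified stand-in for the $and/$or length check in the traceback."""
    for key, value in where.items():
        if key in ("$and", "$or"):
            if not isinstance(value, list):
                raise ValueError(
                    f"Expected where value for $and or $or to be a list, got {value}"
                )
            if len(value) <= 1:
                raise ValueError(
                    "Expected where value for $and or $or to be a list with "
                    f"at least two where expressions, got {value}"
                )
            # Nested expressions are checked recursively.
            for expression in value:
                check_where(expression)


# The filter produced for 'jesus speaks to peter in the book of matthew':
single = {"$and": [{"book": {"$eq": "Matthew"}}]}
# The same condition without the wrapper:
bare = {"book": {"$eq": "Matthew"}}

check_where(bare)  # accepted: no $and/$or wrapper to validate
try:
    check_where(single)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

Under these assumptions the bare comparison is accepted, while the same comparison wrapped in a one-element `$and` is rejected.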

Solution implemented

When there should be one argument (and thus no `Operation` wrapper)

With my code modifications, this input:

# Query that should have only one argument:
print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))

generates this output:

these were the inputs:  {'query': 'jesus speaks to peter in the book of matthew'}

this was the query:  query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew')]) limit=None

Only one argument provided to the Operation. Passing argument directly instead of wrapping in Operation.

query='jesus speaks to peter' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew') limit=None

[Document(page_content="Social-Situational Context:\n\nThis word is part of the passage 'The Denial of Peter'\n  - This passage is a Forewarning/Private Discussion situation, which can be described in typical terms as follows: [...]  - Interpersonal activity focus pertains to the social interaction between participants, focusing on their roles, relationships, and attitudes.", metadata={'source': '/Users/ryderwishart/genesis/itemized_prose_contexts/MAT 26:75.txt_Social-Situational.txt', 'data_scope': 'Social-Situational', 'verse_ref': 'MAT 26:75', 'book': 'Matthew', 'chapter': '26', 'verse': '75'})]

When there should be multiple arguments (and thus there should be an `Operation` wrapper)

This input:

# Query that should have multiple arguments:
print(retriever.get_relevant_documents('jesus speaks to peter in Luke 9:20'))

generates this output:

these were the inputs:  {'query': 'jesus speaks to peter in Luke 9:20'}

this was the query:  query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Luke'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='chapter', value=9), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='verse', value=20)]) limit=None

[Document(page_content="Social-Situational Context:\n\nThis word is part of the passage 'Peter's Confession and Christ's Answer'\n  - This passage [...] ός:\n\nDomain label: Whom or What Spoken or Written About\nCultural information for εἰμί:\n\nDomain label: State', metadata={'source': '/Users/ryderwishart/genesis/itemized_prose_contexts/LUK 9:20.txt_Cultural-encyclopedic.txt', 'data_scope': 'Cultural-encyclopedic', 'verse_ref': 'LUK 9:20', 'book': 'Luke', 'chapter': '9', 'verse': '20'})]

Conclusion

In short, with this change a single-argument Operation is unwrapped and its argument is passed on directly, while an Operation with multiple arguments is left unchanged.
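
The unwrapping step can be sketched as follows. This is a minimal illustration that uses stand-in dataclasses rather than langchain's own `Comparison` and `Operation` classes, and `simplify_filter` is a hypothetical name, not the function changed in this PR.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    """Stand-in for a single attribute comparison, e.g. book == 'Matthew'."""
    comparator: str
    attribute: str
    value: Any


@dataclass
class Operation:
    """Stand-in for a logical operation over one or more filter directives."""
    operator: str
    arguments: List["FilterDirective"]


FilterDirective = Union[Comparison, Operation]


def simplify_filter(node: FilterDirective) -> FilterDirective:
    """Replace any single-argument and/or Operation with its only argument."""
    if not isinstance(node, Operation):
        return node
    # Simplify nested operations first, then decide whether to unwrap this one.
    arguments = [simplify_filter(arg) for arg in node.arguments]
    if node.operator in ("and", "or") and len(arguments) == 1:
        return arguments[0]
    return Operation(operator=node.operator, arguments=arguments)


book = Comparison("eq", "book", "Luke")
chapter = Comparison("eq", "chapter", 9)

print(simplify_filter(Operation("and", [book])))           # wrapper dropped
print(simplify_filter(Operation("and", [book, chapter])))  # wrapper kept
```

Dropping the wrapper does not change the meaning of the filter, since `and` or `or` applied to a single condition is logically the same as that condition on its own.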

Contribution guidelines

I ran the formatting and linting. I don't think any of these linting errors are a result of my code:

% make lint
poetry run mypy .
langchain/evaluation/loading.py:5: error: Incompatible import of "load_dataset" (imported name has type "Callable[[str, Optional[str], Optional[str], Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None], Union[str, Split, None], Optional[str], Optional[Features], Optional[DownloadConfig], Optional[GenerateMode], bool, Optional[bool], bool, Union[str, Version, None], Union[bool, str, None], Union[str, TaskTemplate, None], bool, Any, KwArg(Any)], Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]]", local name has type "Callable[[str], List[Dict[Any, Any]]]")  [assignment]
langchain/evaluation/loading.py:8: error: No overload variant of "__getitem__" of "list" matches argument type "str"  [call-overload]
langchain/evaluation/loading.py:8: note: Possible overload variants:
langchain/evaluation/loading.py:8: note:     def __getitem__(self, SupportsIndex, /) -> Dict[Any, Any]
langchain/evaluation/loading.py:8: note:     def __getitem__(self, slice, /) -> List[Dict[Any, Any]]
langchain/vectorstores/mongodb_atlas.py:185: error: Argument 1 to "aggregate" of "Collection" has incompatible type "List[object]"; expected "Sequence[Mapping[str, Any]]"  [arg-type]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "Dataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "IterableDataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:82: error: Value of type "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" is not indexable  [index]
Found 6 errors in 3 files (checked 1086 source files)
make: *** [lint] Error 1

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@dev2049 (authored original code) @hwchase17 (co-authored original code)

Jun 01 '23 16:06 ryderwishart

not sure i understand the problem. AND/OR(element1) is technically a valid (though inelegant) logical statement, no?

Jun 02 '23 19:06 dev2049

> not sure i understand the problem. AND/OR(element1) is technically a valid (though inelegant) logical statement, no?

Depends on the logic being implemented. Chroma is clearly assuming `and` means a combination of statements, and `or` means an alternation. However they justify it, it throws an error:

ValueError: Expected where value for $and or $or to be a list with at least two where expressions, got [{'book': {'$eq': 'Matthew'}}]

The operators aren't equivalent to `some` or `any` (again, speaking just for Chroma, though potentially other vectorstores as well).

Jun 02 '23 19:06 ryderwishart

> not sure i understand the problem. AND/OR(element1) is technically a valid (though inelegant) logical statement, no?
>
> Depends on the logic being implemented. Chroma is clearly assuming `and` means a combination of statements, and `or` means an alternation. However they justify it, it throws an error:
>
> ValueError: Expected where value for $and or $or to be a list with at least two where expressions, got [{'book': {'$eq': 'Matthew'}}]
>
> The operators aren't equivalent to `some` or `any` (again, speaking just for Chroma, though potentially other vectorstores as well).

ah i see, missed that chroma was throwing actual errors. thanks for explaining!

Jun 02 '23 21:06 dev2049

merging in @dev2049 fix - thanks for flagging and discussion @ryderwishart !

Jun 03 '23 22:06 hwchase17
