StructuredQuery: "and/or" Operation should never have just one argument

Open ryderwishart opened this issue 2 years ago • 3 comments

This PR adds a validation step for StructuredQuery instances with single-argument and/or Operations

Context

I have some metadata attributes on my Chroma docs, and I create a SelfQueryRetriever with this information:

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info=[
    AttributeInfo(
        name="source",
        description="File path to the source document", 
        type="string", 
    ),
    AttributeInfo(
        name="data_scope",
        description="Type/scope of linguistic data in document",
        type="string", 
    ),
    AttributeInfo(
        name="verse_ref",
        description="Complete BOK CH:VS reference for verse (in USFM format)",
        type="string", 
    ),
    AttributeInfo(
        name="book",
        description="Book name",
        type="string", 
    ),
    AttributeInfo(
        name="chapter",
        description="Chapter number",
        type="integer", 
    ),
    AttributeInfo(
        name="verse",
        description="Verse number",
        type="integer", 
    ),
]
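document_content_description = "Linguistic data about a bible verse"
llm = OpenAI(temperature=0)  # assumed setup; the original snippet does not show how llm was created
# context_chroma is an existing Chroma vector store holding the documents (built elsewhere)
retriever = SelfQueryRetriever.from_llm(
    llm, context_chroma, document_content_description, metadata_field_info, verbose=True
)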

Problem encountered

When I try to retrieve documents, the parser may return a StructuredQuery with only one argument wrapped in an Operation (e.g., 'and', 'or').

Input:

print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))

Output (with some extra print statements):

these were the inputs: {'query': 'jesus speaks to peter in the book of matthew'} 

this was the query: query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew')]) limit=None

And then we encounter an error when we try to actually query the Chroma database:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))
      2 print(retriever.get_relevant_documents('jesus speaks to peter in Luke 9:20'))

File /opt/homebrew/lib/python3.10/site-packages/langchain/retrievers/self_query/base.py:104, in SelfQueryRetriever.get_relevant_documents(self, query)
    101     new_kwargs["k"] = structured_query.limit
    103 search_kwargs = {**self.search_kwargs, **new_kwargs}
--> 104 docs = self.vectorstore.search(new_query, self.search_type, **search_kwargs)
    105 return docs

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/base.py:82, in VectorStore.search(self, query, search_type, **kwargs)
     80 """Return docs most similar to query using specified search type."""
     81 if search_type == "similarity":
---> 82     return self.similarity_search(query, **kwargs)
     83 elif search_type == "mmr":
     84     return self.max_marginal_relevance_search(query, **kwargs)

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:182, in Chroma.similarity_search(self, query, k, filter, **kwargs)
    165 def similarity_search(
    166     self,
    167     query: str,
   (...)
    170     **kwargs: Any,
    171 ) -> List[Document]:
    172     """Run similarity search with Chroma.
    173 
    174     Args:
   (...)
    180         List[Document]: List of documents most similar to the query text.
    181     """
--> 182     docs_and_scores = self.similarity_search_with_score(query, k, filter=filter)
    183     return [doc for doc, _ in docs_and_scores]

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:229, in Chroma.similarity_search_with_score(self, query, k, filter, **kwargs)
    227 else:
    228     query_embedding = self._embedding_function.embed_query(query)
--> 229     results = self.__query_collection(
    230         query_embeddings=[query_embedding], n_results=k, where=filter
    231     )
    233 return _results_to_docs_and_scores(results)

File /opt/homebrew/lib/python3.10/site-packages/langchain/utils.py:52, in xor_args.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     46     invalid_group_names = [", ".join(arg_groups[i]) for i in invalid_groups]
     47     raise ValueError(
     48         "Exactly one argument in each of the following"
     49         " groups must be defined:"
     50         f" {', '.join(invalid_group_names)}"
     51     )
---> 52 return func(*args, **kwargs)

File /opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/chroma.py:121, in Chroma.__query_collection(self, query_texts, query_embeddings, n_results, where, **kwargs)
    119 for i in range(n_results, 0, -1):
    120     try:
--> 121         return self._collection.query(
    122             query_texts=query_texts,
    123             query_embeddings=query_embeddings,
    124             n_results=i,
    125             where=where,
    126             **kwargs,
    127         )
    128     except chromadb.errors.NotEnoughElementsException:
    129         logger.error(
    130             f"Chroma collection {self._collection.name} "
    131             f"contains fewer than {i} elements."
    132         )

File /opt/homebrew/lib/python3.10/site-packages/chromadb/api/models/Collection.py:188, in Collection.query(self, query_embeddings, query_texts, n_results, where, where_document, include)
    161 def query(
    162     self,
    163     query_embeddings: Optional[OneOrMany[Embedding]] = None,
   (...)
    168     include: Include = ["metadatas", "documents", "distances"],
    169 ) -> QueryResult:
    170     """Get the n_results nearest neighbor embeddings for provided query_embeddings or query_texts.
    171 
    172     Args:
   (...)
    186 
    187     """
--> 188     where = validate_where(where) if where else None
    189     where_document = (
    190         validate_where_document(where_document) if where_document else None
    191     )
    192     query_embeddings = (
    193         validate_embeddings(maybe_cast_one_to_many(query_embeddings))
    194         if query_embeddings is not None
    195         else None
    196     )

File /opt/homebrew/lib/python3.10/site-packages/chromadb/api/types.py:148, in validate_where(where)
    144     raise ValueError(
    145         f"Expected where value for $and or $or to be a list of where expressions, got {value}"
    146     )
    147 if len(value) <= 1:
--> 148     raise ValueError(
    149         f"Expected where value for $and or $or to be a list with at least two where expressions, got {value}"
    150     )
    151 for where_expression in value:
    152     validate_where(where_expression)

ValueError: Expected where value for $and or $or to be a list with at least two where expressions, got [{'book': {'$eq': 'Matthew'}}]
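
The failing check at the bottom of the traceback can be reproduced without a database. A minimal sketch, assuming a stand-in function `check_where` (a hypothetical name, not chromadb's `validate_where`) that copies only the two conditions visible in the traceback above:

```python
from typing import Any, Dict


def check_where(where: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the list-length check seen in the traceback (not chromadb code)."""
    for key, value in where.items():
        if key in ("$and", "$or"):
            if not isinstance(value, list):
                raise ValueError(
                    "Expected where value for $and or $or to be a list "
                    f"of where expressions, got {value}"
                )
            # This is the condition that rejects a single-argument and/or.
            if len(value) <= 1:
                raise ValueError(
                    "Expected where value for $and or $or to be a list "
                    f"with at least two where expressions, got {value}"
                )
            for where_expression in value:
                check_where(where_expression)
    return where


# Filter produced from the single-argument Operation in this report.
single_argument = {"$and": [{"book": {"$eq": "Matthew"}}]}
# The same condition with the wrapper removed.
bare_comparison = {"book": {"$eq": "Matthew"}}
```

Under this sketch, `check_where(bare_comparison)` returns the filter unchanged, while `check_where(single_argument)` raises the same `ValueError` message shown above.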

Solution implemented

When there should be one argument (and thus no Operation wrapper)

With my code modifications, this input:

# Query that should have only one argument:
print(retriever.get_relevant_documents('jesus speaks to peter in the book of matthew'))

generates this output:

these were the inputs:  {'query': 'jesus speaks to peter in the book of matthew'}

this was the query:  query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew')]) limit=None

Only one argument provided to the Operation. Passing argument directly instead of wrapping in Operation.

query='jesus speaks to peter' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Matthew') limit=None

[Document(page_content="Social-Situational Context:\n\nThis word is part of the passage 'The Denial of Peter'\n  - This passage is a Forewarning/Private Discussion situation, which can be described in typical terms as follows: [...]  - Interpersonal activity focus pertains to the social interaction between participants, focusing on their roles, relationships, and attitudes.", metadata={'source': '/Users/ryderwishart/genesis/itemized_prose_contexts/MAT 26:75.txt_Social-Situational.txt', 'data_scope': 'Social-Situational', 'verse_ref': 'MAT 26:75', 'book': 'Matthew', 'chapter': '26', 'verse': '75'})]

When there should be multiple arguments (and thus there should be an Operation wrapper)

This input:

# Query that should have multiple arguments:
print(retriever.get_relevant_documents('jesus speaks to peter in Luke 9:20'))

generates this output:

these were the inputs:  {'query': 'jesus speaks to peter in Luke 9:20'}

this was the query:  query='jesus speaks to peter' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='book', value='Luke'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='chapter', value=9), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='verse', value=20)]) limit=None

[Document(page_content="Social-Situational Context:\n\nThis word is part of the passage 'Peter's Confession and Christ's Answer'\n  - This passage [...] ός:\n\nDomain label: Whom or What Spoken or Written About\nCultural information for εἰμί:\n\nDomain label: State', metadata={'source': '/Users/ryderwishart/genesis/itemized_prose_contexts/LUK 9:20.txt_Cultural-encyclopedic.txt', 'data_scope': 'Cultural-encyclopedic', 'verse_ref': 'LUK 9:20', 'book': 'Luke', 'chapter': '9', 'verse': '20'})]

Conclusion

In short, the function correctly drops the Operation wrapper if there is only one argument passed to it.
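
The unwrapping step can be sketched roughly as follows. This is an illustration only, not the code in this PR or the fix that was eventually merged: the `Comparison` and `Operation` dataclasses are simplified stand-ins for the langchain classes of the same name, and `simplify_filter` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    """Simplified stand-in for langchain's Comparison."""
    comparator: str
    attribute: str
    value: Any


@dataclass
class Operation:
    """Simplified stand-in for langchain's Operation."""
    operator: str
    arguments: List["FilterDirective"]


FilterDirective = Union[Comparison, Operation]


def simplify_filter(node: FilterDirective) -> FilterDirective:
    """Drop 'and'/'or' wrappers that hold exactly one argument (hypothetical helper)."""
    if isinstance(node, Operation):
        # Simplify children first so nested single-argument wrappers collapse too.
        arguments = [simplify_filter(arg) for arg in node.arguments]
        if node.operator in ("and", "or") and len(arguments) == 1:
            return arguments[0]
        return Operation(node.operator, arguments)
    return node


wrapped = Operation("and", [Comparison("eq", "book", "Matthew")])
unwrapped = simplify_filter(wrapped)  # a bare Comparison on 'book'
```

Only "and" and "or" are unwrapped here; a "not" Operation legitimately takes one argument, so the sketch leaves it alone.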

Contribution guidelines

I ran the formatting and linting. I don't think any of these linting errors are a result of my code:

% make lint
poetry run mypy .
langchain/evaluation/loading.py:5: error: Incompatible import of "load_dataset" (imported name has type "Callable[[str, Optional[str], Optional[str], Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None], Union[str, Split, None], Optional[str], Optional[Features], Optional[DownloadConfig], Optional[GenerateMode], bool, Optional[bool], bool, Union[str, Version, None], Union[bool, str, None], Union[str, TaskTemplate, None], bool, Any, KwArg(Any)], Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]]", local name has type "Callable[[str], List[Dict[Any, Any]]]")  [assignment]
langchain/evaluation/loading.py:8: error: No overload variant of "__getitem__" of "list" matches argument type "str"  [call-overload]
langchain/evaluation/loading.py:8: note: Possible overload variants:
langchain/evaluation/loading.py:8: note:     def __getitem__(self, SupportsIndex, /) -> Dict[Any, Any]
langchain/evaluation/loading.py:8: note:     def __getitem__(self, slice, /) -> List[Dict[Any, Any]]
langchain/vectorstores/mongodb_atlas.py:185: error: Argument 1 to "aggregate" of "Collection" has incompatible type "List[object]"; expected "Sequence[Mapping[str, Any]]"  [arg-type]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "Dataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "IterableDataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:82: error: Value of type "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" is not indexable  [index]
Found 6 errors in 3 files (checked 1086 source files)
make: *** [lint] Error 1

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@dev2049 (authored original code) @hwchase17 (co-authored original code)

ryderwishart, Jun 01 '23 16:06

not sure i understand the problem. AND/OR(element1) is technically a valid (though inelegant) logical statement, no?

dev2049, Jun 02 '23 19:06

> not sure i understand the problem. AND/OR(element1) is technically a valid (though inelegant) logical statement, no?

Depends on the logic being implemented. Chroma is clearly assuming "and" means a combination of statements and "or" means an alternation. However they justify it, it throws an error:

ValueError: Expected where value for $and or $or to be a list with at least two where expressions, got [{'book': {'$eq': 'Matthew'}}]

The arguments aren't equivalent to "some" or "any" (again, speaking just for Chroma, though potentially other vectorstores).

ryderwishart, Jun 02 '23 19:06

ah i see, missed that chroma was throwing actual errors. thanks for explaining!

dev2049, Jun 02 '23 21:06

merging in @dev2049 fix - thanks for flagging and discussion @ryderwishart !

hwchase17, Jun 03 '23 22:06