llama_index
[Bug]: Weaviate MetadataFilters break on "numeric" strings
Bug Description
After updating llama index from 0.8.50 to 0.9.35 my vector query broke. MetadataFilters whose values are strings containing only digits throw an error complaining that valueNumber cannot be used on a text field and that valueText should be used instead.
Version
0.9.35
Steps to Reproduce
All you have to do is have a numeric string value in a metadata filter against a valueText field and the error will occur.
Seems to be a result of https://github.com/run-llama/llama_index/blob/851399a303a47972fb62b9bb8880434842e23dc3/llama_index/vector_stores/weaviate.py#L78 but not sure why that line was added in the first place
import weaviate
from llama_index.schema import TextNode
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.vector_stores.types import MetadataFilters, MetadataFilter, VectorStoreQuery

client = weaviate.Client(url="http://localhost:8080")
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="TestIndex", text_key="text")
vector_store.add(
    [
        TextNode(
            text="test 1",
            metadata={"article_id": "aaff"},
            embedding=[0, 0, 1],
        ),
        TextNode(
            text="test 2",
            metadata={"article_id": "1234"},
            embedding=[0, 1, 0],
        ),
        TextNode(
            text="test 3",
            metadata={"article_id": "3ff3"},
            embedding=[1, 0, 0],
        ),
    ]
)

# -- working query
query = VectorStoreQuery(
    query_embedding=[0, 0, 0],
    similarity_top_k=2,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="article_id", value="3ff3")  # NOTE: works fine because is not a number
        ]
    ),
)
results = vector_store.query(query)
print(results)

# -- below query breaks
query = VectorStoreQuery(
    query_embedding=[0, 0, 0],
    similarity_top_k=2,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="article_id", value="1234")  # NOTE: breaks because it's a number but the metadata field is a text field
        ]
    ),
)
results = vector_store.query(query)
print(results)
Relevant Logs/Tracebacks
    results = vector_store.query(query)
  File "/usr/local/lib/python3.10/site-packages/llama_index/vector_stores/weaviate.py", line 338, in query
    parsed_result = parse_get_response(query_result)
  File "/usr/local/lib/python3.10/site-packages/llama_index/vector_stores/weaviate_utils.py", line 65, in parse_get_response
    raise ValueError("Invalid query, got errors: {}".format(response["errors"]))
ValueError: Invalid query, got errors: [{'locations': [{'column': 6, 'line': 1}], 'message': 'invalid \'where\' filter: child operand at position 0: data type filter cannot use "valueNumber" on type "text", use "valueText" instead', 'path': ['Get', 'User91609c28065a470e96c78465a7f40d6dReport1']}]
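The suspected behaviour can be shown in isolation, without a running Weaviate instance. This is only a sketch: `suspected_value_key` is a hypothetical stand-in, and it assumes the linked line behaves like the `isnumeric()` branch that appears in the patched `_to_weaviate_filter` posted later in this thread.

```python
from typing import Any, Dict, Union


def suspected_value_key(value: Union[str, int, float]) -> Dict[str, Any]:
    """Hypothetical stand-in for the coercion suspected in _to_weaviate_filter."""
    if isinstance(value, str) and value.isnumeric():
        # A digits-only string is turned into a float and sent as valueNumber,
        # even though the target property was created as text.
        return {"valueNumber": float(value)}
    if isinstance(value, (int, float)):
        return {"valueNumber": value}
    return {"valueText": value}


print(suspected_value_key("3ff3"))  # stays a text filter
print(suspected_value_key("1234"))  # becomes a number filter
```

If that assumption holds, the second call is the one Weaviate rejects, because `article_id` is a text property.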
🤖
Hello again @KristianMischke! Nice to see you around. Hope your day's going well. Let's dive into this issue you're facing.
Thank you for providing a detailed description of the issue you're facing. It seems like the problem is related to how the WeaviateVectorStore class in LlamaIndex is handling metadata filters with numeric strings.
From the traceback you provided, the error message suggests that the WeaviateVectorStore is trying to use the valueNumber field for a text type, which is causing the issue. This is likely due to the fact that the WeaviateVectorStore is not correctly identifying the type of the metadata filter value and is treating it as a number because it contains numeric characters, even though it's actually a string.
The line of code you pointed out in the weaviate.py file is part of the _to_weaviate_filter function, which is used to transform standard filters to Weaviate specific filters. This function is then used in the query method of the WeaviateVectorStore class to apply these filters to the query.
As for why this line was added, without more context it's hard to say for sure, but it's possible that it was intended to handle cases where the metadata filter value is actually a number, and not a string containing numeric characters.
Unfortunately, I wasn't able to find any changes in the handling of MetadataFilters in the WeaviateVectorStore class between LlamaIndex versions 0.8.50 and 0.9.35, or any changes in the latest version 0.9.36, based on the provided context.
To resolve this issue, a possible solution could be to update the _to_weaviate_filter function to correctly identify the type of the metadata filter value and use the appropriate field (valueNumber for numbers and valueText for strings) based on its type.
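A minimal sketch of that idea follows: choose the Weaviate value key from the Python type of the value and never re-interpret strings. The helper names `value_key_for` and `build_operand` are hypothetical and are not part of LlamaIndex; the operand layout (`path`, `operator`, value key) follows the shape of Weaviate `where` filters.

```python
from typing import Any, Dict


def value_key_for(value: Any) -> str:
    """Pick a Weaviate filter value key from the Python type alone."""
    if isinstance(value, bool):
        # bool must be checked before int because bool is a subclass of int.
        return "valueBoolean"
    if isinstance(value, int):
        return "valueInt"
    if isinstance(value, float):
        return "valueNumber"
    # Strings are always sent as text, even when they contain only digits.
    return "valueText"


def build_operand(key: str, value: Any, operator: str = "Equal") -> Dict[str, Any]:
    """Build a single operand of a Weaviate `where` filter."""
    return {"path": [key], "operator": operator, value_key_for(value): value}


print(build_operand("article_id", "1234"))
```

With this approach the string "1234" stays a text filter. The Python type alone still cannot tell whether an int is aimed at an int property or a number property, so it is only a partial fix.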
I hope this helps! If you have any more questions or need further clarification, please don't hesitate to ask.
Thanks for raising @KristianMischke. It looks like you're right about where in the code this bug comes from. Just thinking of how best we can resolve this. We may need to expose another parameter for MetadataFilters that stipulates how to deal with strings that can be made numeric.
MetadataFilter(key="article_id", value="1234", value_type="str")
where value_type is an Enum.
@logan-markewich what do you think?
@nerdai I wonder if there's a way to use weaviate to check the type of a field, and coerce to the proper type when creating the weaviate filters?
Good point. I agree -- that should be what we consider first here.
Assigned it P1 priority @logan-markewich. Please do change this if you feel the need to do so.
Yeah, weaviate schema properties have data types: https://weaviate.io/developers/weaviate/config-refs/datatypes that could be used for conversion
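A rough sketch of what schema-driven conversion could look like is below. It assumes the class schema is available as a dict with a `properties` list whose entries carry `name` and `dataType`, which is the shape of Weaviate's schema JSON; how that dict is fetched from the client is left out. The helpers `property_types` and `typed_operand` are hypothetical, only three data types are mapped, and falling back to text for unknown properties is an assumption.

```python
from typing import Any, Callable, Dict, Tuple

# Weaviate property data type -> (filter value key, Python coercion).
# Only a subset of data types is covered in this sketch.
_DATATYPE_TO_FILTER: Dict[str, Tuple[str, Callable[[Any], Any]]] = {
    "text": ("valueText", str),
    "int": ("valueInt", int),
    "number": ("valueNumber", float),
}


def property_types(class_schema: Dict[str, Any]) -> Dict[str, str]:
    """Map each property name to its first declared data type."""
    return {p["name"]: p["dataType"][0] for p in class_schema.get("properties", [])}


def typed_operand(
    class_schema: Dict[str, Any], key: str, value: Any, operator: str = "Equal"
) -> Dict[str, Any]:
    """Build a `where` operand whose value key follows the schema, not the value."""
    data_type = property_types(class_schema).get(key, "text")  # assumption: default to text
    value_key, coerce = _DATATYPE_TO_FILTER.get(data_type, ("valueText", str))
    return {"path": [key], "operator": operator, value_key: coerce(value)}


# Hand-written example schema; a real one would come from the Weaviate instance.
schema = {
    "class": "TestIndex",
    "properties": [
        {"name": "article_id", "dataType": ["text"]},
        {"name": "score", "dataType": ["number"]},
    ],
}
print(typed_operand(schema, "article_id", "1234"))
print(typed_operand(schema, "score", 3))
```

Because the value key comes from the property's declared type, a digits-only string against a text property stays `valueText`, and a Python int against a number property is sent as `valueNumber`.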
I am also facing the same issue with llamaindex-raptor pack.
`vector_store = WeaviateVectorStore(weaviate_client=vdb_client, index_name="RaptorIndex", text_key="text")
retriever = RaptorRetriever(
    [],
    embed_model=embed_model,  # used for embedding clusters
    llm=llm_model,  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
)
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm_model)
response = query_engine.query("What baselines was RAPTOR compared against?")`
Error:
{'data': {'Get': {'RaptorIndex': None}}, 'errors': [{'locations': [{'column': 6, 'line': 1}], 'message': 'invalid \'where\' filter: data type filter cannot use "valueInt" on type "number", use "valueNumber" instead', 'path': ['Get', 'RaptorIndex']}]}
Any issue with the above code?
Packages used:
llama-index-vector-stores-weaviate = "^0.1.4"
llama-index-packs-raptor = "^0.1.3"
llama-index-llms-ollama = "^0.1.2"
llama-index-embeddings-ollama = "^0.1.2"
umap-learn = "^0.5.6"
I ran into a similar problem and have implemented a workaround for the issue described above. It is only a temporary solution and needs more work. The following files were changed:
- llama_index/core/vector_stores/__init__.py
from llama_index.core.vector_stores.simple import SimpleVectorStore
from llama_index.core.vector_stores.types import (
    ExactMatchFilter,
    FilterCondition,
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
    MetadataInfo,
    VectorStoreQuery,
    VectorStoreQueryResult,
    VectorStoreInfo,
    ValueDataType,
)

__all__ = [
    "VectorStoreQuery",
    "VectorStoreQueryResult",
    "MetadataFilters",
    "MetadataFilter",
    "MetadataInfo",
    "ExactMatchFilter",
    "FilterCondition",
    "FilterOperator",
    "SimpleVectorStore",
    "VectorStoreInfo",
    "ValueDataType",
]
- llama_index/vector_stores/weaviate/base.py
def _to_weaviate_filter(standard_filters: MetadataFilters) -> Dict[str, Any]:
    filters_list = []
    condition = standard_filters.condition or "and"
    condition = _transform_weaviate_filter_condition(condition)
    if standard_filters.filters:
        for filter in standard_filters.filters:
            print(filter)
            value_type = "valueText"
            if filter.value_type.casefold() == "string":
                value_type = "valueText"
            elif isinstance(filter.value, float):
                value_type = "valueNumber"
            elif isinstance(filter.value, int):
                value_type = "valueInt"
            elif isinstance(filter.value, str) and filter.value.isnumeric():
                filter.value = float(filter.value)
                value_type = "valueNumber"
            filters_list.append(
                {
                    "path": filter.key,
                    "operator": _transform_weaviate_filter_operator(filter.operator),
                    value_type: filter.value,
                }
            )
    else:
        return {}
    if len(filters_list) == 1:
        # If there is only one filter, return it directly
        return filters_list[0]
    return {"operands": filters_list, "operator": condition}
- llama_index/core/vector_stores/types.py
class ValueDataType(str, Enum):
    """Value Data Type Class."""

    STRING = "STRING"  # Text data
    INTEGER = "INT"  # Integer data
    FLOAT = "FLOAT"  # Floating-point data
    BOOLEAN = "BOOLEAN"  # True or False data
    DATETIME = "DATETIME"  # Date and time data
    LOCATION = "LOCATION"  # Geographical location data
    # Add more data types as needed

    def __str__(self):
        """Returns the string representation of the data type."""
        return self.value


class MetadataFilter(BaseModel):
    """Comprehensive metadata filter for vector stores to support more operators.

    Value uses Strict* types, as int, float and str are compatible types and were all
    converted to string before.

    See: https://docs.pydantic.dev/latest/usage/types/#strict-types
    """

    key: str
    value: Union[
        StrictInt,
        StrictFloat,
        StrictStr,
        List[Union[StrictInt, StrictFloat, StrictStr]],
    ]
    value_type: ValueDataType = ValueDataType.STRING
    operator: FilterOperator = FilterOperator.EQ

    @classmethod
    def from_dict(
        cls,
        filter_dict: Dict,
    ) -> "MetadataFilter":
        """Create MetadataFilter from dictionary.

        Args:
            filter_dict: Dict with key, value and operator.
        """
        return MetadataFilter.parse_obj(filter_dict)
Usage:
MetadataFilters(
    filters=[
        MetadataFilter(
            key="field_name",
            value=field_value,
            value_type=ValueDataType.STRING,
        )
        for field_value in fields
    ],
    condition=FilterCondition.AND,
)
I hope this helps! Any suggestions or feedback are appreciated