Binder Exception when reading ARRAY type from Parquet (for embeddings)
I'm pushing typecasting to the limit here 😅. I'm basically trying to ensure I have fine-grained control over each column's data type all the way from Python to Parquet and then into Kùzu. My aim is to replicate a typical Python workflow that would be used in similarity search.
Dependencies
I'll be using sentence-transformers to generate embeddings from raw text, as is common in many real-world scenarios.
pip install pyarrow polars kuzu sentence-transformers
Code
I first write out the vectors computed by an embedding model (from sentence-transformers) alongside the raw data to Parquet, so that I can bulk-import the data to Kùzu (computing vectors/embeddings via a model is expensive, so this would be pre-computed in a real scenario).
Note that I explicitly typecast the integers from Python (by default INT64) to UINT64 so that I can have unsigned integers in Kùzu per the graph schema below.
import os
import shutil

import kuzu


def create_db(conn):
    conn.execute(
        """
        CREATE NODE TABLE Person(
            id UINT64,
            name STRING,
            age UINT8,
            PRIMARY KEY (id)
        )
        """
    )
    conn.execute(
        """
        CREATE NODE TABLE Item(
            id UINT64,
            name STRING,
            vector DOUBLE[384],
            PRIMARY KEY (id)
        )
        """
    )
    conn.execute(
        """
        CREATE REL TABLE Purchased(
            FROM Person
            TO Item
        )
        """
    )


def write_data_to_parquet():
    import warnings

    import polars as pl
    from sentence_transformers import SentenceTransformer

    warnings.filterwarnings("ignore")
    model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

    persons = [
        {"id": 1, "name": "Karissa", "age": 25},
        {"id": 2, "name": "Zhang", "age": 29},
        {"id": 3, "name": "Noura", "age": 31},
    ]
    items = [
        {"id": 1, "name": "espresso machine", "vector": list(model.encode("espresso machine"))},
        {"id": 2, "name": "yoga mat", "vector": list(model.encode("yoga mat"))},
    ]
    purchased = [
        {"from": 1, "to": 1},
        {"from": 1, "to": 2},
        {"from": 2, "to": 1},
        {"from": 3, "to": 2},
    ]
    # Carefully typecast in Polars prior to exporting to Parquet so we can have unsigned integers in Kùzu
    df_persons = pl.DataFrame(persons).with_columns(
        pl.col("id").cast(pl.UInt64),
        pl.col("age").cast(pl.UInt8),
    )
    # Ensure that the `ARRAY` data type is output for the `vector` column prior to exporting to Parquet
    df_items = pl.DataFrame(items).with_columns(
        pl.col("id").cast(pl.UInt64),
        pl.col("vector").cast(pl.Array(pl.Float64, width=384)),
    )
    df_purchased = pl.DataFrame(purchased).with_columns(
        pl.col("from").cast(pl.UInt64),
        pl.col("to").cast(pl.UInt64),
    )
    print(df_persons)
    print(df_items)
    df_persons.write_parquet("persons.parquet")
    df_items.write_parquet("items.parquet")
    df_purchased.write_parquet("purchased.parquet")


def build_graph(conn):
    conn.execute(
        """
        COPY Person FROM 'persons.parquet';
        COPY Item FROM 'items.parquet';
        COPY Purchased FROM 'purchased.parquet';
        """
    )
    print("Finished importing nodes and rels")


if __name__ == "__main__":
    if os.path.exists("./vdb"):
        shutil.rmtree("./vdb")
    # Create database
    db = kuzu.Database("./vdb")
    conn = kuzu.Connection(db)
    create_db(conn)
    write_data_to_parquet()
    # Load data from parquet to graph
    build_graph(conn)
Error
Running the above code gives the following error:
Traceback (most recent call last):
  File "/Users/prrao/code/kuzu-debug/load_graph_similarity.py", line 107, in <module>
    build_graph(conn)
  File "/Users/prrao/code/kuzu-debug/load_graph_similarity.py", line 85, in build_graph
    conn.execute(
  File "/Users/prrao/code/kuzu-debug/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 144, in execute
    raise RuntimeError(_query_result.getErrorMessage())
RuntimeError: Binder exception: Column `vector` type mismatch. Expected DOUBLE[384] but got DOUBLE[].
Workaround
The error disappears when I change the schema to specify the vector column as a LIST, i.e., by stating vector DOUBLE[] in the schema.
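For reference, the workaround schema differs from the original only in the vector column's declaration:

```cypher
CREATE NODE TABLE Item(
    id UINT64,
    name STRING,
    vector DOUBLE[],  // LIST instead of the fixed-size ARRAY (DOUBLE[384])
    PRIMARY KEY (id)
)
```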
Desired behaviour
If the node table stores ARRAY values (matching the fields imported from Parquet), we could easily run array functions for similarity search by passing a plain Python list, since the similarity search functions require at least one of the two arguments to be of type ARRAY in order to perform implicit casting. This would make the downstream Cypher query that performs similarity search less verbose and cleaner to write.
Currently, I have to write the following query for it to work (it requires explicit casting, and the user has to know a lot more about the correct syntax):
res = conn.execute(
    """
    MATCH (i:Item)
    WITH i, CAST($query_vector, "DOUBLE[384]") AS query_vector
    RETURN i.name as name, array_cosine_similarity(i.vector, query_vector) AS similarity
    ORDER BY similarity DESC
    """,
    parameters={"query_vector": query_vector},
)
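For context, query_vector here is just a Python list of floats produced by the same embedding model. The ranking this query returns can be cross-checked outside the database with plain NumPy, since array_cosine_similarity is ordinary cosine similarity. A minimal sketch with toy 3-dimensional vectors (the item vectors below are made up for illustration, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the L2 norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the stored item vectors (the real ones are 384-dimensional)
items = {
    "espresso machine": np.array([0.9, 0.1, 0.0]),
    "yoga mat": np.array([0.1, 0.8, 0.2]),
}
query_vector = np.array([0.8, 0.2, 0.1])

# Rank item names by similarity to the query, descending,
# mirroring the ORDER BY similarity DESC in the Cypher query
ranked = sorted(items, key=lambda name: cosine_similarity(items[name], query_vector), reverse=True)
print(ranked)
```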
What I want to be able to write is the simpler query below, which doesn't require explicit casting by the user:
res = conn.execute(
    """
    MATCH (i:Item)
    RETURN i.name as name, array_cosine_similarity(i.vector, $query_vector) AS similarity
    ORDER BY similarity DESC
    """,
    parameters={"query_vector": query_vector},
)