Reading parquet in Python has some rough edges, making it hard to diagnose problems
I've upgraded to the latest version, 0.0.9, and there are a few rough edges with reading parquet from Python.
Problem scenario
Suppose I have a parquet file with two columns, id and name, of type INT64 and STRING respectively.
During development, sometimes we may specify the wrong data type during table creation, for example in this case:
def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id STRING,
                name STRING
            )
        """
    )
In this case, the id column should have been declared as INT64 but was declared as STRING. Kùzu just segfaults, which makes it very hard for the Python developer to know what they did wrong. This is not a Kùzu bug, and the user could easily have fixed it given a better error message.
[1] 5208 segmentation fault python test.py
In the reverse situation, the error is even more confusing. Say I have a parquet file with the same two columns, id and name, but this time the id column is also of type STRING, like the name column.
def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id INT64,
                name STRING
            )
        """
    )
This time, if the user mistakenly declares the id column as INT64 in the node table (when it should be STRING), it doesn't segfault but instead raises this cryptic error:
  File "/code/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 88, in execute
    self._connection.execute(
RuntimeError: ColumnChunk::templateCopyStringArrowArray
What should be a 2-second fix to the data type now becomes a tedious exercise for the Python developer to go back and inspect the data and learn (after some deliberation) that the data type specified in Python was wrong.
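Until the error messages improve, a pre-flight check on the user side can catch both mismatches described above before anything is loaded. The following is only a rough sketch, not part of Kùzu: the helper names (read_parquet_types, find_type_mismatches) and the ARROW_TO_KUZU mapping are hypothetical and cover just a handful of types, and reading the file schema assumes pyarrow is installed.

```python
# Hypothetical pre-flight check; the helper names and the type mapping
# below are illustrative assumptions, not part of the Kùzu API.
ARROW_TO_KUZU = {
    "int64": "INT64",
    "double": "DOUBLE",
    "string": "STRING",
    "large_string": "STRING",
}


def read_parquet_types(path: str) -> dict[str, str]:
    """Return {column name: Arrow type name} from the Parquet schema."""
    # Imported lazily so the comparison logic below has no dependencies.
    import pyarrow.parquet as pq

    schema = pq.read_schema(path)
    return dict(zip(schema.names, (str(t) for t in schema.types)))


def find_type_mismatches(
    parquet_types: dict[str, str], declared_types: dict[str, str]
) -> list[str]:
    """Compare declared node table types against the Parquet column types."""
    problems = []
    for column, declared in declared_types.items():
        arrow_type = parquet_types.get(column)
        if arrow_type is None:
            problems.append(f"column {column!r} is not in the parquet file")
            continue
        expected = ARROW_TO_KUZU.get(arrow_type)
        if expected is None:
            problems.append(
                f"column {column!r}: no mapping known for parquet type {arrow_type}"
            )
        elif expected != declared.upper():
            problems.append(
                f"column {column!r}: declared {declared}, "
                f"but the parquet file has {arrow_type} (expected {expected})"
            )
    return problems
```

For the first scenario, find_type_mismatches({"id": "int64", "name": "string"}, {"id": "STRING", "name": "STRING"}) returns a single message pointing at the id column, which is the kind of hint the segfault does not give.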
Developer experience focus
These sorts of issues are fairly generic. Adding these edge cases to the test suite could help avoid them, but it's impossible for the Kùzu team to anticipate every way users may break the system.
In this case, mismatched data types (or the user simply making a mistake) are a common enough scenario that the test suite can be expanded, but the longer-term problem of propagating C++ errors to Python is still something to ponder. What should be a very simple fix can cost minutes of unnecessary developer effort when building even the simplest graphs. This is quite a big source of user frustration in Python, and it's natural to expect a large portion of future Kùzu users to come from Python, so this is a serious enough issue to discuss. I think it really makes sense to deep-dive into how to present better error messages to Python users. cc @andyfengHKU @semihsalihoglu-uw