kuzu Reading parquet in Python has some rough edges, making it hard to diagnose problems

Open prrao87 opened this issue 1 year ago • 5 comments

I've upgraded to the latest version, 0.0.9, and there are a few rough edges when reading Parquet files from Python.

Problem scenario

Suppose I have a Parquet file with two columns, id and name, of type INT64 and STRING respectively.

During development, it's easy to specify the wrong data type at table creation, as in this case:

from kuzu import Connection


def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id STRING,
                name STRING
            )
        """
    )

Here, the id column should be INT64 but was declared as STRING. Reading the Parquet file into this table just segfaults, which makes it very hard for the Python developer to know what they did wrong. (It's not a Kùzu bug, and the user could easily have fixed it given a better error message.)

[1]    5208 segmentation fault  python test.py

In the reverse situation, the error is even more confusing. Say I have a Parquet file with the same two columns, id and name, but this time the id column is of type STRING, like the name column.

def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id INT64,
                name STRING
            )
        """
    )

This time, if the user mistakenly declares the id column as INT64 in the node table (when it should be STRING), it doesn't segfault but instead gives this cryptic error:

  File "/code/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 88, in execute
    self._connection.execute(
RuntimeError: ColumnChunk::templateCopyStringArrowArray

What should be a 2-second fix to the data type becomes a tedious exercise: the Python developer has to go back, inspect the data, and work out (after some deliberation) that the data type specified in Python was wrong.
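As a stopgap on the user side, the declared column types can be compared against the Parquet schema before loading. The following is only a rough sketch, not anything Kùzu provides: `ARROW_TO_KUZU` and `find_type_mismatches` are hypothetical names, the type mapping is deliberately incomplete, and the Parquet types are passed in as a plain dict so the check itself has no dependencies (the comment shows one way to build that dict with pyarrow).

```python
# Hypothetical, incomplete mapping from Arrow type names to Kùzu type names.
# This is an assumption for illustration, not Kùzu's actual conversion table.
ARROW_TO_KUZU = {
    "int64": "INT64",
    "double": "DOUBLE",
    "string": "STRING",
}


def find_type_mismatches(
    declared: dict[str, str], parquet_types: dict[str, str]
) -> list[str]:
    """Return human-readable problems; an empty list means the types line up.

    declared:      column name -> type used in CREATE NODE TABLE
    parquet_types: column name -> Arrow type name from the Parquet file, e.g.
        import pyarrow.parquet as pq
        schema = pq.read_schema("data.parquet")
        parquet_types = {field.name: str(field.type) for field in schema}
    """
    problems = []
    for column, arrow_type in parquet_types.items():
        table_type = declared.get(column)
        expected = ARROW_TO_KUZU.get(arrow_type)
        if table_type is None:
            problems.append(f"column {column!r} is missing from the table definition")
        elif expected is None:
            problems.append(f"column {column!r}: no known mapping for {arrow_type!r}")
        elif expected != table_type.upper():
            problems.append(
                f"column {column!r}: table declares {table_type}, "
                f"but the Parquet file has {arrow_type} (expected {expected})"
            )
    return problems


# The first scenario above: id is INT64 in the file but declared as STRING.
for problem in find_type_mismatches(
    {"id": "STRING", "name": "STRING"},
    {"id": "int64", "name": "string"},
):
    print(problem)
```

Running a check like this before loading would turn both the segfault and the `ColumnChunk` error into a one-line message that names the offending column.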

Developer experience focus

These sorts of issues are fairly generic. Adding these edge cases to the test suite could help avoid them, but it's impossible for the Kùzu team to anticipate every way users may break the system.

In this case, a mismatched data type (or the user simply making a mistake) is a common enough scenario that the test suite can be expanded to cover it, but the longer-term problem of propagating C++ errors to Python still needs some thought. What should be a very simple fix can cost minutes of unnecessary developer effort when building even the simplest graphs. This is quite a big source of user frustration in Python, and it's natural to expect a large portion of future Kùzu users to come from Python, so this is a serious enough issue to discuss. I think it really makes sense to deep-dive into how to present better error messages to Python users. cc @andyfengHKU @semihsalihoglu-uw
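To make the discussion concrete, here is a minimal sketch of the kind of context a Python-level error could carry. It is not Kùzu's API: `QueryError` and `execute_with_context` are hypothetical names, and the only assumption about the connection is that it has an `execute` method that raises `RuntimeError`, as in the traceback above. A wrapper like this cannot help with the segfault case, since the process dies before Python can catch anything.

```python
class QueryError(RuntimeError):
    """Hypothetical error type that adds the query and a hint to the message."""


def execute_with_context(conn, query: str, hint: str = ""):
    """Run conn.execute(query), re-raising RuntimeError with extra context.

    `conn` is anything with an execute(query) method; a kuzu Connection is
    assumed to fit, based on the traceback in this issue.
    """
    try:
        return conn.execute(query)
    except RuntimeError as exc:
        message = f"Query failed: {exc}\nQuery: {query.strip()}"
        if hint:
            message += f"\nHint: {hint}"
        # Chain the original exception so the low-level detail is not lost.
        raise QueryError(message) from exc


# Example call (not executed here; `conn` would be an open connection):
# execute_with_context(
#     conn,
#     'COPY Data FROM "data.parquet"',
#     hint="check that the table's column types match the Parquet schema",
# )
```

The more useful fix is probably on the C++ side, where the column name and the expected and actual types are known at the point of failure; the wrapper only illustrates what a friendlier message might contain.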

Oct 03 '23 21:10 prrao87
