ibis
bug: Errors with NULL columns
What happened?
When a column has only nulls, I can't create a DuckDB table or pyarrow table from it.
Here are some reproducible examples:
import ibis
con = ibis.duckdb.connect()
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
t = ibis.memtable(data)
con.create_table("test", t)
Result: ParserException: Parser Error: syntax error at or near "NULL"
t.execute() does work in the above example.
import ibis
con = ibis.duckdb.connect()
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
ibis.memtable(data).to_pyarrow()
Result: ArrowNotImplementedError: Unsupported cast from int32 to null using function cast_null
import ibis
import pyarrow as pa
con = ibis.duckdb.connect()
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
array = pa.array(data)
pa_table = pa.Table.from_struct_array(array)
con.create_table("test", pa_table)
Result: ParserException: Parser Error: syntax error at or near "NULL"
I am guessing the problem is that PyArrow supports columns with a null datatype, which most databases probably don't?
DuckDB apparently converts the null datatype into int32:
import duckdb
duckdb.from_arrow(pa_table)
┌───────┬───────┐
│ col1  │ col2  │
│ int64 │ int32 │
├───────┼───────┤
│     1 │  NULL │
│     4 │  NULL │
└───────┴───────┘
I have no idea what the best way to handle this is, maybe raising an exception asking the user to specify a schema when a NULL column exists?
A job I run daily suddenly started giving the first error. With the error I got, it took some experimenting to figure out that it was actually caused by this issue. A column in the source data (from some API) that usually has strings and nulls now had only nulls.
What version of ibis are you using?
9.2.0
What backend(s) are you using, if any?
DuckDB
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Based on
A job I run daily suddenly started giving the first error. With the error I got, it took some experimenting to figure out that it was actually caused by this issue. A column in the source data (from some API) that usually has strings and nulls now had only nulls.
it sounds like you know the type of the column: string.
The best thing to do then is to pass that information to ibis.memtable, like so:
In [1]: import ibis
...:
...: con = ibis.duckdb.connect()
...:
...: data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
...:
...: t = ibis.memtable(data, schema={"col1": "int", "col2": "str"})
...: con.create_table("test", t)
Out[1]:
DatabaseTable: memory.main.test
col1 int64
col2 string
In [2]: from ibis.interactive import *
In [3]: con.tables.test
Out[3]:
┏━━━━━━━┳━━━━━━━━┓
┃ col1  ┃ col2   ┃
┡━━━━━━━╇━━━━━━━━┩
│ int64 │ string │
├───────┼────────┤
│     1 │ NULL   │
│     4 │ NULL   │
└───────┴────────┘
Thanks! Yeah, once I knew the cause, solving the immediate problem was not difficult, but I still thought I'd raise this issue since the current behavior might not be ideal.
@Riezebos Any interest in submitting a PR?
Adding this to the memtable docstring somewhere would be the place to start!
I'd be interested, but then I'd rather fix it in a way that avoids the errors I ran into. Neither error makes it clear that the root cause is a column full of nulls. The current handling of null columns with duckdb/sqlglot seems a little broken; I'll explain why I think so below.
memtable to pyarrow (via duckdb)
When using .to_pyarrow() on the memtable, it is first converted to a DuckDB relation using _to_duckdb_relation. DuckDB automatically converts null columns to int32. Then PyArrowData.convert_table is used on the result, which raises:
ArrowNotImplementedError: Unsupported cast from int32 to null using function cast_null
The expression's schema (self.schema()) still records the column's datatype as null, but it is actually int32 because DuckDB changed it.
This can be checked using:
import ibis
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
t = ibis.memtable(data)
con = ibis.duckdb.connect()
print(con._to_duckdb_relation(t).arrow())
Output:
pyarrow.Table
col1: int64
col2: int32
----
col1: [[1,4]]
col2: [[null,null]]
Relevant lines:
- https://github.com/ibis-project/ibis/blob/main/ibis/expr/types/relations.py#L202
- https://github.com/ibis-project/ibis/blob/main/ibis/backends/duckdb/__init__.py#L1395
Potential solutions for this would be:
- Cast the column datatype to null after it has been converted from DuckDB to arrow (before comparing the pa.Table schema to the desired_schema)
- Don't try to cast the table to desired_schema in cases like this (maybe with a warning that the null column has been cast to int32 by DuckDB)
create_table in duckdb from memtable
I'm pretty sure this just fails because sqlglot writes a query that tries to create a table with a column type of NULL, which DuckDB doesn't support. When building the column definitions, self.compiler.type_mapper.from_ibis(typ) returns DataType(this=Type.NULL).
This can be tested by running this after the example above:
con.compiler.type_mapper.from_ibis(t.schema().types[1])
Relevant lines:
- https://github.com/ibis-project/ibis/blob/main/ibis/backends/duckdb/__init__.py#L206
Potential solutions for this would be:
- Raise a warning/exception when create_table is called with a table that has a null column in its schema (it could tell the user to pass a schema with a different datatype as an argument)
- Cast the column to int32 like DuckDB does (with a warning)
- Add an optional argument to create_table, e.g. cast_null_columns_to: ibis.expr.datatypes.DataType = None, to which a datatype can be passed to convert null columns to
I don't know whether null columns are something ibis wants to handle fully, but the null datatype does exist: it works in pyarrow and somewhat works in duckdb. Without solving the above problems, the combination of ibis + DuckDB + null columns is a bit broken in my opinion.
I personally think solution 1 is the best in both cases. If you agree with the problems and the solutions, I'd be happy to start working on a PR.