ibis
bug: Errors with NULL columns
What happened?
When a column has only nulls, I can't create a DuckDB table or pyarrow table from it.
Here are some reproducible examples:
import ibis
con = ibis.duckdb.connect()
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
t = ibis.memtable(data)
con.create_table("test", t)
Result: ParserException: Parser Error: syntax error at or near "NULL"
t.execute() does work in the above example.
import ibis
con = ibis.duckdb.connect()
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
ibis.memtable(data).to_pyarrow()
Result: ArrowNotImplementedError: Unsupported cast from int32 to null using function cast_null
import ibis
import pyarrow as pa
con = ibis.duckdb.connect()
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
array = pa.array(data)
pa_table = pa.Table.from_struct_array(array)
con.create_table("test", pa_table)
Result: ParserException: Parser Error: syntax error at or near "NULL"
I am guessing the problem is that PyArrow supports columns with a null datatype, which most databases probably don't?
DuckDB apparently converts the null datatype into int32:
import duckdb
duckdb.from_arrow(pa_table)
┌───────┬───────┐
│ col1  │ col2  │
│ int64 │ int32 │
├───────┼───────┤
│     1 │  NULL │
│     4 │  NULL │
└───────┴───────┘
I have no idea what the best way to handle this is, maybe raising an exception asking the user to specify a schema when a NULL column exists?
A job I run daily suddenly started giving the first error. With the error I got, it took some experimenting to figure out that it was actually caused by this issue. A column in the source data (from some API) that usually has strings and nulls now had only nulls.
What version of ibis are you using?
9.2.0
What backend(s) are you using, if any?
DuckDB
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Based on
A job I run daily suddenly started giving the first error. With the error I got, it took some experimenting to figure out that it was actually caused by this issue. A column in the source data (from some API) that usually has strings and nulls now had only nulls.
it sounds like you know the type of the column: string.
The best thing to do then is to pass that information to ibis.memtable, like so:
In [1]: import ibis
...:
...: con = ibis.duckdb.connect()
...:
...: data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
...:
...: t = ibis.memtable(data, schema={"col1": "int", "col2": "str"})
...: con.create_table("test", t)
Out[1]:
DatabaseTable: memory.main.test
col1 int64
col2 string
In [2]: from ibis.interactive import *
In [3]: con.tables.test
Out[3]:
┏━━━━━━━┳━━━━━━━━┓
┃ col1  ┃ col2   ┃
┡━━━━━━━╇━━━━━━━━┩
│ int64 │ string │
├───────┼────────┤
│     1 │ NULL   │
│     4 │ NULL   │
└───────┴────────┘
Thanks! Yeah, once I knew the cause, solving the immediate problem was not difficult, but I still thought I'd raise this issue since the current behavior might not be ideal.
@Riezebos Any interest in submitting a PR?
Adding this to the memtable docstring somewhere would be the place to start!
I'd be interested, but then I'd rather fix it in a way that avoids the errors I ran into. Neither error makes it clear that the root cause is a column full of nulls. The current handling of null columns with duckdb/sqlglot seems a little broken; I'll explain why I think so below.
memtable to pyarrow (via duckdb)
When using .to_pyarrow() on the memtable, it is first converted to a DuckDB relation using _to_duckdb_relation. DuckDB automatically converts null columns to int32. Then PyArrowData.convert_table is used on the result, which raises:
ArrowNotImplementedError: Unsupported cast from int32 to null using function cast_null
The expression's schema (self.schema()) still records the column's datatype as null, but it is actually int32 because DuckDB changed it.
This can be checked using:
import ibis
data = [{"col1": 1, "col2": None}, {"col1": 4, "col2": None}]
t = ibis.memtable(data)
con = ibis.duckdb.connect()
print(con._to_duckdb_relation(t).arrow())
Output:
pyarrow.Table
col1: int64
col2: int32
----
col1: [[1,4]]
col2: [[null,null]]
Relevant lines:
- https://github.com/ibis-project/ibis/blob/main/ibis/expr/types/relations.py#L202
- https://github.com/ibis-project/ibis/blob/main/ibis/backends/duckdb/__init__.py#L1395
Potential solutions for this would be:
- Cast the column datatype to null after it has been converted from DuckDB to arrow (before comparing the pa.Table schema to the desired_schema)
- Don't try to cast the table to desired_schema in cases like this (maybe with a warning that the null column has been cast to int32 by DuckDB)
create_table in duckdb from memtable
I'm pretty sure this just fails because sqlglot writes a query that tries to create a table with a column type of NULL, which DuckDB doesn't support. When building the column definitions, self.compiler.type_mapper.from_ibis(typ) returns DataType(this=Type.NULL).
This can be tested by running this after the example above:
con.compiler.type_mapper.from_ibis(t.schema().types[1])
Relevant lines:
- https://github.com/ibis-project/ibis/blob/main/ibis/backends/duckdb/__init__.py#L206
Potential solutions for this would be:
- Raise a warning/exception when create_table is called with a table that has a null column in its schema (it could tell the user to pass a schema with a different datatype as an argument)
- Cast the column to int32 like DuckDB does (with a warning)
- Add an optional argument to create_table, e.g. cast_null_columns_to: ibis.expr.datatypes.DataType = None, to which a datatype can be passed to convert null columns to
I don't know whether null columns are something ibis wants to handle fully, but the null datatype does exist: it works in pyarrow and somewhat works in duckdb. Without solving the above problems, the combination of ibis + DuckDB + null columns is a bit broken in my opinion.
I personally think solution 1 is the best in both cases. If you agree with the problems and the solutions, I'd be happy to start working on a PR.