ibis
bug: `schema.apply_to` should recurse for nested types
Right now we enforce that the output of `expr.execute()` matches `expr.schema()` through `Schema.apply_to`. This is nice since it ensures the result types are the same across backends.
However, `apply_to` currently doesn't recurse into nested types (e.g. arrays, structs).
One consequence is that, depending on the backend, a nested array value may come back as a Python list or as an `np.ndarray`:
postgres
In [1]: import ibis
In [2]: con = ibis.connect("postgres://postgres:postgres@localhost:5432")
In [3]: sql = """
...: CREATE TABLE test (x REAL[][]);
...: INSERT INTO test VALUES (ARRAY[ARRAY[1, 2], ARRAY[3, 4]]), (ARRAY[ARRAY[4, 5]]);
...: """
In [4]: con.raw_sql(sql)
Out[4]: <sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f050403fac0>
In [5]: df = con.table("test").execute()
In [6]: df
Out[6]:
x
0 [[1.0, 2.0], [3.0, 4.0]]
1 [[4.0, 5.0]]
In [7]: df.x.iloc[0] # it's a list of lists
Out[7]: [[1.0, 2.0], [3.0, 4.0]]
duckdb
In [8]: con = ibis.connect("duckdb://")
In [9]: con.raw_sql(sql)
Out[9]: <sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f050403f610>
In [10]: df = con.table("test").execute()
In [11]: df
Out[11]:
x
0 [[1.0, 2.0], [3.0, 4.0]]
1 [[4.0, 5.0]]
In [12]: df.x.iloc[0] # it's a list of numpy arrays
Out[12]: [array([1., 2.], dtype=float32), array([3., 4.], dtype=float32)]
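To make "recursing into nested types" concrete, here is a minimal sketch of the kind of normalization a recursive `apply_to` could perform on each value. It is not ibis code: `to_nested_list` is a hypothetical helper, and it assumes that plain Python lists (and dicts for struct-like values) are the representation we want to settle on.

```python
import numpy as np


def to_nested_list(value):
    """Recursively convert arrays, tuples and lists into plain Python lists.

    Hypothetical helper for illustration only; not part of ibis.
    """
    if isinstance(value, np.ndarray):
        # tolist() fully converts numeric arrays, but an object array may
        # still hold inner arrays, so recurse over the result.
        return [to_nested_list(item) for item in value.tolist()]
    if isinstance(value, (list, tuple)):
        return [to_nested_list(item) for item in value]
    if isinstance(value, dict):
        # Struct-like values: keep the keys, normalize the fields.
        return {key: to_nested_list(item) for key, item in value.items()}
    if isinstance(value, np.generic):
        # NumPy scalar -> Python scalar.
        return value.item()
    return value


# Shapes mirroring the two outputs above.
duckdb_like = [
    np.array([1.0, 2.0], dtype=np.float32),
    np.array([3.0, 4.0], dtype=np.float32),
]
postgres_like = [[1.0, 2.0], [3.0, 4.0]]
normalized = to_nested_list(duckdb_like)
```

After normalization, the duckdb-style value compares equal to the postgres-style one and every level is a plain `list`. Applying something like this per element would cost a pass over the data, which is the copying concern raised for the arrow-based backends.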
I wonder if we should address this on a per-backend basis to avoid unnecessary data movement and copying.
Would an additional method for the arrow-based backends be sufficient? Something like
def to_pandas(table: pa.Table, schema: Schema) -> pd.DataFrame:
    ...
which could be more efficient than doing schema.apply_to(table.to_pandas())
SGTM! Since we're about to release 3.2, I'll mark this for 4.0
This seems to have been fixed; I can no longer reproduce the difference between these two backends.