ibis
feat(datafusion): add ArrayDistinct operation
Description of changes
Adds support for the ArrayDistinct operation on the DataFusion backend.
https://datafusion.apache.org/user-guide/expressions.html#array-expressions
I ran into an issue where, if a NaN is present, the number of rows returned differs from the number expected. DataFusion raises the following:
Exception: Internal error: UDF returned a different number of rows than expected. Expected: 6, Got: 5.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
I hope I marked this correctly in the test; this may require an upstream issue.
In [1]: from ibis.interactive import *
In [2]: t = ibis.memtable({"a": [[1, 3, 3], [], [42, 42], [], [None], None]})
In [3]: con = ibis.connect("datafusion://")
In [4]: expr = t.select("a", uniqued=_.a.unique())
In [5]: con.execute(expr.filter(~_.a.isnull()))
Out[5]:
                 a     uniqued
0  [1.0, 3.0, 3.0]  [1.0, 3.0]
1               []          []
2     [42.0, 42.0]      [42.0]
3               []          []
4            [nan]       [nan]
In [6]: con.execute(expr)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[6], line 1
----> 1 con.execute(expr)
File ~/ibis/ibis/backends/datafusion/__init__.py:565, in Backend.execute(self, expr, **kwargs)
562 def execute(self, expr: ir.Expr, **kwargs: Any):
563 batch_reader = self.to_pyarrow_batches(expr, **kwargs)
564 return expr.__pandas_result__(
--> 565 batch_reader.read_pandas(timestamp_as_object=True)
566 )
File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/ipc.pxi:617, in pyarrow.lib._ReadPandasMixin.read_pandas()
File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/ipc.pxi:762, in pyarrow.lib.RecordBatchReader.read_all()
File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/error.pxi:89, in pyarrow.lib.check_status()
File ~/ibis/ibis/backends/datafusion/__init__.py:542, in Backend.to_pyarrow_batches.<locals>.make_gen()
541 def make_gen():
--> 542 yield from (
543 # convert the renamed + casted columns into a record batch
544 pa.RecordBatch.from_struct_array(
545 # rename columns to match schema because datafusion lowercases things
546 pa.RecordBatch.from_arrays(batch.to_pyarrow().columns, names=names)
547 # cast the struct array to the desired types to work around
548 # https://github.com/apache/arrow-datafusion-python/issues/534
549 .to_struct_array()
550 .cast(struct_schema, safe=False)
551 )
552 for batch in frame.execute_stream()
553 )
File ~/ibis/ibis/backends/datafusion/__init__.py:552, in <genexpr>(.0)
541 def make_gen():
542 yield from (
543 # convert the renamed + casted columns into a record batch
544 pa.RecordBatch.from_struct_array(
545 # rename columns to match schema because datafusion lowercases things
546 pa.RecordBatch.from_arrays(batch.to_pyarrow().columns, names=names)
547 # cast the struct array to the desired types to work around
548 # https://github.com/apache/arrow-datafusion-python/issues/534
549 .to_struct_array()
550 .cast(struct_schema, safe=False)
551 )
--> 552 for batch in frame.execute_stream()
553 )
File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/datafusion/record_batch.py:71, in RecordBatchStream.__next__(self)
69 def __next__(self) -> RecordBatch:
70 """Iterator function."""
---> 71 next_batch = next(self.rbs)
72 return RecordBatch(next_batch)
Exception: Internal error: UDF returned a different number of rows than expected. Expected: 6, Got: 5.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
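For reference, here is a plain-Python sketch of the result I would expect from `con.execute(expr)` in the session above: one output row per input row, with the null row passed through as null rather than dropped. The helper name `array_distinct` is hypothetical and only stands in for the backend operation, and keeping first-occurrence order is an assumption based on the output shown above, not something I have confirmed in DataFusion.

```python
from typing import Optional


def array_distinct(values: Optional[list]) -> Optional[list]:
    """Hypothetical reference for the expected semantics, not the backend code."""
    # A null array stays null, so the row count is unchanged.
    if values is None:
        return None
    seen = set()
    out = []
    for v in values:
        # Keep only the first occurrence of each element (assumed ordering).
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


# Same data as the memtable in the reproduction.
rows = [[1, 3, 3], [], [42, 42], [], [None], None]
expected = [array_distinct(r) for r in rows]

# Six rows in, six rows out; the failing query produced five.
print(len(expected), expected)
```

The point of the sketch is only the row count: the final `None` row should still be present in the output, which is what the "Expected: 6, Got: 5" error is complaining about.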