ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat(datafusion): add ArrayDistinct operation

Open IndexSeek opened this issue 1 year ago • 0 comments

Description of changes

Adds support for the ArrayDistinct operation on the DataFusion backend.

https://datafusion.apache.org/user-guide/expressions.html#array-expressions

I was running into an issue where if a nan was present, the row count being returned was different. It was raising the following:

Exception: Internal error: UDF returned a different number of rows than expected. Expected: 6, Got: 5.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

I hope I marked this correctly in the test; this may require an upstream issue.

In [1]: from ibis.interactive import *

In [2]: t = ibis.memtable({"a": [[1, 3, 3], [], [42, 42], [], [None], None]})

In [3]: con = ibis.connect("datafusion://")

In [4]: expr = t.select("a", uniqued=_.a.unique())

In [5]: con.execute(expr.filter(~_.a.isnull()))
Out[5]: 
                 a     uniqued
0  [1.0, 3.0, 3.0]  [1.0, 3.0]
1               []          []
2     [42.0, 42.0]      [42.0]
3               []          []
4            [nan]       [nan]

In [6]: con.execute(expr)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 con.execute(expr)

File ~/ibis/ibis/backends/datafusion/__init__.py:565, in Backend.execute(self, expr, **kwargs)
    562 def execute(self, expr: ir.Expr, **kwargs: Any):
    563     batch_reader = self.to_pyarrow_batches(expr, **kwargs)
    564     return expr.__pandas_result__(
--> 565         batch_reader.read_pandas(timestamp_as_object=True)
    566     )

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/ipc.pxi:617, in pyarrow.lib._ReadPandasMixin.read_pandas()

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/ipc.pxi:762, in pyarrow.lib.RecordBatchReader.read_all()

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/error.pxi:89, in pyarrow.lib.check_status()

File ~/ibis/ibis/backends/datafusion/__init__.py:542, in Backend.to_pyarrow_batches.<locals>.make_gen()
    541 def make_gen():
--> 542     yield from (
    543         # convert the renamed + casted columns into a record batch
    544         pa.RecordBatch.from_struct_array(
    545             # rename columns to match schema because datafusion lowercases things
    546             pa.RecordBatch.from_arrays(batch.to_pyarrow().columns, names=names)
    547             # cast the struct array to the desired types to work around
    548             # https://github.com/apache/arrow-datafusion-python/issues/534
    549             .to_struct_array()
    550             .cast(struct_schema, safe=False)
    551         )
    552         for batch in frame.execute_stream()
    553     )

File ~/ibis/ibis/backends/datafusion/__init__.py:552, in <genexpr>(.0)
    541 def make_gen():
    542     yield from (
    543         # convert the renamed + casted columns into a record batch
    544         pa.RecordBatch.from_struct_array(
    545             # rename columns to match schema because datafusion lowercases things
    546             pa.RecordBatch.from_arrays(batch.to_pyarrow().columns, names=names)
    547             # cast the struct array to the desired types to work around
    548             # https://github.com/apache/arrow-datafusion-python/issues/534
    549             .to_struct_array()
    550             .cast(struct_schema, safe=False)
    551         )
--> 552         for batch in frame.execute_stream()
    553     )

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/datafusion/record_batch.py:71, in RecordBatchStream.__next__(self)
     69 def __next__(self) -> RecordBatch:
     70     """Iterator function."""
---> 71     next_batch = next(self.rbs)
     72     return RecordBatch(next_batch)

Exception: Internal error: UDF returned a different number of rows than expected. Expected: 6, Got: 5.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

IndexSeek avatar Oct 19 '24 01:10 IndexSeek