feat: methods to execute and return pyarrow Table or iterator of RecordBatches
Currently users call expr.execute() to execute an ibis expression and get back a pandas.DataFrame.
For efficiency and ease of interoperation with other tools, it would be useful to add some alternative result representations:
- A pyarrow.Table
- An iterator of pyarrow.RecordBatch objects

The second would allow streaming back potentially large results without requiring that the results fit in memory. The first could be implemented in terms of the second using something like:
```python
import pyarrow as pa

def to_pyarrow(self):
    return pa.Table.from_batches(self.to_pyarrow_batches())
```
however, some backends may have a more efficient method of returning a Table than that (e.g. duckdb's fetch_record_batch).
Since some additional options may show up on these methods (perhaps a batch size for the record batches version), I think it would be best to implement these as new methods on Table rather than overloading the existing execute method. No strong thoughts on what the methods should be called, for now I'll suggest:
- to_pyarrow(self) -> pa.Table
- to_pyarrow_batches(self) -> Iterator[pa.RecordBatch]
For the sqlalchemy backends that don't have special methods there's going to be at least a minor annoyance: the values returned from a cursor look like tuples and dicts but are actually sqlalchemy.engine.row.LegacyRow and sqlalchemy.engine.row.RowMapping instances.
It seems like pyarrow really doesn't like these, e.g.
```
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]

In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())])

In [170]: pa.array(batch[:5], type=pa_schema)
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in <cell line: 1>()
----> 1 pa.array(batch[:5], type=pa_schema)
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, static_cast<int64_t>(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  converter->Extend(seq, size)
```
versus a "proper" tuple:
```
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]:
<pyarrow.lib.StructArray object at 0x7fd4fb52d660>
-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]
```
For a first cut we can call map(tuple, ...) so that something works, but I wonder if we can fix this on either the sqlalchemy or pyarrow side (or maybe there's an option in sqlalchemy that I'm missing that already allows for this).
Ideally we can change pyarrow to allow Sequences and Mappings through instead of being so strict about the specific type.
Upstream issue with pyarrow: https://issues.apache.org/jira/browse/ARROW-17582
Fixed by #4454. Closing.