feat: methods to execute and return pyarrow Table or iterator of RecordBatches
Currently users call expr.execute() to execute an ibis expression and get back a pandas.DataFrame.
For efficiency and ease of interoperation with other tools, it would be useful to add some alternative result representations:
- A pyarrow.Table
- An iterator of pyarrow.RecordBatch objects

The second would allow streaming back potentially large results without requiring that the results fit in memory. The first could be implemented in terms of the second using something like:
```python
import pyarrow as pa

def to_pyarrow(self):
    return pa.Table.from_batches(self.to_pyarrow_batches())
```
however, some backends may have a more efficient method of returning a Table than that (e.g. duckdb's fetch_record_batch).
Since some additional options may show up on these methods (perhaps a batch size for the record batches version), I think it would be best to implement these as new methods on Table rather than overloading the existing execute method. No strong thoughts on what the methods should be called, for now I'll suggest:
- to_pyarrow(self) -> pa.Table
- to_pyarrow_batches(self) -> Iterator[pa.RecordBatch]
For the sqlalchemy backends that don't have special methods there's going to be at least a minor annoyance: the values returned from a cursor look like tuples and dicts but are actually sqlalchemy.engine.row.LegacyRow and sqlalchemy.engine.row.RowMapping instances.
It seems like pyarrow really doesn't like these, e.g.
```
In [168]: batch[:5]
Out[168]: [(1, 2173), (1, 943), (1, 892), (1, 30), (1, 337)]

In [169]: pa_schema = pa.struct([("l_orderkey", pa.int32()), ("l_partkey", pa.int32())])

In [170]: pa.array(batch[:5], type=pa_schema)
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Input In [170], in <cell line: 1>()
----> 1 pa.array(batch[:5], type=pa_schema)
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /nix/store/z9qn3g22d8nx1x4mgzq3497iy8ji5h8x-python3-3.10.6-env/lib/python3.10/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status()
ArrowTypeError: Could not convert 1 with type int: was expecting tuple of (key, value) pair
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:938  GetKeyValuePair(items, i)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1010  InferKeyKind(items)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/iterators.h:73  func(value, static_cast<int64_t>(i), &keep_going)
/build/apache-arrow-9.0.0/cpp/src/arrow/python/python_to_arrow.cc:1182  converter->Extend(seq, size)
```
versus a "proper" tuple:
```
In [171]: pa.array(map(tuple, batch[:5]), type=pa_schema)
Out[171]:
<pyarrow.lib.StructArray object at 0x7fd4fb52d660>
-- is_valid: all not null
-- child 0 type: int32
  [
    1,
    1,
    1,
    1,
    1
  ]
-- child 1 type: int32
  [
    2173,
    943,
    892,
    30,
    337
  ]
```
For a first cut we can call map(tuple, ...) so that something works, but I wonder if we can fix this on either the sqlalchemy or pyarrow side (or maybe there's an option in sqlalchemy that I'm missing that already allows for this).
Ideally we can change pyarrow to allow Sequences and Mappings through instead of being so strict about the specific type.
Upstream issue with pyarrow: https://issues.apache.org/jira/browse/ARROW-17582
Fixed by #4454. Closing.