feat: add to_pyarrow and to_pyarrow_batches
Adds to_pyarrow and to_pyarrow_batches to the alchemy backends, datafusion, and pandas. More to come.
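As a rough usage sketch (the connection and column names here are illustrative, not from this PR; where exactly the methods hang off the expression versus the backend is covered in the summary further down):

```python
import ibis

con = ibis.duckdb.connect()   # hypothetical in-memory connection
t = con.table("events")       # hypothetical table

# Execute the expression and get the result back as pyarrow data
# instead of a pandas DataFrame.
result = t.filter(t.value > 0).to_pyarrow()
```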
Some open questions / issues:
- Where should the schema inference live? The type mapping is already defined in the pyarrow backend, but it feels weird to import that backend into other backends.
- `chunk_size`? `chunksize`? `batch_size`?
- In this context, are chunks / batches properly defined in terms of rows, or is that a conflation?
- DataFusion has some slightly wonky behavior when it comes to keeping a consistent schema across record batches.
- DuckDB does not seem to respect its own `chunksize` argument.
xref #4443
Test Results
35 files, 35 suites, 1h 15m 40s :stopwatch:
10 051 tests: 7 873 :heavy_check_mark:, 2 178 :zzz:, 0 :x:
36 686 runs: 28 308 :heavy_check_mark:, 8 378 :zzz:, 0 :x:
Results for commit 68298f1a.
:recycle: This comment has been updated with latest results.
Codecov Report
Merging #4454 (3520b3e) into master (fced465) will decrease coverage by 0.41%. The diff coverage is 91.44%.
@@ Coverage Diff @@
## master #4454 +/- ##
==========================================
- Coverage 92.41% 92.00% -0.42%
==========================================
Files 188 188
Lines 20362 20499 +137
Branches 2780 2800 +20
==========================================
+ Hits 18818 18860 +42
- Misses 1163 1256 +93
- Partials 381 383 +2
| Impacted Files | Coverage Δ | |
|---|---|---|
| ibis/expr/types/generic.py | 91.02% <ø> (ø) | |
| ibis/util.py | 63.21% <76.92%> (+1.10%) | :arrow_up: |
| ibis/backends/duckdb/__init__.py | 86.07% <82.92%> (-1.22%) | :arrow_down: |
| ibis/backends/base/__init__.py | 89.61% <94.11%> (+0.72%) | :arrow_up: |
| ibis/backends/datafusion/__init__.py | 77.77% <95.00%> (-4.05%) | :arrow_down: |
| ibis/backends/base/sql/__init__.py | 90.90% <100.00%> (+1.38%) | :arrow_up: |
| ibis/backends/pandas/__init__.py | 87.71% <100.00%> (+2.81%) | :arrow_up: |
| ibis/backends/pyarrow/datatypes.py | 81.81% <100.00%> (+4.04%) | :arrow_up: |
| ibis/expr/types/core.py | 93.80% <100.00%> (+0.34%) | :arrow_up: |
| ibis/backends/snowflake/datatypes.py | 36.36% <0.00%> (-53.04%) | :arrow_down: |
| ... and 6 more | | |
Hey @kszucs -- this is ready for another look
Adds to_pyarrow and to_pyarrow_batches to the BaseBackend.
to_pyarrow returns pyarrow objects consistent with the dimension of
the output:
- a table -> pa.Table
- a column -> pa.Array
- a scalar -> pa.Scalar
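A minimal sketch of that dimension handling, assuming a table expression `t` with a numeric column `x` (names are illustrative):

```python
import pyarrow as pa

tbl = t.to_pyarrow()          # table expression  -> pa.Table
col = t.x.to_pyarrow()        # column expression -> pa.Array
val = t.x.sum().to_pyarrow()  # scalar expression -> pa.Scalar

assert isinstance(tbl, pa.Table)
assert isinstance(col, pa.Array)
assert isinstance(val, pa.Scalar)
```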
to_pyarrow_batches returns a RecordBatchReader that yields pyarrow record batches. It does not have the same dimension handling because that is not available in RecordBatchReaders.
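The reader can then be consumed with the standard pyarrow API, for example (expression name illustrative):

```python
import pyarrow as pa

reader = expr.to_pyarrow_batches()  # pyarrow.RecordBatchReader

# The schema is available up front, and iteration yields RecordBatch objects.
print(reader.schema)
for batch in reader:
    assert isinstance(batch, pa.RecordBatch)

# Or, if streaming isn't needed, collect everything into a single table instead:
# table = reader.read_all()
```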
to_pyarrow_batches is implemented for AlchemyBackend, datafusion,
and duckdb.
The pandas backend has to_pyarrow implemented using
pandas.DataFrame.to_pyarrow().
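For reference, and assuming that helper ultimately boils down to pyarrow's standard pandas conversion, the equivalent public-API call looks like:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# pyarrow infers the schema here; the ibis schema could instead be mapped to a
# pyarrow schema and passed explicitly via the `schema=` argument.
table = pa.Table.from_pandas(df)
```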
Backends that do not already require pyarrow will only require it when the to_pyarrow* methods are used.
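A minimal sketch of that deferred requirement (not the actual implementation) is to import pyarrow inside the method and fail with a clear message:

```python
def to_pyarrow(self, expr, **kwargs):
    """Hypothetical sketch: pyarrow is only required once this is called."""
    try:
        import pyarrow as pa  # noqa: F401
    except ImportError as exc:
        raise ModuleNotFoundError(
            "pyarrow is required for the to_pyarrow* methods"
        ) from exc
    ...
```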
There are warnings on these methods to indicate that they are experimental and that they may break in the future irrespective of semantic versioning.
The DuckDB to_pyarrow_batches makes use of a proxy object to escape garbage collection, so that the underlying record batches remain available even after the cursor used to generate them would otherwise have been garbage collected (it isn't, because it is embedded in the proxy object).
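A sketch of that proxy pattern (names are hypothetical, not the actual DuckDB backend code): the batch generator closes over an object that owns the cursor, so the cursor stays referenced until the reader is exhausted.

```python
import pyarrow as pa


class _CursorProxy:
    """Hold the cursor so it isn't garbage collected while batches are read."""

    def __init__(self, cursor):
        self.cursor = cursor


def _batches(proxy, chunk_size):
    # `proxy` (and therefore the cursor) stays referenced for as long as
    # this generator is alive.
    yield from proxy.cursor.fetch_record_batch(chunk_size)  # assumes duckdb's API


def to_pyarrow_batches_sketch(cursor, schema, chunk_size=1_000_000):
    proxy = _CursorProxy(cursor)
    return pa.RecordBatchReader.from_batches(schema, _batches(proxy, chunk_size))
```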
to_pyarrow_batches is implemented for scalars even though it's ambiguous, because to_pyarrow uses the RecordBatchReader from to_pyarrow_batches to produce tables (and arrays, and scalars). When called on a Scalar, it returns a RecordBatchReader with a single batch that resolves to a pyarrow table with one row and one column.
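In other words (a sketch of the behaviour described above, not the exact implementation), the scalar case reduces the reader's one-row, one-column output back to a pa.Scalar:

```python
import pyarrow as pa

reader = scalar_expr.to_pyarrow_batches()  # hypothetical scalar expression
table = reader.read_all()                  # one row, one column

# Pull the single value back out as a pyarrow scalar.
value = table.column(0)[0]
assert isinstance(value, pa.Scalar)
```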
Thanks for the updates and the summary, going to have a more thorough look tomorrow.
@kszucs If you're happy with this can you merge it?