feat: add to_pyarrow and to_pyarrow_batches

Open gforsyth opened this issue 3 years ago • 1 comments

Adds to_pyarrow and to_pyarrow_batches to the alchemy backends, datafusion, and pandas. More to come.

Some open questions / issues:

  • Where should the schema inference stuff live? The type mapping is already defined in the pyarrow backend but it feels weird to import that backend into other backends.
  • chunk_size? chunksize? batch_size?
  • In this context, are chunks / batches properly defined in terms of rows or is that a conflation?
  • Datafusion has some slightly wonky behavior when it comes to a consistent schema across record batches.
  • DuckDB does not seem to respect its own chunksize argument.

xref #4443

gforsyth avatar Sep 02 '22 16:09 gforsyth

Test Results

35 files, 35 suites, 1h 15m 40s :stopwatch:
10 051 tests: 7 873 :heavy_check_mark:, 2 178 :zzz:, 0 :x:
36 686 runs: 28 308 :heavy_check_mark:, 8 378 :zzz:, 0 :x:

Results for commit 68298f1a.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Sep 02 '22 17:09 github-actions[bot]

Codecov Report

Merging #4454 (3520b3e) into master (fced465) will decrease coverage by 0.41%. The diff coverage is 91.44%.

@@            Coverage Diff             @@
##           master    #4454      +/-   ##
==========================================
- Coverage   92.41%   92.00%   -0.42%     
==========================================
  Files         188      188              
  Lines       20362    20499     +137     
  Branches     2780     2800      +20     
==========================================
+ Hits        18818    18860      +42     
- Misses       1163     1256      +93     
- Partials      381      383       +2     
Impacted Files Coverage Δ
ibis/expr/types/generic.py 91.02% <ø> (ø)
ibis/util.py 63.21% <76.92%> (+1.10%) :arrow_up:
ibis/backends/duckdb/__init__.py 86.07% <82.92%> (-1.22%) :arrow_down:
ibis/backends/base/__init__.py 89.61% <94.11%> (+0.72%) :arrow_up:
ibis/backends/datafusion/__init__.py 77.77% <95.00%> (-4.05%) :arrow_down:
ibis/backends/base/sql/__init__.py 90.90% <100.00%> (+1.38%) :arrow_up:
ibis/backends/pandas/__init__.py 87.71% <100.00%> (+2.81%) :arrow_up:
ibis/backends/pyarrow/datatypes.py 81.81% <100.00%> (+4.04%) :arrow_up:
ibis/expr/types/core.py 93.80% <100.00%> (+0.34%) :arrow_up:
ibis/backends/snowflake/datatypes.py 36.36% <0.00%> (-53.04%) :arrow_down:
... and 6 more

codecov[bot] avatar Sep 22 '22 14:09 codecov[bot]

Hey @kszucs -- this is ready for another look

gforsyth avatar Sep 27 '22 20:09 gforsyth

Adds to_pyarrow and to_pyarrow_batches to the BaseBackend.

to_pyarrow returns pyarrow objects consistent with the dimension of the output:

  • a table -> pa.Table
  • a column -> pa.Array
  • a scalar -> pa.Scalar

to_pyarrow_batches returns a RecordBatchReader that yields pyarrow record batches. It does not have the same dimension handling because a RecordBatchReader has no notion of a column or scalar result.

to_pyarrow_batches is implemented for AlchemyBackend, datafusion, and duckdb.

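The pandas backend has to_pyarrow implemented by converting the executed pandas result with pyarrow (e.g. pyarrow.Table.from_pandas()); pandas itself has no DataFrame.to_pyarrow() method.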

Backends that do not already depend on pyarrow only need it installed when a to_pyarrow* method is called.
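A common way to defer an optional dependency until it is actually needed is a lazy import helper. The sketch below is illustrative only: `import_optional` and its error message are hypothetical names, not the helper used in this PR.

```python
import importlib


def import_optional(module_name, feature):
    """Import `module_name` only when `feature` is used (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise with a message that tells the user which feature needs it.
        raise ModuleNotFoundError(
            f"{feature} requires the optional dependency {module_name!r}"
        ) from exc


def to_pyarrow_stub():
    # The import happens at call time, so merely loading the backend
    # never requires pyarrow to be installed.
    return import_optional("pyarrow", "to_pyarrow")
```

Calling `import_optional("json", "demo")` returns the standard-library module, while a missing module raises an error that names the feature.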

There are warnings on these methods to indicate that they are experimental and that they may break in the future irrespective of semantic versioning.
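One way such a warning could be attached is with a small decorator. This is a minimal sketch, assuming a runtime `FutureWarning`; the decorator name, the warning category, and `to_pyarrow_demo` are assumptions for illustration and may differ from what the PR does.

```python
import functools
import warnings


def experimental(func):
    """Mark `func` as experimental by warning on every call (hypothetical)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is experimental and may break in a future "
            "release, irrespective of semantic versioning",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return func(*args, **kwargs)

    return wrapper


@experimental
def to_pyarrow_demo(rows):
    # Placeholder body standing in for a real export method.
    return list(rows)
```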

The DuckDB to_pyarrow_batches uses a proxy object that holds a reference to the cursor that generated the record batches. This keeps the cursor from being garbage collected, so the underlying record batches remain available for as long as the proxy is alive.
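The idea can be modelled without DuckDB at all. In the sketch below, `FakeCursor` and `BatchProxy` are hypothetical stand-ins: the fake cursor frees its buffer when it is collected, which mimics batches becoming invalid once their cursor goes away. It relies on CPython's reference counting to collect the cursor as soon as the last reference disappears.

```python
def _iter_batches(buffer, size):
    # Yields slices of `buffer`; holds no reference to the cursor itself.
    start = 0
    while start < len(buffer):
        yield buffer[start:start + size]
        start += size


class FakeCursor:
    """Stand-in for a cursor that owns the memory behind its batches."""

    def __init__(self, rows):
        self._buffer = list(rows)

    def batches(self, size):
        return _iter_batches(self._buffer, size)

    def __del__(self):
        # Simulate the result memory being released with the cursor.
        self._buffer.clear()


class BatchProxy:
    """Iterable that keeps the cursor alive as long as the proxy exists."""

    def __init__(self, cursor, batches):
        self._cursor = cursor  # strong reference prevents collection
        self._batches = batches

    def __iter__(self):
        return self._batches


def unsafe_batches(rows, size):
    cursor = FakeCursor(rows)
    return cursor.batches(size)  # cursor is dropped when this returns


def safe_batches(rows, size):
    cursor = FakeCursor(rows)
    return BatchProxy(cursor, cursor.batches(size))
```

Iterating the result of `unsafe_batches` finds the buffer already emptied, while the proxy returned by `safe_batches` still yields every batch.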

to_pyarrow_batches is implemented for scalars even though it's ambiguous, because to_pyarrow makes use of the RecordBatchReader from to_pyarrow_batches to output tables (and arrays, and scalars). If it is called on a Scalar, it returns a RecordBatchReader that has a single batch that resolves to a pyarrow table with one row and one column.

gforsyth avatar Sep 29 '22 18:09 gforsyth

Thanks for the updates and the summary, going to have a more thorough look tomorrow.

kszucs avatar Sep 29 '22 22:09 kszucs

@kszucs If you're happy with this can you merge it?

cpcloud avatar Oct 05 '22 11:10 cpcloud