datafusion-python icon indicating copy to clipboard operation
datafusion-python copied to clipboard

Support reading from PyArrow datasets

Open wjones127 opened this issue 3 years ago • 1 comments

Given the success of the Datasets + DuckDB integration, a similar integration might be worthwhile in this module.

The datasets API allows taking filters and columns subset, and provides an iterator of Arrow record batches. I think that could be wrapped in a TableProvider, though I'm unclear how predicate pushdown is implemented in Datafusion.

wjones127 avatar Jan 09 '22 00:01 wjones127

Predicate pushdown is supported as an argument for the scan method, the doc you linked is out of date, you should see that argument in the latest version: https://docs.rs/datafusion/latest/datafusion/datasource/datasource/trait.TableProvider.html#tymethod.scan.

houqp avatar Jan 09 '22 19:01 houqp