datafusion-python
Support reading from PyArrow datasets
Given the success of the Datasets + DuckDB integration, a similar integration might be worthwhile in this module.
The Datasets API accepts a filter and a column subset, and provides an iterator of Arrow record batches. I think that could be wrapped in a TableProvider, though I'm unclear on how predicate pushdown is implemented in DataFusion.
Predicate pushdown is supported as an argument to the scan method. The doc you linked is out of date; you should see that argument in the latest version: https://docs.rs/datafusion/latest/datafusion/datasource/datasource/trait.TableProvider.html#tymethod.scan.