delta-rs
delta-rs copied to clipboard
add statistics to arrow datasets
I am using delta_rs python with DuckDB, it works rather and thanks for that, but I did notice that it is substantially slower than reading directly from Parquet, I asked the DuckDB devs and their answer is that arrow dataset does not contains any statistics, so a lot of queries that involved join orders for example become problematics
is there a way to fix that ?
@djouallah Did the DuckDB devs say how external libraries can provide statistics to DuckDB? We have various kinds of statistics; I just don't know how to provide them to DuckDB and which it needs for join orders.
As far as I know the flow is Deltalake->Arrow->DuckDB, if they need stats in DuckDB does it mean they are not pushing down any query to Deltalake? I don't know if there is a way to add stats to pyarrow dataframe.
pyarrow Dataset
s are made up of Fragments
(I think thats what they call them), which for us corresponds to files. These can also contain statistics, but this should already be set. IF we pass a reader to DuckDB, I think the only way is to actually iterate through all the batches. IN case of table, all is materialized.
So as far as I know, the Dataset
generated by delta-rs should contain statistics, if DuckDB ran leverage them in this form is a different question :).
@djouallah, can you confirm you used to_pyarrow_dataset
for this?
Polars is able to do pushdowns through pyarrow dataset, so theoretically duckdb should be able to do that as well
My interpretation of "their answer is that arrow dataset does not contains any statistics" is that it's not about pushdown but about surfacing information to their query planner. For example, they would consider the estimated number of rows in each source table to determine order. So I think this is separate from statistics-based page / file pruning, which we do indeed perform.
Yes exactly
Understood. In DataFusion we have this information here: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.Statistics.html
In DuckDB the closest thing I can find is: https://github.com/duckdb/duckdb/blob/a00b28f5d453ff7ec3b3837385f083d0887124ad/src/include/duckdb/storage/statistics/node_statistics.hpp#L16
I don't think Polars has any notion of this.
I'll think about ways this could be put in some abstraction that can be read by DuckDB.
link to duckdb devs comments https://github.com/duckdb/duckdb/discussions/4636#discussioncomment-7497063