
add statistics to arrow datasets

Open djouallah opened this issue 1 year ago • 8 comments

I am using the delta-rs Python package with DuckDB. It works rather well, and thanks for that, but I noticed that it is substantially slower than reading directly from Parquet. I asked the DuckDB devs, and their answer is that an Arrow dataset does not contain any statistics, so many queries that depend on join ordering, for example, become problematic.

Is there a way to fix that?

djouallah avatar Nov 11 '23 10:11 djouallah

@djouallah Did the DuckDB devs say how external libraries can provide statistics to DuckDB? We have various kinds of statistics; I just don't know how to provide them to DuckDB and which it needs for join orders.

wjones127 avatar Nov 11 '23 17:11 wjones127

As far as I know, the flow is Deltalake->Arrow->DuckDB. If they need stats in DuckDB, does that mean they are not pushing down any query to Deltalake? I don't know if there is a way to add stats to a pyarrow dataframe.

r3stl355 avatar Nov 11 '23 19:11 r3stl355

pyarrow Datasets are made up of Fragments (I think that's what they call them), which for us correspond to files. These can also contain statistics, and those should already be set. If we pass a reader to DuckDB, I think the only way is to actually iterate through all the batches. In the case of a table, everything is materialized.

So as far as I know, the Dataset generated by delta-rs should contain statistics; whether DuckDB can leverage them in this form is a different question :).

@djouallah, can you confirm you used to_pyarrow_dataset for this?

roeap avatar Nov 11 '23 19:11 roeap

Polars is able to do pushdowns through a pyarrow dataset, so theoretically DuckDB should be able to do that as well.

ion-elgreco avatar Nov 11 '23 19:11 ion-elgreco

My interpretation of "their answer is that arrow dataset does not contains any statistics" is that it's not about pushdown but about surfacing information to their query planner. For example, they would consider the estimated number of rows in each source table to determine order. So I think this is separate from statistics-based page / file pruning, which we do indeed perform.
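To make the distinction concrete, here is a minimal sketch of the kind of table-level estimate a planner could use: summing the per-file `numRecords` values stored in the `stats` JSON of Delta log `add` actions. The sample actions, file names, and the `estimate_row_count` helper are hypothetical and exist only for illustration. The sketch ignores deletion vectors, and how such a number would actually be handed to DuckDB is not settled in this thread.

```python
import json

# Hypothetical "add" actions, shaped like entries in a Delta transaction log.
# `stats` is a JSON-encoded string; `numRecords` is the per-file row count.
add_actions = [
    {
        "path": "part-0000.parquet",
        "stats": json.dumps(
            {"numRecords": 1000, "minValues": {"id": 1}, "maxValues": {"id": 1000}}
        ),
    },
    {
        "path": "part-0001.parquet",
        "stats": json.dumps(
            {"numRecords": 250, "minValues": {"id": 1001}, "maxValues": {"id": 1250}}
        ),
    },
    # Stats are optional, so a file may carry none at all.
    {"path": "part-0002.parquet"},
]


def estimate_row_count(actions):
    """Sum numRecords over all files.

    Returns (row_count, is_exact). is_exact is False when at least one
    file has no usable stats, in which case row_count is a lower bound.
    """
    total = 0
    is_exact = True
    for action in actions:
        raw_stats = action.get("stats")
        if raw_stats is None:
            is_exact = False
            continue
        num_records = json.loads(raw_stats).get("numRecords")
        if num_records is None:
            is_exact = False
            continue
        total += num_records
    return total, is_exact


estimated_rows, is_exact = estimate_row_count(add_actions)
print(estimated_rows, is_exact)
```

A planner comparing two such estimates could, for instance, choose the smaller table as the build side of a hash join. That is a different use of the same statistics than skipping files whose min/max range cannot match a filter.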

wjones127 avatar Nov 11 '23 19:11 wjones127

Yes exactly

djouallah avatar Nov 12 '23 01:11 djouallah

Understood. In DataFusion we have this information here: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.Statistics.html

In DuckDB the closest thing I can find is: https://github.com/duckdb/duckdb/blob/a00b28f5d453ff7ec3b3837385f083d0887124ad/src/include/duckdb/storage/statistics/node_statistics.hpp#L16

I don't think Polars has any notion of this.

I'll think about ways this could be put in some abstraction that can be read by DuckDB.

wjones127 avatar Nov 12 '23 02:11 wjones127

Link to the DuckDB devs' comments: https://github.com/duckdb/duckdb/discussions/4636#discussioncomment-7497063

djouallah avatar Nov 12 '23 08:11 djouallah