datafusion
datafusion copied to clipboard
Default to collecting statistics when creating LIstingTables
Today, when creating tables of parquet files using CREATE EXTERNAL TABLE or ListingTables, statistics are not gathered.
This is good in the sense that creating the table is fast(er) but subsequent queries might be slower
The behavior is clarified in
- https://github.com/apache/datafusion/pull/16157
@davisp suggests that defaulting to collecting statistics would make more sense (and I agree):
I’ll also note that my personal preference would be to default to true purely because it took a surprising amount of work to figure out how to even report #15908 not knowing that statistics collection was a config option. I do see the rationale around the behavior change, though I’d say either way that flag is defaulted is a behavior change and true seems like a saner default.
Originally posted by @davisp in https://github.com/apache/datafusion/pull/16080#pullrequestreview-2852069846
I suggest we change the default value of datafusion.execution.collect_statistics so statistics are collected on table creation
take
Registering my official +1 to default to collecting statistics.
For reference, I was working on the TPC-H benchmarks with a scale factor of 20 which generates roughly 20GiB of CSV data to process. Without statistics, query 17 or 18 would OOM on a 32GiB machine after about 20s. With statistics it never uses more than 1.9G and finishes in about 5s.
I generally agree that statistics collection is probably a bit slower than not collecting, but my guess is that any folks that notice and/or care would be doing something like high frequency queries against tiny datasets which seems like a niche use case. However, I’d wager a minute amount of money that the benefits would be worth the cost around the 500MiB to 1GiB range (based on my assumption that statistics collection is just scanning row groups, not full table scans).
I am sorry for the delay on a resolution to this issue. I have been busy at work and the workload will remain the same at least for the next 3 weeks so I prefer to release this issue for someone else to take it and not become a blocker.
Thanks @brayanjuls -- hope all goes well at work
Is there anything to this issue besides changing the default datafusion.execution.collect_statistics, fixing any tests that rely on the default value being false and and changing docs accordingly? Would be glad to take it, just wondering if there's anything I'm missing.
Is there anything to this issue besides changing the default
datafusion.execution.collect_statistics, fixing any tests that rely on the default value beingfalseand and changing docs accordingly? Would be glad to take it, just wondering if there's anything I'm missing.
I think that just about covers the issue
https://github.com/apache/datafusion/pull/16447 is merged so closing this
@AdamGS / @blaginin do you think we should backport this to 48.0.1?
- https://github.com/apache/datafusion/issues/16486 I kind of think maybe we should to reduce churn
But then again it seems like a big change for a patch set
I am torn
It's technically reverse the codebase to the previous state, so IMO should be fine to backport? Happy to create a PR to df-48
It's technically reverse the codebase to the previous state, so IMO should be fine to backport? Happy to create a PR to df-48
I think technically it reverses the DataFrame API back to the previous state (in 47.0.0) but changes the default for CREATE EXTERNAL TABLE
How about we make a backport PR and discuss further there?
I was testing this out recently and it still doesn't appear to be working quite as expected for some reason:
- https://github.com/apache/datafusion/issues/18952