datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Default to collecting statistics when creating LIstingTables

Open alamb opened this issue 6 months ago • 7 comments

Today, when creating tables of parquet files using CREATE EXTERNAL TABLE or ListingTables, statistics are not gathered.

This is good in the sense that creating the table is fast(er) but subsequent queries might be slower

The behavior is clarified in

  • https://github.com/apache/datafusion/pull/16157

@davisp suggests that defaulting to collecting statistics would make more sense (and I agree):

I’ll also note that my personal preference would be to default to true purely because it took a surprising amount of work to figure out how to even report #15908 not knowing that statistics collection was a config option. I do see the rationale around the behavior change, though I’d say either way that flag is defaulted is a behavior change and true seems like a saner default.

Originally posted by @davisp in https://github.com/apache/datafusion/pull/16080#pullrequestreview-2852069846

alamb avatar May 22 '25 19:05 alamb

I suggest we change the default value of datafusion.execution.collect_statistics so statistics are collected on table creation

alamb avatar May 22 '25 19:05 alamb

take

brayanjuls avatar May 22 '25 23:05 brayanjuls

Registering my official +1 to default to collecting statistics.

For reference, I was working on the TPC-H benchmarks with a scale factor of 20 which generates roughly 20GiB of CSV data to process. Without statistics, query 17 or 18 would OOM on a 32GiB machine after about 20s. With statistics it never uses more than 1.9G and finishes in about 5s.

I generally agree that statistics collection is probably a bit slower than not collecting, but my guess is that any folks that notice and/or care would be doing something like high frequency queries against tiny datasets which seems like a niche use case. However, I’d wager a minute amount of money that the benefits would be worth the cost around the 500MiB to 1GiB range (based on my assumption that statistics collection is just scanning row groups, not full table scans).

davisp avatar May 25 '25 21:05 davisp

I am sorry for the delay on a resolution to this issue. I have been busy at work and the workload will remain the same at least for the next 3 weeks so I prefer to release this issue for someone else to take it and not become a blocker.

brayanjuls avatar Jun 02 '25 05:06 brayanjuls

Thanks @brayanjuls -- hope all goes well at work

alamb avatar Jun 02 '25 17:06 alamb

Is there anything to this issue besides changing the default datafusion.execution.collect_statistics, fixing any tests that rely on the default value being false and and changing docs accordingly? Would be glad to take it, just wondering if there's anything I'm missing.

AdamGS avatar Jun 18 '25 18:06 AdamGS

Is there anything to this issue besides changing the default datafusion.execution.collect_statistics, fixing any tests that rely on the default value being false and and changing docs accordingly? Would be glad to take it, just wondering if there's anything I'm missing.

I think that just about covers the issue

alamb avatar Jun 18 '25 19:06 alamb

https://github.com/apache/datafusion/pull/16447 is merged so closing this

blaginin avatar Jun 19 '25 18:06 blaginin

@AdamGS / @blaginin do you think we should backport this to 48.0.1?

  • https://github.com/apache/datafusion/issues/16486 I kind of think maybe we should to reduce churn

But then again it seems like a big change for a patch set

I am torn

alamb avatar Jul 02 '25 13:07 alamb

It's technically reverse the codebase to the previous state, so IMO should be fine to backport? Happy to create a PR to df-48

blaginin avatar Jul 02 '25 13:07 blaginin

It's technically reverse the codebase to the previous state, so IMO should be fine to backport? Happy to create a PR to df-48

I think technically it reverses the DataFrame API back to the previous state (in 47.0.0) but changes the default for CREATE EXTERNAL TABLE

How about we make a backport PR and discuss further there?

alamb avatar Jul 02 '25 13:07 alamb

I was testing this out recently and it still doesn't appear to be working quite as expected for some reason:

  • https://github.com/apache/datafusion/issues/18952

alamb avatar Nov 26 '25 19:11 alamb