gpdb
gpdb copied to clipboard
Add GUC to enable HLL collection during analyze
Currently, analyze will only collect HLL specific stats if it is a leaf partition. This presents an issue if an ordinary table is analyzed, then exchanged or added to a partition table. The leaf partition in this case won't have HLL stats, and this information won't be merged into the root partition, which can cause poor Orca plans.
This commit adds a GUC, optimizer_analyze_always_collect_hll, that when enabled will also collect HLL stats for ordinary tables during analyze.
A couple other approaches were considered here.
-
Always collect HLL stats. This can cause catalog bloat, as the HLL value is quite wide and wouldn't be used in ordinary tables until it's part of a partitioned table.
-
Analyze/collect stats during exchange, with the assumption that the overhead from analyzing would be much less than the constraint checks that partitions already undergo when exchanged/added. However, in the use case we're looking at, the user disables this validation, so the overhead would be noticeable.
-
An additional syntax option in analyze. This wouldn't be able to be backported to 6X, but I believe this is also a valid option for the master branch.
Analyze/collect stats during exchange, with the assumption that the overhead from analyzing would be much less than the constraint checks that partitions already undergo when exchanged/added. However, in the use case we're looking at, the user disables this validation, so the overhead would be noticeable.
Interesting. Just for info, providing this option in first place wasn't good thing. With GPDB7 and forward no such options/choice is provided which can lead to inconsistent/wrong data to land in tables. So, for master branch given no such option exists, seem would be better to perform the analyze action as part of EXCHANGE instead of this separate external manual work-around?
-
How is the behavior - does user every-time have to analyze with
optimizer_analyze_always_collect_hllordinary table if required to collect this info or its first time only required and if HLL stats are currently present for the table next time with or without this GUC it will continue to collect the HLL stats? -
If GUC is required every-time can it lead to weird behaviors where once the tables was analyzed with
optimizer_analyze_always_collect_hlland next time without - then can have mixed/stale stat state?
Analyze/collect stats during exchange, with the assumption that the overhead from analyzing would be much less than the constraint checks that partitions already undergo when exchanged/added. However, in the use case we're looking at, the user disables this validation, so the overhead would be noticeable.
Interesting. Just for info, providing this option in first place wasn't good thing. With GPDB7 and forward no such options/choice is provided which can lead to inconsistent/wrong data to land in tables. So, for master branch given no such option exists, seem would be better to perform the analyze action as part of EXCHANGE instead of this separate external manual work-around?
Ah thanks, I wasn't aware of this. I agree, if there's no way to skip validation then we should do it during exchange, and I have that code working in another branch. We'll sync offline to discuss user requirements/expectations here. I'll mark this PR as a draft for now.
- How is the behavior - does user every-time have to analyze with
optimizer_analyze_always_collect_hllordinary table if required to collect this info or its first time only required and if HLL stats are currently present for the table next time with or without this GUC it will continue to collect the HLL stats?
Currently the GUC would be required whenever we want an ordinary table to collect HLL stats. The most likely use case for this GUC would likely be set at a database/cluster level.
- If GUC is required every-time can it lead to weird behaviors where once the tables was analyzed with
optimizer_analyze_always_collect_hlland next time without - then can have mixed/stale stat state?
Yes, in that case it would have stale stats. If we make it so it's only needed the first time, is there a case where a user may want to go back to not collecting HLL? At the same time, we want the effect of the GUC to be easy for users to understand.
- How is the behavior - does user every-time have to analyze with
optimizer_analyze_always_collect_hllordinary table if required to collect this info or its first time only required and if HLL stats are currently present for the table next time with or without this GUC it will continue to collect the HLL stats?Currently the GUC would be required whenever we want an ordinary table to collect HLL stats. The most likely use case for this GUC would likely be set at a database/cluster level.
I don't think setting at database/cluster level is desired as if that was the case we would have just enabled the collection for all non-partitioned tables. Seems to need to restrict to table level option. So, for GPDB7 and forward having it a reloption essentially feels similar to current Storage Parameters like fillfactor, parallel_workers and such...
- If GUC is required every-time can it lead to weird behaviors where once the tables was analyzed with
optimizer_analyze_always_collect_hlland next time without - then can have mixed/stale stat state?Yes, in that case it would have stale stats. If we make it so it's only needed the first time, is there a case where a user may want to go back to not collecting HLL? At the same time, we want the effect of the GUC to be easy for users to understand.
The main reason for my question is centered around how will things work for auto-stats (I believe the GUC setting at session level will still help here) or auto-analyze (which is a background job) where the GUC won't be or can't be set. So, during these invocations of analyze won't be able to do the job.
I think we at least maybe need WARNING to be emitted where HLL stats existed for the table before and the new analyze won't record the same to inform users that HLL stats would be missing.
Thanks for the comments! Closing this PR in favor of adding it as a reloption as suggested by @ashwinstar. That way, we don't need a GUC and it will be more predictable when autostats/autoanalyze is enabled.