presto
presto copied to clipboard
Add histograms for optimizer cost calculation
Description
This commits provides two critical changes:
- Adds a new enum value to ColumnStatisticType: Histogram.
- Utilizes the new histograms in optimizer's cost calculations
Histogram SPI Additions
With this change, a new column statistic type for histograms is introduced. In addition, a new SPI class ConnectorHistogram
is also introduced. This interface is designed to be able to be implemented by either the connectors or in the main presto codebase. This should allow connectors to return histogram statistics in any format regardless of the source as long as they implement the interface.
The API is straightforward and includes 2 methods.
- cumulativeProbability(double value, boolean inclusive): -> CDF function
- inverseCumulativeProbability(double probability) -> inverse CDF function
A simple reference implementation is provided inside of UniformDistributionHistogram. This implementation results in the same logic and same plans as the previous cost-based calculations. When the optimizer uses it, the calculations end up with the same number as prior to this PR, but now just utilizing the histogram API.
Additionally, to propagate statistics further up into a plan, another implementation of histograms is provided inside of DomainConstrainedHistogram. This class is used to bound a source histogram with a given domain as additional filters may be applied further up in the plan.
Cost Calculations
Previously all cost calculations were performed inside of the ComparisonStatsCalculator using logic from the StatisticRange class to calculate the filter factors of overlapping and intersecting ranges. Instead, this change introduces cost calculation using the new histogram model and API. The core of the filter proportion calculation logic exists in the new HistogramCalculator utility class.
If the underlying Histogram implementation is swapped from the UniformDistributionHistogram, then the stats calculator will calculate the costs using the histogram information correctly.
If for some reason the new logic breaks existing plans for queries outside of the TPC-H/TPC-DS tests we have in the codebase, a new session flag optimizer.use-histograms
is provided in order to fall back on the old logic.
Testing
I verified through additional unit tests and the existing testing infrastructure that the implementation here results in the same cost and filter proportion calculations as the previous logic.
I also did a few small runs with TPC-DS SF1 on a real cluster with and without the cost calculations to verify there were no serious regressions
Motivation and Context
By using histograms for cost calculations, we can get much more accurate predictions on how predicates will affect the output of a node, potentially resulting in more cost-efficient plans.
Impact
- A new value in the
ColumnStatisticType
enum:HISTOGRAM
- A new SPI interface to be implemented by connectors or inside
presto-main
:ConnectorHistogram
Test Plan
- Use existing test coverage to ensure no regressions
- Extensive unit testing for corner cases without statistics (a majority of cases) to ensure no regressions
Contributor checklist
- [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [x] If release notes are required, they follow the release notes guidelines.
- [x] Adequate tests were added if applicable.
- [x] CI passed.
Release Notes
== RELEASE NOTES ==
General Changes
* Histograms can now be collected by presto if connectors support them.
A few TODOs just to give an idea of where I want to improve this PR:
- ~Add a configuration property to choose to use histogram calculation over the previous logic - potentially with a log message when configured if the new logic is different than the old logic?~
- ~Clean up the logic inside of
HistogramCalculator
. The existing implementation is somewhat hacky and I did what I needed to do in order to pass the unit tests, but I think there are some easy simplifications of the logic. One of the challenges is that by usingEstimate
as the result for mins/max we lose some granularity as we don't know if an Unknown value is Infinite or NaN, which caused the logic to become quite a bit more complex than in the StatisticRange class.~
These have been completed
The header check is failing on presto-benchto-benchmarks/src/test/resources/sql/presto/tpcds/q85.plan.txt
. This file never had a license header and is only used for checking test output. I think the check can be safely ignored.
I ran a quick experiment using TPC-DS SF1 on a real cluster to verify the performance does not degrade.
I've also provided the raw data for these plots
Just some nits and high level questions. I think the header check is failing again btw
... I think the header check is failing again btw
I reverted the commit that originally fixed the header check. #21875 was filed for it. Deepak said he would address it but that hasn't been completed yet. For now, We should be able to ignore the ci/header-check
.
Suggest rephrasing release note entry to something like the following, based on the Order of changes in the release notes guidelines:
== RELEASE NOTES ==
General Changes
* Add histogram ColumnStatisticType to Presto for the optimizer and connectors that support them
From a testing perspective, I think we should add some E2E SQL tests (ie presto-tests
) - these can use the existing abstract tests but with optimizer_use_histograms
turned on to prove no regressions
Additionally, we could add some presto-main unit tests targeted towards filter & join selectivity. If we can augment the plan matcher to match on node cardinality, we can add tighter assertions w.r.t selectivity with and without histograms
Currently, I have enabled this feature by default because it should be backwards compatible in nearly all cases. So the tests in presto-main
and presto-tests
should be running with this on already.
Additionally, we could add some presto-main unit tests targeted towards filter & join selectivity. If we can augment the plan matcher to match on node cardinality, we can add tighter assertions w.r.t selectivity with and without histograms
I think it's probably a good idea to check that plans are equivalent with the flag on/off but I feel the gain is marginal given that we already have many unit tests as well as the TPC-DS/H cost plan tests. I feel it might be better to add similar tests to the TPC-DS/H cost plan tests but check the cost estimates rather than plan shape?
My bad - I didn't realize histograms option was defaulting to ON. Should we be shipping with this turned on though ? There have been regressions due to optimizer features, so it's best to have users opt-in and report results to us rather than surprise them
I feel it might be better to add similar tests to the TPC-DS/H cost plan tests but check the cost estimates rather than plan shape
Yes this is exactly what I meant by matching on node cardinality (row counts). If we augment the PlanMatchPattern and related classes to match for row counts, we could add a planner test that does strong assertions on expected cardinality
IMO, setting it to on by default would be ideal. My worry is that people might forget to set the property once we introduce support in the connectors. It would make for a confusing situation where even if the metastore has histograms stored and generated by presto that they aren't utilized.
I understand we tend to tend to be conservative with these types so I've updated the default to off. In a later PR where I intend to introduce the support in Iceberg, I will make sure the documentation very clearly states the session property should be set to true
to take advantage of the histograms.
Re: testing
Now that the default is off, I've explicitly set up some tests to run in a parameterized way such that the histogram and non-histogram code paths are stressed.
See:
- AbstractCostBasedPlanTest
- AbstractTestComparisonStatsCalculator
- AbstractTestFilterStatsCalculator
These suites are where I had to make changes previously due to cost changes when histograms defaulted to on, so the new tests changes should ensure that we don't have any major regressions.
@aaneja and I had a discussion about the disjoint histogram which had us asking the question about what Spark does for filter estimation in OR
cases when histograms are involved.
The relevant block of code is here: https://github.com/apache/spark/blob/ce93c9fd86715e2479552628398f6fc11e83b2af/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L82-L121
They don't change the logic when histograms are involved or not. Basically the filter factor is always calculated as percent1 + percent2 - (percent1 * percent2)
. There is a separate chunk of code for calculations using histograms if they exist
- For numeric comparisons: https://github.com/apache/spark/blob/ce93c9fd86715e2479552628398f6fc11e83b2af/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L493-L528
- For equality: https://github.com/apache/spark/blob/ce93c9fd86715e2479552628398f6fc11e83b2af/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L555-L581
Overall looks good, @tdcmeehan can you please review the SPI changes. This PR doesn't include loading the histograms from connectors and the SPI changes are currently noop. So, I'm uncertain if this PR is the most appropriate for these changes.
LGTM.
Just to confirm, current CBO will not be affected with this feature to be off?
Yes, see: https://github.com/prestodb/presto/pull/21236#discussion_r1540381711
@ZacBlanco @tdcmeehan We observe latency regression/query timeout caused by this PR, detail in https://github.com/prestodb/presto/pull/22661, and need to revert this PR