presto icon indicating copy to clipboard operation
presto copied to clipboard

Add histograms for optimizer cost calculation

Open ZacBlanco opened this issue 1 year ago • 3 comments

Description

This commits provides two critical changes:

  1. Adds a new enum value to ColumnStatisticType: Histogram.
  2. Utilizes the new histograms in optimizer's cost calculations

Histogram SPI Additions

With this change, a new column statistic type for histograms is introduced. In addition, a new SPI class ConnectorHistogram is also introduced. This interface is designed to be able to be implemented by either the connectors or in the main presto codebase. This should allow connectors to return histogram statistics in any format regardless of the source as long as they implement the interface.

The API is straightforward and includes 2 methods.

  • cumulativeProbability(double value, boolean inclusive): -> CDF function
  • inverseCumulativeProbability(double probability) -> inverse CDF function

A simple reference implementation is provided inside of UniformDistributionHistogram. This implementation results in the same logic and same plans as the previous cost-based calculations. When the optimizer uses it, the calculations end up with the same number as prior to this PR, but now just utilizing the histogram API.

Additionally, to propagate statistics further up into a plan, another implementation of histograms is provided inside of DomainConstrainedHistogram. This class is used to bound a source histogram with a given domain as additional filters may be applied further up in the plan.

Cost Calculations

Previously all cost calculations were performed inside of the ComparisonStatsCalculator using logic from the StatisticRange class to calculate the filter factors of overlapping and intersecting ranges. Instead, this change introduces cost calculation using the new histogram model and API. The core of the filter proportion calculation logic exists in the new HistogramCalculator utility class.

If the underlying Histogram implementation is swapped from the UniformDistributionHistogram, then the stats calculator will calculate the costs using the histogram information correctly.

If for some reason the new logic breaks existing plans for queries outside of the TPC-H/TPC-DS tests we have in the codebase, a new session flag optimizer.use-histograms is provided in order to fall back on the old logic.

Testing

I verified through additional unit tests and the existing testing infrastructure that the implementation here results in the same cost and filter proportion calculations as the previous logic.

I also did a few small runs with TPC-DS SF1 on a real cluster with and without the cost calculations to verify there were no serious regressions

Motivation and Context

By using histograms for cost calculations, we can get much more accurate predictions on how predicates will affect the output of a node, potentially resulting in more cost-efficient plans.

Impact

  1. A new value in the ColumnStatisticType enum: HISTOGRAM
  2. A new SPI interface to be implemented by connectors or inside presto-main: ConnectorHistogram

Test Plan

  1. Use existing test coverage to ensure no regressions
  2. Extensive unit testing for corner cases without statistics (a majority of cases) to ensure no regressions

Contributor checklist

  • [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • [x] If release notes are required, they follow the release notes guidelines.
  • [x] Adequate tests were added if applicable.
  • [x] CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Histograms can now be collected by presto if connectors support them.

ZacBlanco avatar Oct 24 '23 21:10 ZacBlanco

A few TODOs just to give an idea of where I want to improve this PR:

  1. ~Add a configuration property to choose to use histogram calculation over the previous logic - potentially with a log message when configured if the new logic is different than the old logic?~
  2. ~Clean up the logic inside of HistogramCalculator. The existing implementation is somewhat hacky and I did what I needed to do in order to pass the unit tests, but I think there are some easy simplifications of the logic. One of the challenges is that by using Estimate as the result for mins/max we lose some granularity as we don't know if an Unknown value is Infinite or NaN, which caused the logic to become quite a bit more complex than in the StatisticRange class.~

These have been completed

ZacBlanco avatar Oct 24 '23 22:10 ZacBlanco

The header check is failing on presto-benchto-benchmarks/src/test/resources/sql/presto/tpcds/q85.plan.txt. This file never had a license header and is only used for checking test output. I think the check can be safely ignored.

ZacBlanco avatar Dec 15 '23 01:12 ZacBlanco

I ran a quick experiment using TPC-DS SF1 on a real cluster to verify the performance does not degrade. image

I've also provided the raw data for these plots

tpcds-sf1-results.tar.gz

ZacBlanco avatar Feb 03 '24 03:02 ZacBlanco

Just some nits and high level questions. I think the header check is failing again btw

aaneja avatar Feb 29 '24 18:02 aaneja

... I think the header check is failing again btw

I reverted the commit that originally fixed the header check. #21875 was filed for it. Deepak said he would address it but that hasn't been completed yet. For now, We should be able to ignore the ci/header-check.

ZacBlanco avatar Feb 29 '24 18:02 ZacBlanco

Suggest rephrasing release note entry to something like the following, based on the Order of changes in the release notes guidelines:

== RELEASE NOTES ==

General Changes
* Add histogram ColumnStatisticType to Presto for the optimizer and connectors that support them

steveburnett avatar Feb 29 '24 19:02 steveburnett

From a testing perspective, I think we should add some E2E SQL tests (ie presto-tests) - these can use the existing abstract tests but with optimizer_use_histograms turned on to prove no regressions Additionally, we could add some presto-main unit tests targeted towards filter & join selectivity. If we can augment the plan matcher to match on node cardinality, we can add tighter assertions w.r.t selectivity with and without histograms

aaneja avatar Mar 01 '24 15:03 aaneja

Currently, I have enabled this feature by default because it should be backwards compatible in nearly all cases. So the tests in presto-main and presto-tests should be running with this on already.

Additionally, we could add some presto-main unit tests targeted towards filter & join selectivity. If we can augment the plan matcher to match on node cardinality, we can add tighter assertions w.r.t selectivity with and without histograms

I think it's probably a good idea to check that plans are equivalent with the flag on/off but I feel the gain is marginal given that we already have many unit tests as well as the TPC-DS/H cost plan tests. I feel it might be better to add similar tests to the TPC-DS/H cost plan tests but check the cost estimates rather than plan shape?

ZacBlanco avatar Mar 01 '24 17:03 ZacBlanco

My bad - I didn't realize histograms option was defaulting to ON. Should we be shipping with this turned on though ? There have been regressions due to optimizer features, so it's best to have users opt-in and report results to us rather than surprise them

I feel it might be better to add similar tests to the TPC-DS/H cost plan tests but check the cost estimates rather than plan shape

Yes this is exactly what I meant by matching on node cardinality (row counts). If we augment the PlanMatchPattern and related classes to match for row counts, we could add a planner test that does strong assertions on expected cardinality

aaneja avatar Mar 02 '24 08:03 aaneja

IMO, setting it to on by default would be ideal. My worry is that people might forget to set the property once we introduce support in the connectors. It would make for a confusing situation where even if the metastore has histograms stored and generated by presto that they aren't utilized.

I understand we tend to tend to be conservative with these types so I've updated the default to off. In a later PR where I intend to introduce the support in Iceberg, I will make sure the documentation very clearly states the session property should be set to true to take advantage of the histograms.


Re: testing

Now that the default is off, I've explicitly set up some tests to run in a parameterized way such that the histogram and non-histogram code paths are stressed.

See:

  • AbstractCostBasedPlanTest
  • AbstractTestComparisonStatsCalculator
  • AbstractTestFilterStatsCalculator

These suites are where I had to make changes previously due to cost changes when histograms defaulted to on, so the new tests changes should ensure that we don't have any major regressions.

ZacBlanco avatar Mar 06 '24 19:03 ZacBlanco

@aaneja and I had a discussion about the disjoint histogram which had us asking the question about what Spark does for filter estimation in OR cases when histograms are involved.

The relevant block of code is here: https://github.com/apache/spark/blob/ce93c9fd86715e2479552628398f6fc11e83b2af/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L82-L121

They don't change the logic when histograms are involved or not. Basically the filter factor is always calculated as percent1 + percent2 - (percent1 * percent2). There is a separate chunk of code for calculations using histograms if they exist

  • For numeric comparisons: https://github.com/apache/spark/blob/ce93c9fd86715e2479552628398f6fc11e83b2af/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L493-L528
  • For equality: https://github.com/apache/spark/blob/ce93c9fd86715e2479552628398f6fc11e83b2af/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L555-L581

ZacBlanco avatar Mar 18 '24 18:03 ZacBlanco

Overall looks good, @tdcmeehan can you please review the SPI changes. This PR doesn't include loading the histograms from connectors and the SPI changes are currently noop. So, I'm uncertain if this PR is the most appropriate for these changes.

jaystarshot avatar Mar 23 '24 00:03 jaystarshot

LGTM.

Just to confirm, current CBO will not be affected with this feature to be off?

Yes, see: https://github.com/prestodb/presto/pull/21236#discussion_r1540381711

ZacBlanco avatar Mar 27 '24 02:03 ZacBlanco

@ZacBlanco @tdcmeehan We observe latency regression/query timeout caused by this PR, detail in https://github.com/prestodb/presto/pull/22661, and need to revert this PR

feilong-liu avatar May 03 '24 17:05 feilong-liu