presto icon indicating copy to clipboard operation
presto copied to clipboard

RFC-0005. Implementation Scalar function stats propagation, Phase 1

Open ScrapCodes opened this issue 1 year ago • 10 comments

Description

  1. Support for annotating functions with both constant stats and propagating source stats.
  2. Added tests for the same.
  3. Added Scalar stats calculation based on annotation and tests for the same.
  4. Introduced the feature flag and session flag.

Not added SQLInvokedScalarFunctions. Not annotated builtin functions, as that is covered in next implementation phase. Not added C++ changes as this phase only covers Java side of changes.

Motivation and Context

https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md

Impact

None unless the user chooses to enable the feature via setting the session/feature flag. A new session flag, scalar_function_stats_propagation_enabled and a new feature config will be introduced i.e. optimizer.scalar-function-stats-propagation-enabled, by setting this session flag or feature flag, this feature can be turned on or off.

Test Plan

Contributor checklist

  • [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • [x] If release notes are required, they follow the release notes guidelines.
  • [x] Adequate tests were added if applicable.
  • [x] CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add configuration property ``optimizer.scalar-function-stats-propagation-enabled`` and session property ``scalar_function_stats_propagation_enabled`` to enable stats propagation by annotation, supporting `RFC-5 <https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md>`_. :pr: `23545`

ScrapCodes avatar Aug 29 '24 09:08 ScrapCodes

== RELEASE NOTES ==

General Changes
* Add configuration property ``optimizer.scalar-function-stats-propagation-enabled`` and session property ``scalar_function_stats_propagation_enabled`` to enable stats propagation by annotation, supporting `RFC-5 <https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md>`_. :pr: `23545`

steveburnett avatar Aug 29 '24 14:08 steveburnett

Thank you for the comment @steveburnett , do you have a suggestion on documenting this feature in full detail e.g. developer docs.

ScrapCodes avatar Aug 30 '24 07:08 ScrapCodes

Hi @ZacBlanco , can you please take a look!

ScrapCodes avatar Sep 11 '24 09:09 ScrapCodes

Thank you for the comment @steveburnett , do you have a suggestion on documenting this feature in full detail e.g. developer docs.

Yes, thanks! If these properties are relevant to Presto, please add entries for them to https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/admin/properties.rst.

If these are Presto C++ (Prestissimo), please add the entries to https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/features.rst#session-properties and https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/properties.rst as appropriate.

steveburnett avatar Sep 11 '24 14:09 steveburnett

@aaneja still working on the feedback

ScrapCodes avatar Sep 26 '24 08:09 ScrapCodes

@aaneja Please take a look again!

ScrapCodes avatar Sep 26 '24 12:09 ScrapCodes

Hi @elharo , thanks for taking a look. Relevant docs and comments are added/improved. Regarding you comment on why we are using double as the return type for average row size and distinct values count, because these values are actually estimates and may be fractional. Infact for this reason VariableStatsEstimate class has all these fields values as double type. We could have used integer at some places for example getReturnTypeWidth, but that adds complexity of repeated type casting/conversions. Double type also allows us to set the value to NaN, indicating it as an unknown.

EDIT: https://github.com/prestodb/presto/pull/23545#discussion_r1793207848

ScrapCodes avatar Sep 30 '24 08:09 ScrapCodes

Hi @elharo is it good to go?

ScrapCodes avatar Oct 16 '24 06:10 ScrapCodes

I'm still not sold on using doubles for row counts and the like. Doubles are not just a bigger long and have round off and representation problems integers don't. If these things are really going to exceed the size of a long (which I'm not sure they do) then a BigInteger would be preferred.

BigInteger does not exist in java natively. May be this is a decision that presto project has made much before this PR came in. Will it be a good idea, if you can start a mailing group thread with broader audience?

ScrapCodes avatar Oct 17 '24 05:10 ScrapCodes

BigInteger isn't a primitive type, but java.math.BigInteger is available if you need it. Again, I'm not convinced you do. long should work here.

elharo avatar Oct 17 '24 10:10 elharo

Rebased with master

ScrapCodes avatar Nov 20 '24 07:11 ScrapCodes