presto
presto copied to clipboard
RFC-0005. Implementation Scalar function stats propagation, Phase 1
Description
- Support for annotating functions with both constant stats and propagating source stats.
- Added tests for the same.
- Added Scalar stats calculation based on annotation and tests for the same.
- Introduced the feature flag and session flag.
Not added SQLInvokedScalarFunctions. Not annotated builtin functions, as that is covered in next implementation phase. Not added C++ changes as this phase only covers Java side of changes.
Motivation and Context
https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md
Impact
None unless the user chooses to enable the feature via setting the session/feature flag.
A new session flag, scalar_function_stats_propagation_enabled and a new feature config will be introduced i.e. optimizer.scalar-function-stats-propagation-enabled, by setting this session flag or feature flag, this feature can be turned on or off.
Test Plan
Contributor checklist
- [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [x] If release notes are required, they follow the release notes guidelines.
- [x] Adequate tests were added if applicable.
- [x] CI passed.
Release Notes
Please follow release notes guidelines and fill in the release notes below.
== RELEASE NOTES ==
General Changes
* Add configuration property ``optimizer.scalar-function-stats-propagation-enabled`` and session property ``scalar_function_stats_propagation_enabled`` to enable stats propagation by annotation, supporting `RFC-5 <https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md>`_. :pr: `23545`
-
Suggest including documentation for the new configuration property and session property, following the Review and Commit Guidelines.
-
Suggest revision to the release note entry following the Release Notes Guidelines:
== RELEASE NOTES ==
General Changes
* Add configuration property ``optimizer.scalar-function-stats-propagation-enabled`` and session property ``scalar_function_stats_propagation_enabled`` to enable stats propagation by annotation, supporting `RFC-5 <https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md>`_. :pr: `23545`
Thank you for the comment @steveburnett , do you have a suggestion on documenting this feature in full detail e.g. developer docs.
Hi @ZacBlanco , can you please take a look!
Thank you for the comment @steveburnett , do you have a suggestion on documenting this feature in full detail e.g. developer docs.
Yes, thanks! If these properties are relevant to Presto, please add entries for them to https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/admin/properties.rst.
If these are Presto C++ (Prestissimo), please add the entries to https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/features.rst#session-properties and https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/properties.rst as appropriate.
@aaneja still working on the feedback
@aaneja Please take a look again!
Hi @elharo , thanks for taking a look. Relevant docs and comments are added/improved.
Regarding you comment on why we are using double as the return type for average row size and distinct values count, because these values are actually estimates and may be fractional. Infact for this reason VariableStatsEstimate class has all these fields values as double type. We could have used integer at some places for example getReturnTypeWidth, but that adds complexity of repeated type casting/conversions. Double type also allows us to set the value to NaN, indicating it as an unknown.
EDIT: https://github.com/prestodb/presto/pull/23545#discussion_r1793207848
Hi @elharo is it good to go?
I'm still not sold on using doubles for row counts and the like. Doubles are not just a bigger long and have round off and representation problems integers don't. If these things are really going to exceed the size of a long (which I'm not sure they do) then a BigInteger would be preferred.
BigInteger does not exist in java natively. May be this is a decision that presto project has made much before this PR came in. Will it be a good idea, if you can start a mailing group thread with broader audience?
BigInteger isn't a primitive type, but java.math.BigInteger is available if you need it. Again, I'm not convinced you do. long should work here.
Rebased with master