calcite icon indicating copy to clipboard operation
calcite copied to clipboard

[CALCITE-6830] AVG() returns double precision by default when its argument type is INT_TYPES

Open LantaoJin opened this issue 11 months ago • 7 comments

To reproduce (agg.iq)

!use post
!set outputformat mysql

SELECT avg(deptno) as a FROM emp;

Should return 22.1429, but returns 22.

The current default logic of Calcite setting the return type to be same as the input column. For example, if the input column is INT, the return type for AVG() is INT. This behavior is different from other databases. For example:

Postgres

numeric for any integer type argument, double precision for a floating-point argument, otherwise the same as the argument data type

MySQL

The SUM() and AVG() functions return a DECIMAL value for exact-value arguments (integer or DECIMAL), and a DOUBLE value for approximate-value arguments (FLOAT or DOUBLE).

Presto

avg(x) → double. (https://prestodb.io/docs/current/functions/aggregate.html)

Besides, since the AVG agg with integer argument returns integer type, the behaviors of var_pop, var_samp, stddev_pop, stddev_samp and stddev are same as avg.

This PR targets to change the default return type of avg to double precision when its argument type are TINYINT, SMALLINT, INTEGER or BIGINT.

LantaoJin avatar Feb 13 '25 08:02 LantaoJin

DOUBLE is almost never the right type in SQL, since computations on double values are in general non-deterministic. If any type is right, a variant of DECIMAL should be used for AVG. Can you propose an algorithm to choose the precision and scale for the result? Ideally Calcite would follow what other databases do - at least databases that have proper DECIMAL types; the Postgres DECIMAL type is not a standard one, for example.

For functions like STDDEV it's a bit more complicated, since they involve a square root at the last step.

Moreover, this change is backwards-incompatible, and it may cause problems for all projects that have relied on this behavior. At the very least this change should be documented in history.md as such.

mihaibudiu avatar Feb 13 '25 17:02 mihaibudiu

DOUBLE is almost never the right type in SQL, since computations on double values are in general non-deterministic. If any type is right, a variant of DECIMAL should be used for AVG. Can you propose an algorithm to choose the precision and scale for the result?

The CUME_DIST and PERCENT_RANK return DOUBLE type in Calcite too. And in Calcite, the default precision is 15 for DOUBLE, 17 for DECIMAL. Will using the default DOUBLE precision in AVG result in non-deterministic? If not, any specific reason to use DECIMAL for AVG? In Postgres, the DOUBLE precision for AVG calculated based on the input values as follows (not quite sure for now):

precision = max(precision(input_values) + ceil(log10(count(input_values))), scale(input_values) + decimal_places)

Not sure what precision algorithm would be the best, but definitely not return integer for AVG by default (with INT_TYPES argument).

LantaoJin avatar Feb 14 '25 09:02 LantaoJin

In general, the result of double addition aggregation depends on the order in which the numbers are processed, whereas the addition of decimals gives always the same result.

It's true that this can only happen for very large integers, since double has 53 bits of mantissa. So any integer value which requires less than 53 bits will be represented exactly, and if all intermediate addition results are less than 53 bits the result will be deterministic and exact; the only imprecision will be produced by the final division. The result of division in FP and DECIMAL can also be slightly different. In general for financial-type computations you should always use DECIMAL.

Your formula depends on the number of input values, which is not known when you compile the SQL program. Calcite is statically-typed, so it has to choose the type at compile-time.

mihaibudiu avatar Feb 14 '25 18:02 mihaibudiu

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 90 days if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Mar 24 '25 03:03 github-actions[bot]

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 90 days if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Jun 28 '25 03:06 github-actions[bot]

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 90 days if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Oct 25 '25 03:10 github-actions[bot]