What changes were proposed in this pull request?

This PR would disable HiveAggregateReduceFunctionsRule when the type of argument of AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) is a string-like type.

https://issues.apache.org/jira/browse/HIVE-28082

Why are the changes needed?

While testing Hive 4, we found the behavior of those UDFs changed since Hive 2. We identified HiveAggregateReduceFunctionsRule as the root cause. We believe Calcite rules should not change the original semantics.

This PR would disable the rule when it could cause inconsistency. That's because it is not simple to decompose AVG(str) into SUM(str) and COUNT(*), or correctly transform STDDEV_* or VAR_*.

Does this PR introduce any user-facing change?

The result could change if a user uses the UDFs.

Is the change a dependency upgrade?

No

How was this patch tested?

I added/updated test cases. You can see the result of the current master branch through the first commit to add a new test case. Results change with or without CBO.

Feb 20 '24 03:02 okumin

Quality Gate passed

Issues
2 New issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Feb 21 '24 09:02 sonarqubecloud[bot]

@kasakrisz Thanks for checking. I double-checked the SQL:2023 Part 2.

(A) 10.9 <aggregate function> -> Syntax Rules -> 7) -> g) mentions the specification of SUM and AVG. It says the type of value expression shall be numeric. So, if we strictly adhere to the standard, it could be best to throw an exception when a string type is specified.

(B) If we allow implicit cast here(I've not found the rule justifying it), it is reasonable to follow the rules of CAST. Based on 6.13 <cast specification> -> General Rules -> 8) -> b), it should throw an exception.

(C) If the implicit numeric cast of text can be null here(though it sounds like we accumulate too many assumptions, it can keep the current behavior...), based on 10.9 <aggregate function> -> General Rules -> 7) -> a), we should eliminate null values with warning. If the result set is empty, based on 10.9 <aggregate function> -> General Rules -> 7) -> d) -> ⅱ), it should return NULL.

So, if (A) we follow the standard strictly or (B) we accept implicit cast, SUM and AVG should fail. If we justify (C) the null conversion, avg('text') and sum('text') should return NULL. NULL / 1 should be NULL based on 6.30 <numeric value expression> -> General Rules -> 1). So, in this case, all the cases, avg('text'), sum('text'), sum('text') / count('text') should be null with raising a warning.

In summary, if we accept a breaking change, I would say either of the following ways sounds consistent. What do you think?

We no longer accept the string type for AVG or SUM. It leads to a huge incompatibility, but we can follow the standard
We will make the UDFs skip invalid texts and SUM('text') return NULL. Maybe we will fix GenericUDAFSum.

Apr 18 '24 15:04 okumin

Chiming in late on this topic .. I think that the CBO path should be the main focus since it has been the default for quite some time now - with Hive 3 and Hive 4. In this path, to avoid breaking change, we should not throw an exception for the string argument type for the aggregate function even though the SQL standard suggests otherwise. In an ETL operation, there could be one row out of millions that has 'dirty' data with string type but Hive has been permissive for such data, otherwise the whole job would fail. If it means that we should let SUM('text') return NULL (similar to AVG('text')), it seems a better option than erroring out.

Apr 30 '24 16:04 amansinha100

We will make the UDFs skip invalid texts and SUM('text') return NULL. Maybe we will fix GenericUDAFSum.

Yes, it seems reasonable because Hive doesn't follow the standard in the case of cast function either: Cast returns NULL instead of throwing exception due to the reasons Aman mentioned.

I think we can consider sum(string) as an overload of the standard sum function which should behave like sum(cast('text' as <numeric_type>))

May 02 '24 09:05 kasakrisz

Thank you both. I believe we strongly agree with some points.

We are not motivated to keep the current behaviors of the non-CBO mode. Those of CBO are rather preferred
We would like to accept texts as an argument of AVG or other UDFs here

Let me check AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) and CAST, and propose reasonable behaviors.

May 02 '24 13:05 okumin

This table illustrates the summary of the current behavior tested by https://github.com/apache/hive/pull/5091/commits/468ea05ea4dc5b5ff93e7417653fe663b098a4bf. I will add my proposal in the next comment.

	CBO	Non-CBO	Note
$SUM0(c_numeric)	228.0	228.0
$SUM0(CAST(c_numeric AS DOUBLE))	228.0	228.0
$SUM0(c_non_numeric)	0.0	0.0
$SUM0(CAST(c_non_numeric AS DOUBLE))	0.0	0.0
$SUM0(c_mix)	193.0	193.0
$SUM0(CAST(c_mix AS DOUBLE))	193.0	193.0
AVG(c_numeric)	28.5	28.5
AVG(CAST(c_numeric AS DOUBLE))	28.5	28.5
AVG(c_non_numeric)	0.0	NULL	Inconsistent
AVG(CAST(c_non_numeric AS DOUBLE))	NULL	NULL
AVG(c_mix)	24.125	32.167	Inconsistent
AVG(CAST(c_mix AS DOUBLE))	32.167	32.167
STDDEV_POP(c_numeric)	47.344	47.344
STDDEV_POP(CAST(c_numeric AS DOUBLE))	47.344	47.344
STDDEV_POP(c_non_numeric)	NULL	NULL
STDDEV_POP(CAST(c_non_numeric AS DOUBLE))	NULL	NULL
STDDEV_POP(c_mix)	48.401	53.524	Inconsistent
STDDEV_POP(CAST(c_mix AS DOUBLE))	53.524	53.524
STDDEV_SAMP(c_numeric)	50.613	50.613
STDDEV_SAMP(CAST(c_numeric AS DOUBLE))	50.613	50.613
STDDEV_SAMP(c_non_numeric)	NULL	NULL
STDDEV_SAMP(CAST(c_non_numeric AS DOUBLE))	NULL	NULL
STDDEV_SAMP(c_mix)	51.742	58.632	Inconsistent
STDDEV_SAMP(CAST(c_mix AS DOUBLE))	58.632	58.632
VAR_POP(c_numeric)	2241.5	2241.5
VAR_POP(CAST(c_numeric AS DOUBLE))	2241.5	2241.5
VAR_POP(c_non_numeric)	NULL	NULL
VAR_POP(CAST(c_non_numeric AS DOUBLE))	NULL	NULL
VAR_POP(c_mix)	2342.609	2864.806	Inconsistent
VAR_POP(CAST(c_mix AS DOUBLE))	2864.806	2864.806
VAR_SAMP(c_numeric)	2561.714	2561.714
VAR_SAMP(CAST(c_numeric AS DOUBLE))	2561.714	2561.714
VAR_SAMP(c_non_numeric)	NULL	NULL
VAR_SAMP(CAST(c_non_numeric AS DOUBLE))	NULL	NULL
VAR_SAMP(c_mix)	2677.268	3437.767	Inconsistent
VAR_SAMP(CAST(c_mix AS DOUBLE))	3437.767	3437.767
SUM(c_numeric)	228.0	228.0
SUM(CAST(c_numeric AS DOUBLE))	228.0	228.0
SUM(c_non_numeric)	0.0	0.0
SUM(CAST(c_non_numeric AS DOUBLE))	NULL	NULL
SUM(c_mix)	193.0	193.0
SUM(CAST(c_mix AS DOUBLE))	193.0	193.0
COUNT(c_numeric)	8	8
COUNT(CAST(c_numeric AS DOUBLE))	8	8
COUNT(c_non_numeric)	8	8
COUNT(CAST(c_non_numeric AS DOUBLE))	0	0
COUNT(c_mix)	8	8
COUNT(CAST(c_mix AS DOUBLE))	6	6

May 05 '24 16:05 okumin

Digest of our discussions

Let me clarify the requirements of HIVE-28082.

Any SQLs have to return the same results regardless of HiveAggregateReduceFunctionsRule
$SUM0/AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) have to accept non numeric types

We would like to satisfy the following conditions. We may need to compromise some of them if they conflict with each other.

(A) We would like to justify the CBO path if it is working differently from the non-CBO mode
(B) avg(x) should be equal to sum(x) / count(x)
(C) sum(c_string) should be equal to sum(cast(c_string as double))

Strange behaviors we've observed

As we see in the above table, we observe inconsistent results in the following cases.

	CBO	Non-CBO	Note
AVG(c_non_numeric)	0.0	NULL	Inconsistent
AVG(c_mix)	24.125	32.167	Inconsistent
STDDEV_POP(c_mix)	48.401	53.524	Inconsistent
STDDEV_SAMP(c_mix)	51.742	58.632	Inconsistent
VAR_POP(c_mix)	2342.609	2864.806	Inconsistent
VAR_SAMP(c_mix)	2677.268	3437.767	Inconsistent

My new findings

Checking the 6 test cases, The root cause is likely located in HiveAggregateReduceFunctionsRule. All UDAF should skip null values, including SUM or COUNT, they actually behave like that. The problem with non-numeric values happens because we don't take care of the arg type of COUNT. SUM(non_numeric_text) mostly[1] behaves in the same way as SUM(CAST(non_numeric_text AS DOUBLE)), but COUNT(non_numeric_text) never behaves so since non_numeric_text is totally valid for COUNT. In the case of AVG, the numerator, SUM, doesn't count illegal numbers but the denominator side, COUNT does[2].

I'm currently trying to explicitly cast the arg of COUNT so that SUM and COUNT share the same null-check semantics. It doesn't satisfy (A) but I believe it is reasonable because the current AVG(str) doesn't meet the semantics of average.

@kasakrisz @amansinha100 Please feel free to give your opinions to me if we have.

[1] I believe SUM should return NULL if all rows are evaluated as NULL, but it returns zero
[2] We can simply reproduce the problem with the following queries

> WITH a AS (SELECT 1.0 AS i UNION ALL SELECT NULL AS i UNION ALL SELECT 2.0 AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a;
+---------+------+------+
|   _c0   | _c1  | _c2  |
+---------+------+------+
| 1.5000  | 3    | 2    |
+---------+------+------+

> WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a;
+------+------+------+
| _c0  | _c1  | _c2  |
+------+------+------+
| 1.0  | 3.0  | 3    |
+------+------+------+

> WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(CAST(i AS DOUBLE)), SUM(CAST(i AS DOUBLE)), COUNT(CAST(i AS DOUBLE)) FROM a;
+------+------+------+
| _c0  | _c1  | _c2  |
+------+------+------+
| 1.5  | 3.0  | 2    |
+------+------+------+

May 06 '24 08:05 okumin

@okumin Thanks for the research and sharing the details of your results. I think these are the issues to solve:

as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this: https://github.com/apache/hive/blob/7950967eae9640fcc0aa22f4b6c7906b34281eac/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java#L443-L444
Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that count only counts values which can be converted to a numeric type.

Please add tests for sum(c_non_numeric) and sum(c_mix) if not exists.

May 10 '24 11:05 kasakrisz

@kasakrisz Thanks. I found cbo_rp_groupby3_noskew_multi_distinct.q fails and it turned out the root cause is not HiveAggregateReduceFunctionsRule but CBO Return Path + non-map-side aggregation. We may check the problem first.

https://github.com/apache/hive/pull/5245

Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that count only counts values which can be converted to a numeric type.

My PoC is here. We need #5245 . https://github.com/apache/hive/pull/5091/commits/da7507d3caa2150d8282412f801b8e9854aef293

as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this:

I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket? Literally, SUM is not a member accelerated by HiveAggregateReduceFunctionsRule. And if we add an explicit cast, at least all problems related to HiveAggregateReduceFunctionsRule will likely be gone. It is up to your convenience.

	CBO	Non-CBO	Note
SUM(c_non_numeric)	0.0	0.0
SUM(CAST(c_non_numeric AS DOUBLE))	NULL	NULL

May 13 '24 01:05 okumin

@okumin

I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket?

It is ok to file another jira to track this.

~~Sorry I'm confused. Is #5245 somehow related to this patch? Could you please clarify.~~ Found the change in cbo_rp_groupby3_noskew_multi_distinct.q

May 13 '24 12:05 kasakrisz

@okumin #5245 was meged to master. Could you please rebase this patch.

Jun 04 '24 07:06 kasakrisz

Rebased & followed your comments. Waiting for CI to finish... https://github.com/apache/hive/pull/5091/commits/e78e27bc593c20a312c76b7ba9c06b96176bc647

Jun 04 '24 11:06 okumin

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

Jun 04 '24 13:06 sonarqubecloud[bot]

CI passed.

as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this:

BTW, I filed another ticket for this change. https://issues.apache.org/jira/browse/HIVE-28302

Jun 05 '24 06:06 okumin

HIVE-28082: HiveAggregateReduceFunctionsRule could generate an inconsistent result

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

Quality Gate passed

Digest of our discussions

Strange behaviors we've observed

My new findings

Quality Gate passed