HIVE-28082: HiveAggregateReduceFunctionsRule could generate an inconsistent result
What changes were proposed in this pull request?
This PR would disable HiveAggregateReduceFunctionsRule when the type of argument of AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) is a string-like type.
https://issues.apache.org/jira/browse/HIVE-28082
Why are the changes needed?
While testing Hive 4, we found the behavior of those UDFs changed since Hive 2. We identified HiveAggregateReduceFunctionsRule as the root cause. We believe Calcite rules should not change the original semantics.
This PR would disable the rule when it could cause inconsistency. That's because it is not simple to decompose AVG(str) into SUM(str) and COUNT(*), or correctly transform STDDEV_* or VAR_*.
Does this PR introduce any user-facing change?
The result could change if a user uses the UDFs.
Is the change a dependency upgrade?
No
How was this patch tested?
I added/updated test cases. You can see the result of the current master branch through the first commit to add a new test case. Results change with or without CBO.
Quality Gate passed
Issues
2 New issues
Measures
0 Security Hotspots
No data about Coverage
No data about Duplication
@kasakrisz Thanks for checking. I double-checked the SQL:2023 Part 2.
(A) 10.9 <aggregate function> -> Syntax Rules -> 7) -> g) mentions the specification of SUM and AVG. It says the type of value expression shall be numeric. So, if we strictly adhere to the standard, it could be best to throw an exception when a string type is specified.
(B) If we allow implicit cast here(I've not found the rule justifying it), it is reasonable to follow the rules of CAST. Based on 6.13 <cast specification> -> General Rules -> 8) -> b), it should throw an exception.
(C) If the implicit numeric cast of text can be null here(though it sounds like we accumulate too many assumptions, it can keep the current behavior...), based on 10.9 <aggregate function> -> General Rules -> 7) -> a), we should eliminate null values with warning. If the result set is empty, based on 10.9 <aggregate function> -> General Rules -> 7) -> d) -> ⅱ), it should return NULL.
So, if (A) we follow the standard strictly or (B) we accept implicit cast, SUM and AVG should fail.
If we justify (C) the null conversion, avg('text') and sum('text') should return NULL. NULL / 1 should be NULL based on 6.30 <numeric value expression> -> General Rules -> 1). So, in this case, all the cases, avg('text'), sum('text'), sum('text') / count('text') should be null with raising a warning.
In summary, if we accept a breaking change, I would say either of the following ways sounds consistent. What do you think?
- We no longer accept the string type for
AVGorSUM. It leads to a huge incompatibility, but we can follow the standard - We will make the UDFs skip invalid texts and
SUM('text')return NULL. Maybe we will fixGenericUDAFSum.
Chiming in late on this topic .. I think that the CBO path should be the main focus since it has been the default for quite some time now - with Hive 3 and Hive 4. In this path, to avoid breaking change, we should not throw an exception for the string argument type for the aggregate function even though the SQL standard suggests otherwise. In an ETL operation, there could be one row out of millions that has 'dirty' data with string type but Hive has been permissive for such data, otherwise the whole job would fail. If it means that we should let SUM('text') return NULL (similar to AVG('text')), it seems a better option than erroring out.
We will make the UDFs skip invalid texts and SUM('text') return NULL. Maybe we will fix GenericUDAFSum.
Yes, it seems reasonable because Hive doesn't follow the standard in the case of cast function either: Cast returns NULL instead of throwing exception due to the reasons Aman mentioned.
I think we can consider sum(string) as an overload of the standard sum function which should behave like sum(cast('text' as <numeric_type>))
Thank you both. I believe we strongly agree with some points.
- We are not motivated to keep the current behaviors of the non-CBO mode. Those of CBO are rather preferred
- We would like to accept texts as an argument of AVG or other UDFs here
Let me check AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) and CAST, and propose reasonable behaviors.
This table illustrates the summary of the current behavior tested by https://github.com/apache/hive/pull/5091/commits/468ea05ea4dc5b5ff93e7417653fe663b098a4bf. I will add my proposal in the next comment.
| CBO | Non-CBO | Note | |
|---|---|---|---|
| $SUM0(c_numeric) | 228.0 | 228.0 | |
| $SUM0(CAST(c_numeric AS DOUBLE)) | 228.0 | 228.0 | |
| $SUM0(c_non_numeric) | 0.0 | 0.0 | |
| $SUM0(CAST(c_non_numeric AS DOUBLE)) | 0.0 | 0.0 | |
| $SUM0(c_mix) | 193.0 | 193.0 | |
| $SUM0(CAST(c_mix AS DOUBLE)) | 193.0 | 193.0 | |
| AVG(c_numeric) | 28.5 | 28.5 | |
| AVG(CAST(c_numeric AS DOUBLE)) | 28.5 | 28.5 | |
| AVG(c_non_numeric) | 0.0 | NULL | Inconsistent |
| AVG(CAST(c_non_numeric AS DOUBLE)) | NULL | NULL | |
| AVG(c_mix) | 24.125 | 32.167 | Inconsistent |
| AVG(CAST(c_mix AS DOUBLE)) | 32.167 | 32.167 | |
| STDDEV_POP(c_numeric) | 47.344 | 47.344 | |
| STDDEV_POP(CAST(c_numeric AS DOUBLE)) | 47.344 | 47.344 | |
| STDDEV_POP(c_non_numeric) | NULL | NULL | |
| STDDEV_POP(CAST(c_non_numeric AS DOUBLE)) | NULL | NULL | |
| STDDEV_POP(c_mix) | 48.401 | 53.524 | Inconsistent |
| STDDEV_POP(CAST(c_mix AS DOUBLE)) | 53.524 | 53.524 | |
| STDDEV_SAMP(c_numeric) | 50.613 | 50.613 | |
| STDDEV_SAMP(CAST(c_numeric AS DOUBLE)) | 50.613 | 50.613 | |
| STDDEV_SAMP(c_non_numeric) | NULL | NULL | |
| STDDEV_SAMP(CAST(c_non_numeric AS DOUBLE)) | NULL | NULL | |
| STDDEV_SAMP(c_mix) | 51.742 | 58.632 | Inconsistent |
| STDDEV_SAMP(CAST(c_mix AS DOUBLE)) | 58.632 | 58.632 | |
| VAR_POP(c_numeric) | 2241.5 | 2241.5 | |
| VAR_POP(CAST(c_numeric AS DOUBLE)) | 2241.5 | 2241.5 | |
| VAR_POP(c_non_numeric) | NULL | NULL | |
| VAR_POP(CAST(c_non_numeric AS DOUBLE)) | NULL | NULL | |
| VAR_POP(c_mix) | 2342.609 | 2864.806 | Inconsistent |
| VAR_POP(CAST(c_mix AS DOUBLE)) | 2864.806 | 2864.806 | |
| VAR_SAMP(c_numeric) | 2561.714 | 2561.714 | |
| VAR_SAMP(CAST(c_numeric AS DOUBLE)) | 2561.714 | 2561.714 | |
| VAR_SAMP(c_non_numeric) | NULL | NULL | |
| VAR_SAMP(CAST(c_non_numeric AS DOUBLE)) | NULL | NULL | |
| VAR_SAMP(c_mix) | 2677.268 | 3437.767 | Inconsistent |
| VAR_SAMP(CAST(c_mix AS DOUBLE)) | 3437.767 | 3437.767 | |
| SUM(c_numeric) | 228.0 | 228.0 | |
| SUM(CAST(c_numeric AS DOUBLE)) | 228.0 | 228.0 | |
| SUM(c_non_numeric) | 0.0 | 0.0 | |
| SUM(CAST(c_non_numeric AS DOUBLE)) | NULL | NULL | |
| SUM(c_mix) | 193.0 | 193.0 | |
| SUM(CAST(c_mix AS DOUBLE)) | 193.0 | 193.0 | |
| COUNT(c_numeric) | 8 | 8 | |
| COUNT(CAST(c_numeric AS DOUBLE)) | 8 | 8 | |
| COUNT(c_non_numeric) | 8 | 8 | |
| COUNT(CAST(c_non_numeric AS DOUBLE)) | 0 | 0 | |
| COUNT(c_mix) | 8 | 8 | |
| COUNT(CAST(c_mix AS DOUBLE)) | 6 | 6 |
Digest of our discussions
Let me clarify the requirements of HIVE-28082.
- Any SQLs have to return the same results regardless of HiveAggregateReduceFunctionsRule
- $SUM0/AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) have to accept non numeric types
We would like to satisfy the following conditions. We may need to compromise some of them if they conflict with each other.
- (A) We would like to justify the CBO path if it is working differently from the non-CBO mode
- (B)
avg(x)should be equal tosum(x) / count(x) - (C)
sum(c_string)should be equal tosum(cast(c_string as double))
Strange behaviors we've observed
As we see in the above table, we observe inconsistent results in the following cases.
| CBO | Non-CBO | Note | |
|---|---|---|---|
| AVG(c_non_numeric) | 0.0 | NULL | Inconsistent |
| AVG(c_mix) | 24.125 | 32.167 | Inconsistent |
| STDDEV_POP(c_mix) | 48.401 | 53.524 | Inconsistent |
| STDDEV_SAMP(c_mix) | 51.742 | 58.632 | Inconsistent |
| VAR_POP(c_mix) | 2342.609 | 2864.806 | Inconsistent |
| VAR_SAMP(c_mix) | 2677.268 | 3437.767 | Inconsistent |
My new findings
Checking the 6 test cases, The root cause is likely located in HiveAggregateReduceFunctionsRule. All UDAF should skip null values, including SUM or COUNT, they actually behave like that.
The problem with non-numeric values happens because we don't take care of the arg type of COUNT. SUM(non_numeric_text) mostly[1] behaves in the same way as SUM(CAST(non_numeric_text AS DOUBLE)), but COUNT(non_numeric_text) never behaves so since non_numeric_text is totally valid for COUNT. In the case of AVG, the numerator, SUM, doesn't count illegal numbers but the denominator side, COUNT does[2].
I'm currently trying to explicitly cast the arg of COUNT so that SUM and COUNT share the same null-check semantics. It doesn't satisfy (A) but I believe it is reasonable because the current AVG(str) doesn't meet the semantics of average.
@kasakrisz @amansinha100 Please feel free to give your opinions to me if we have.
- [1] I believe
SUMshould returnNULLif all rows are evaluated asNULL, but it returns zero - [2] We can simply reproduce the problem with the following queries
> WITH a AS (SELECT 1.0 AS i UNION ALL SELECT NULL AS i UNION ALL SELECT 2.0 AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a;
+---------+------+------+
| _c0 | _c1 | _c2 |
+---------+------+------+
| 1.5000 | 3 | 2 |
+---------+------+------+
> WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a;
+------+------+------+
| _c0 | _c1 | _c2 |
+------+------+------+
| 1.0 | 3.0 | 3 |
+------+------+------+
> WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(CAST(i AS DOUBLE)), SUM(CAST(i AS DOUBLE)), COUNT(CAST(i AS DOUBLE)) FROM a;
+------+------+------+
| _c0 | _c1 | _c2 |
+------+------+------+
| 1.5 | 3.0 | 2 |
+------+------+------+
@okumin Thanks for the research and sharing the details of your results. I think these are the issues to solve:
- as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this: https://github.com/apache/hive/blob/7950967eae9640fcc0aa22f4b6c7906b34281eac/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java#L443-L444
- Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that
countonly counts values which can be converted to a numeric type.
Please add tests for sum(c_non_numeric) and sum(c_mix) if not exists.
@kasakrisz Thanks. I found cbo_rp_groupby3_noskew_multi_distinct.q fails and it turned out the root cause is not HiveAggregateReduceFunctionsRule but CBO Return Path + non-map-side aggregation. We may check the problem first.
https://github.com/apache/hive/pull/5245
Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that count only counts values which can be converted to a numeric type.
My PoC is here. We need #5245 . https://github.com/apache/hive/pull/5091/commits/da7507d3caa2150d8282412f801b8e9854aef293
as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this:
I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket? Literally, SUM is not a member accelerated by HiveAggregateReduceFunctionsRule. And if we add an explicit cast, at least all problems related to HiveAggregateReduceFunctionsRule will likely be gone. It is up to your convenience.
| CBO | Non-CBO | Note | |
|---|---|---|---|
| SUM(c_non_numeric) | 0.0 | 0.0 | |
| SUM(CAST(c_non_numeric AS DOUBLE)) | NULL | NULL |
@okumin
I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket?
It is ok to file another jira to track this.
~~Sorry I'm confused. Is #5245 somehow related to this patch? Could you please clarify.~~
Found the change in cbo_rp_groupby3_noskew_multi_distinct.q
@okumin #5245 was meged to master. Could you please rebase this patch.
Rebased & followed your comments. Waiting for CI to finish... https://github.com/apache/hive/pull/5091/commits/e78e27bc593c20a312c76b7ba9c06b96176bc647
Quality Gate passed
Issues
1 New issue
0 Accepted issues
Measures
0 Security Hotspots
No data about Coverage
No data about Duplication
CI passed.
as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this:
BTW, I filed another ticket for this change. https://issues.apache.org/jira/browse/HIVE-28302