hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-28082: HiveAggregateReduceFunctionsRule could generate an inconsistent result

Open okumin opened this issue 2 years ago • 1 comments

What changes were proposed in this pull request?

This PR would disable HiveAggregateReduceFunctionsRule when the type of argument of AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) is a string-like type.

https://issues.apache.org/jira/browse/HIVE-28082

Why are the changes needed?

While testing Hive 4, we found the behavior of those UDFs changed since Hive 2. We identified HiveAggregateReduceFunctionsRule as the root cause. We believe Calcite rules should not change the original semantics.

This PR would disable the rule when it could cause inconsistency. That's because it is not simple to decompose AVG(str) into SUM(str) and COUNT(*), or correctly transform STDDEV_* or VAR_*.

Does this PR introduce any user-facing change?

The result could change if a user uses the UDFs.

Is the change a dependency upgrade?

No

How was this patch tested?

I added/updated test cases. You can see the result of the current master branch through the first commit to add a new test case. Results change with or without CBO.

okumin avatar Feb 20 '24 03:02 okumin

Quality Gate Passed Quality Gate passed

Issues
2 New issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

sonarqubecloud[bot] avatar Feb 21 '24 09:02 sonarqubecloud[bot]

@kasakrisz Thanks for checking. I double-checked the SQL:2023 Part 2.

(A) 10.9 <aggregate function> -> Syntax Rules -> 7) -> g) mentions the specification of SUM and AVG. It says the type of value expression shall be numeric. So, if we strictly adhere to the standard, it could be best to throw an exception when a string type is specified.

(B) If we allow implicit cast here(I've not found the rule justifying it), it is reasonable to follow the rules of CAST. Based on 6.13 <cast specification> -> General Rules -> 8) -> b), it should throw an exception.

(C) If the implicit numeric cast of text can be null here(though it sounds like we accumulate too many assumptions, it can keep the current behavior...), based on 10.9 <aggregate function> -> General Rules -> 7) -> a), we should eliminate null values with warning. If the result set is empty, based on 10.9 <aggregate function> -> General Rules -> 7) -> d) -> ⅱ), it should return NULL.

So, if (A) we follow the standard strictly or (B) we accept implicit cast, SUM and AVG should fail. If we justify (C) the null conversion, avg('text') and sum('text') should return NULL. NULL / 1 should be NULL based on 6.30 <numeric value expression> -> General Rules -> 1). So, in this case, all the cases, avg('text'), sum('text'), sum('text') / count('text') should be null with raising a warning.

In summary, if we accept a breaking change, I would say either of the following ways sounds consistent. What do you think?

  • We no longer accept the string type for AVG or SUM. It leads to a huge incompatibility, but we can follow the standard
  • We will make the UDFs skip invalid texts and SUM('text') return NULL. Maybe we will fix GenericUDAFSum.

okumin avatar Apr 18 '24 15:04 okumin

Chiming in late on this topic .. I think that the CBO path should be the main focus since it has been the default for quite some time now - with Hive 3 and Hive 4. In this path, to avoid breaking change, we should not throw an exception for the string argument type for the aggregate function even though the SQL standard suggests otherwise. In an ETL operation, there could be one row out of millions that has 'dirty' data with string type but Hive has been permissive for such data, otherwise the whole job would fail. If it means that we should let SUM('text') return NULL (similar to AVG('text')), it seems a better option than erroring out.

amansinha100 avatar Apr 30 '24 16:04 amansinha100

We will make the UDFs skip invalid texts and SUM('text') return NULL. Maybe we will fix GenericUDAFSum.

Yes, it seems reasonable because Hive doesn't follow the standard in the case of cast function either: Cast returns NULL instead of throwing exception due to the reasons Aman mentioned.

I think we can consider sum(string) as an overload of the standard sum function which should behave like sum(cast('text' as <numeric_type>))

kasakrisz avatar May 02 '24 09:05 kasakrisz

Thank you both. I believe we strongly agree with some points.

  • We are not motivated to keep the current behaviors of the non-CBO mode. Those of CBO are rather preferred
  • We would like to accept texts as an argument of AVG or other UDFs here

Let me check AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) and CAST, and propose reasonable behaviors.

okumin avatar May 02 '24 13:05 okumin

This table illustrates the summary of the current behavior tested by https://github.com/apache/hive/pull/5091/commits/468ea05ea4dc5b5ff93e7417653fe663b098a4bf. I will add my proposal in the next comment.

CBO Non-CBO Note
$SUM0(c_numeric) 228.0 228.0
$SUM0(CAST(c_numeric AS DOUBLE)) 228.0 228.0
$SUM0(c_non_numeric) 0.0 0.0
$SUM0(CAST(c_non_numeric AS DOUBLE)) 0.0 0.0
$SUM0(c_mix) 193.0 193.0
$SUM0(CAST(c_mix AS DOUBLE)) 193.0 193.0
AVG(c_numeric) 28.5 28.5
AVG(CAST(c_numeric AS DOUBLE)) 28.5 28.5
AVG(c_non_numeric) 0.0 NULL Inconsistent
AVG(CAST(c_non_numeric AS DOUBLE)) NULL NULL
AVG(c_mix) 24.125 32.167 Inconsistent
AVG(CAST(c_mix AS DOUBLE)) 32.167 32.167
STDDEV_POP(c_numeric) 47.344 47.344
STDDEV_POP(CAST(c_numeric AS DOUBLE)) 47.344 47.344
STDDEV_POP(c_non_numeric) NULL NULL
STDDEV_POP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
STDDEV_POP(c_mix) 48.401 53.524 Inconsistent
STDDEV_POP(CAST(c_mix AS DOUBLE)) 53.524 53.524
STDDEV_SAMP(c_numeric) 50.613 50.613
STDDEV_SAMP(CAST(c_numeric AS DOUBLE)) 50.613 50.613
STDDEV_SAMP(c_non_numeric) NULL NULL
STDDEV_SAMP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
STDDEV_SAMP(c_mix) 51.742 58.632 Inconsistent
STDDEV_SAMP(CAST(c_mix AS DOUBLE)) 58.632 58.632
VAR_POP(c_numeric) 2241.5 2241.5
VAR_POP(CAST(c_numeric AS DOUBLE)) 2241.5 2241.5
VAR_POP(c_non_numeric) NULL NULL
VAR_POP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
VAR_POP(c_mix) 2342.609 2864.806 Inconsistent
VAR_POP(CAST(c_mix AS DOUBLE)) 2864.806 2864.806
VAR_SAMP(c_numeric) 2561.714 2561.714
VAR_SAMP(CAST(c_numeric AS DOUBLE)) 2561.714 2561.714
VAR_SAMP(c_non_numeric) NULL NULL
VAR_SAMP(CAST(c_non_numeric AS DOUBLE)) NULL NULL
VAR_SAMP(c_mix) 2677.268 3437.767 Inconsistent
VAR_SAMP(CAST(c_mix AS DOUBLE)) 3437.767 3437.767
SUM(c_numeric) 228.0 228.0
SUM(CAST(c_numeric AS DOUBLE)) 228.0 228.0
SUM(c_non_numeric) 0.0 0.0
SUM(CAST(c_non_numeric AS DOUBLE)) NULL NULL
SUM(c_mix) 193.0 193.0
SUM(CAST(c_mix AS DOUBLE)) 193.0 193.0
COUNT(c_numeric) 8 8
COUNT(CAST(c_numeric AS DOUBLE)) 8 8
COUNT(c_non_numeric) 8 8
COUNT(CAST(c_non_numeric AS DOUBLE)) 0 0
COUNT(c_mix) 8 8
COUNT(CAST(c_mix AS DOUBLE)) 6 6

okumin avatar May 05 '24 16:05 okumin

Digest of our discussions

Let me clarify the requirements of HIVE-28082.

  • Any SQLs have to return the same results regardless of HiveAggregateReduceFunctionsRule
  • $SUM0/AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) have to accept non numeric types

We would like to satisfy the following conditions. We may need to compromise some of them if they conflict with each other.

  • (A) We would like to justify the CBO path if it is working differently from the non-CBO mode
  • (B) avg(x) should be equal to sum(x) / count(x)
  • (C) sum(c_string) should be equal to sum(cast(c_string as double))

Strange behaviors we've observed

As we see in the above table, we observe inconsistent results in the following cases.

CBO Non-CBO Note
AVG(c_non_numeric) 0.0 NULL Inconsistent
AVG(c_mix) 24.125 32.167 Inconsistent
STDDEV_POP(c_mix) 48.401 53.524 Inconsistent
STDDEV_SAMP(c_mix) 51.742 58.632 Inconsistent
VAR_POP(c_mix) 2342.609 2864.806 Inconsistent
VAR_SAMP(c_mix) 2677.268 3437.767 Inconsistent

My new findings

Checking the 6 test cases, The root cause is likely located in HiveAggregateReduceFunctionsRule. All UDAF should skip null values, including SUM or COUNT, they actually behave like that. The problem with non-numeric values happens because we don't take care of the arg type of COUNT. SUM(non_numeric_text) mostly[1] behaves in the same way as SUM(CAST(non_numeric_text AS DOUBLE)), but COUNT(non_numeric_text) never behaves so since non_numeric_text is totally valid for COUNT. In the case of AVG, the numerator, SUM, doesn't count illegal numbers but the denominator side, COUNT does[2].

I'm currently trying to explicitly cast the arg of COUNT so that SUM and COUNT share the same null-check semantics. It doesn't satisfy (A) but I believe it is reasonable because the current AVG(str) doesn't meet the semantics of average.

@kasakrisz @amansinha100 Please feel free to give your opinions to me if we have.

  • [1] I believe SUM should return NULL if all rows are evaluated as NULL, but it returns zero
  • [2] We can simply reproduce the problem with the following queries
> WITH a AS (SELECT 1.0 AS i UNION ALL SELECT NULL AS i UNION ALL SELECT 2.0 AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a;
+---------+------+------+
|   _c0   | _c1  | _c2  |
+---------+------+------+
| 1.5000  | 3    | 2    |
+---------+------+------+

> WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a;
+------+------+------+
| _c0  | _c1  | _c2  |
+------+------+------+
| 1.0  | 3.0  | 3    |
+------+------+------+

> WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(CAST(i AS DOUBLE)), SUM(CAST(i AS DOUBLE)), COUNT(CAST(i AS DOUBLE)) FROM a;
+------+------+------+
| _c0  | _c1  | _c2  |
+------+------+------+
| 1.5  | 3.0  | 2    |
+------+------+------+

okumin avatar May 06 '24 08:05 okumin

@okumin Thanks for the research and sharing the details of your results. I think these are the issues to solve:

  • as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this: https://github.com/apache/hive/blob/7950967eae9640fcc0aa22f4b6c7906b34281eac/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java#L443-L444
  • Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that count only counts values which can be converted to a numeric type.

Please add tests for sum(c_non_numeric) and sum(c_mix) if not exists.

kasakrisz avatar May 10 '24 11:05 kasakrisz

@kasakrisz Thanks. I found cbo_rp_groupby3_noskew_multi_distinct.q fails and it turned out the root cause is not HiveAggregateReduceFunctionsRule but CBO Return Path + non-map-side aggregation. We may check the problem first.

https://github.com/apache/hive/pull/5245

Add an explicit cast in case the parameter of the aggregates mentioned has type from the character family to make sure that count only counts values which can be converted to a numeric type.

My PoC is here. We need #5245 . https://github.com/apache/hive/pull/5091/commits/da7507d3caa2150d8282412f801b8e9854aef293

as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this:

I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket? Literally, SUM is not a member accelerated by HiveAggregateReduceFunctionsRule. And if we add an explicit cast, at least all problems related to HiveAggregateReduceFunctionsRule will likely be gone. It is up to your convenience.

CBO Non-CBO Note
SUM(c_non_numeric) 0.0 0.0
SUM(CAST(c_non_numeric AS DOUBLE)) NULL NULL

okumin avatar May 13 '24 01:05 okumin

@okumin

I believe you mentioned how to correct the following semantics. I will try. Can I resolve it in this ticket?

It is ok to file another jira to track this.

~~Sorry I'm confused. Is #5245 somehow related to this patch? Could you please clarify.~~ Found the change in cbo_rp_groupby3_noskew_multi_distinct.q

kasakrisz avatar May 13 '24 12:05 kasakrisz

@okumin #5245 was meged to master. Could you please rebase this patch.

kasakrisz avatar Jun 04 '24 07:06 kasakrisz

Rebased & followed your comments. Waiting for CI to finish... https://github.com/apache/hive/pull/5091/commits/e78e27bc593c20a312c76b7ba9c06b96176bc647

okumin avatar Jun 04 '24 11:06 okumin

Quality Gate Passed Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

sonarqubecloud[bot] avatar Jun 04 '24 13:06 sonarqubecloud[bot]

CI passed.

as you mentioned: "I believe SUM should return NULL if all rows are evaluated as NULL" on both CBO and non-CBO path. Swapping these lines may help to solve this:

BTW, I filed another ticket for this change. https://issues.apache.org/jira/browse/HIVE-28302

okumin avatar Jun 05 '24 06:06 okumin