presto Fix min and max for inputs that include NaN values

Description

min and max used to return NaN if the very first value in the input was NaN, but they ignored NaN values otherwise.

After this change min and max return NaN only if all inputs are NaN.

Motivation and Context

Fixes #21877

Impact

Ensure correct value of max/min() not being affected by order of NaNs

Test Plan

Run ./mvnw clean install in current PR
Execute following snippet

select max(x) from (values 4.0,nan(),null) T(x);
select max(x) from (values nan(),4,null) T(x);

Results in

presto> select max(x) from (values 4.0,nan(),null) T(x);
 _col0 
-------
   4.0 
(1 row)

Query 20240209_123313_00001_skyr3, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 172ms, server-side: 143ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto> select max(x) from (values nan(),4,null) T(x);
 _col0 
-------
   4.0 
(1 row)

Query 20240209_123324_00002_skyr3, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 64ms, server-side: 47ms] [0 rows, 0B] [0 rows/s, 0B/s]

Contributor checklist

[x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
[x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
[ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
[ ] If release notes are required, they follow the release notes guidelines.
[ ] Adequate tests were added if applicable.
[x] CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Fixes issue where NaN value changes final result of aggregation max/min function if given as first item argument.

Feb 09 '24 12:02 hainenber

Can you add some unit tests as well?

Sure, let me add one!

Feb 11 '24 03:02 hainenber

Commits squashed and adhere to commit message guideline :D

Feb 12 '24 04:02 hainenber

@hainenber @tdcmeehan @mbasmanova This behavior is inconsistent with array_min/array_max which both return NaN when found at least one NaN element. Do we want to keep max and array_max inconsistent? Or do we want to make them behave in the same way (either/or)? Make sense to double-check min_by/max_by too.

Feb 14 '24 23:02 spershin

@spershin as mentioned in the issue, in my opinion that's a separate problem. I think it requires discussion and thought on how to make them consistent, and this fix can be considered separately.

Feb 14 '24 23:02 tdcmeehan

There's an existing issue for that https://github.com/prestodb/presto/issues/21065

Feb 14 '24 23:02 tdcmeehan

@tdcmeehan

We are changing the behavior of min/max here. Would it not be wise to hold on for a while, discuss the min/max over double issue as a whole and then roll out a change? To, potentially, disturb users less?

Feb 14 '24 23:02 spershin

@spershin if we were to document this function better, it would be written something along these lines:

Returns the maximum value of all input values. For double and float types, NaN values are ignored, except if the first value is NaN, then the function will always return NaN.

This behavior seems almost ridiculous to document, and very counterintuitive. When I look at it through a risk/reward lense, it seems there is more harm to leave this behavior as it is because it is so unintuitive and difficult to document, and my worry is if we tie it to the larger question of how do we consistently handle NaN values in our aggregation functions, then it's going to hold up a relatively straightforward bug fix. If we merge this change, we have a relatively easy way to document this function, and can still consider larger consistency issues with NaN. But I am curious if you think differently on the risk/reward for this smaller change.

Feb 15 '24 00:02 tdcmeehan

@tdcmeehan

Looking from the position of "how will it affect existing users and their workloads" when we make a change like this. Pipelines, dashboards and UIs can potentially start producing very different results with each such change and it can have disruptions for users/clients even if previously it wasn't quite right.

I don't really think much about the documentation, probably because I believe that not many users went and studied the exact behavior of min/max in the manual in the 1st place. Because the function is pretty simple and unless you scrutinize something you won't need that info.

I don't have a strong opinion on this topic. Simply tried to project a set of opinions I'm encountering all the time when it comes to "should we change that part of Presto, because it seems wrong" question.

Feb 15 '24 00:02 spershin

I'm seeing that there are 2 distinct issues with handling NaNs. (1) different functions handle NaNs differently (for example, min vs. array_min); (2) some functions handle NaNs inconsistently (min and max return different results depending on the order of inputs).

It might be hard to fix (1), but we should be able to fix (2). My impression is that this PR fixes (2) for min and max. BTW, there is a similar issue in min_by and max_by.

@hainenber It looks like you are fixing min and max to always ignore NaN values in the input. If that's the case, let's state this clearly in the PR description and commit message and update documentation at presto-docs/src/main/sphinx/functions/aggregate.rst

Feb 15 '24 12:02 mbasmanova

I couldn't find anything in the SQL standard that talks about treatment of NaN in aggregates but I agree with comments above that at the very least the behavior should be consistent across various aggregates and built-in functions, including array aggs and least/greatest

Feb 15 '24 23:02 mlyublena

I couldn't find anything in the SQL standard that talks about treatment of NaN in aggregates but I agree with comments above that at the very least the behavior should be consistent across various aggregates and built-in functions, including array aggs and least/greatest

Also, most systems I checked with the exception of BigQuery don't specify what behavior is followed wrt NaN (they only do so for NULLs). If we ignore NaN, it makes it consistent with NULL-value behavior. For Bigquery, they return NaN if any of the values is NaN: https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#min

@hainenber , have you checked what other systems do with NaNs? Postgres, MySQL, SqlServer etc?

Feb 15 '24 23:02 mlyublena

I couldn't find anything in the SQL standard that talks about treatment of NaN in aggregates but I agree with comments above that at the very least the behavior should be consistent across various aggregates and built-in functions, including array aggs and least/greatest

Also, most systems I checked with the exception of BigQuery don't specify what behavior is followed wrt NaN (they only do so for NULLs). If we ignore NaN, it makes it consistent with NULL-value behavior. For Bigquery, they return NaN if any of the values is NaN: https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#min

@hainenber , have you checked what other systems do with NaNs? Postgres, MySQL, SqlServer etc?

This is what I've got after few rounds of searching and LLM'ing

PostgreSQL treats NaN values as equal, and greater than all non-NaN values. (https://www.postgresql.org/docs/current/datatype-numeric.html)
MySQL and SQL Server converts NaN as Null and perform comparison accordingly.

Feb 16 '24 01:02 hainenber

Codenotify: Notifying subscribers in CODENOTIFY files for diff 75ebaf15f176a0b1b4a71074d249ae5f350e859e...9d97932826ccbba45ae86202943a4a82cde92387.

Notify	File(s)
@steveburnett	presto-docs/src/main/sphinx/functions/aggregate.rst

Feb 16 '24 01:02 github-actions[bot]

Done updating aggregate.rst for min and max function. I also clarified the solution in the PR message/description.

I also added tests for positive/negative infinity and null as requested.

Feb 16 '24 01:02 hainenber

Hi there, this PR is free for other folks taking over. Thanks for your effort reviewing previously!

Mar 08 '24 16:03 hainenber

I see the logic behind this change, as it seems to achieve what the original function was trying to do. However, I agree with @spershin that we should wait to fix the behavior for min and max until we have a plan for how we want to handle NaNs across the board. This is a long standing issue, and given that we expect to have a plan for NaNs imminently, it doesn't seem worth changing the behavior for these functions 2x. Also, the behavior here almost definitely will not be the final behavior we want for these function once we have consistent handling for NaNs. (we would either consider nans as largest or would return nan if there are any nans in the input).

Mar 08 '24 20:03 rschlussel

@rschlussel while I agree with your sentiment, I still find this bug fix to be a relatively minor change in scope and an obvious bug fix. Just in case it wasn't clear, consider once again what it would take to document the existing behavior:

Returns the maximum value of all input values. For double and float types, NaN values are ignored, except if the first value is NaN, then the function will always return NaN.

After this change, the documentation would look like this:

Returns the maximum value of all input values. When x is a REAL or DOUBLE, returns NaN if all input values are NaN. Ignores NaN values otherwise.

Don't you find this to be a small, sensible change that may improve the experience of Presto users, that doesn't hold up our larger fixes related to NaN handling?

Mar 11 '24 14:03 tdcmeehan

That's fair. Yes, I agree the old behavior is very clearly a bug. And given that someone has taken the time to fix it, you're right, we should merge it. However, i think it's also true that in practice we haven't been providing any guarantees to our users regarding treatment of NaNs, and I don't think it's worth investing in fixing other similar bugs until we move toward a more cohesive solution.

Mar 11 '24 14:03 rschlussel

@rschlussel should this be closed in favor of #22386?

Jun 05 '24 14:06 tdcmeehan

yes, let's close this one

Jun 05 '24 14:06 rschlussel

presto presto copied to clipboard

Fix min and max for inputs that include NaN values

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

presto
presto copied to clipboard