presto
presto copied to clipboard
Fix min and max for inputs that include NaN values
Description
min and max used to return NaN if the very first value in the input was NaN, but they ignored NaN values otherwise.
After this change min and max return NaN only if all inputs are NaN.
Motivation and Context
Fixes #21877
Impact
Ensure correct value of max/min()
not being affected by order of NaNs
Test Plan
- Run
./mvnw clean install
in current PR - Execute following snippet
select max(x) from (values 4.0,nan(),null) T(x);
select max(x) from (values nan(),4,null) T(x);
- Results in
presto> select max(x) from (values 4.0,nan(),null) T(x);
_col0
-------
4.0
(1 row)
Query 20240209_123313_00001_skyr3, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 172ms, server-side: 143ms] [0 rows, 0B] [0 rows/s, 0B/s]
presto> select max(x) from (values nan(),4,null) T(x);
_col0
-------
4.0
(1 row)
Query 20240209_123324_00002_skyr3, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 64ms, server-side: 47ms] [0 rows, 0B] [0 rows/s, 0B/s]
Contributor checklist
- [x] Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the release notes guidelines.
- [ ] Adequate tests were added if applicable.
- [x] CI passed.
Release Notes
Please follow release notes guidelines and fill in the release notes below.
== RELEASE NOTES ==
General Changes
* Fixes issue where NaN value changes final result of aggregation max/min function if given as first item argument.
Can you add some unit tests as well?
Sure, let me add one!
Commits squashed and adhere to commit message guideline :D
@hainenber @tdcmeehan @mbasmanova This behavior is inconsistent with array_min/array_max which both return NaN when found at least one NaN element. Do we want to keep max and array_max inconsistent? Or do we want to make them behave in the same way (either/or)? Make sense to double-check min_by/max_by too.
@spershin as mentioned in the issue, in my opinion that's a separate problem. I think it requires discussion and thought on how to make them consistent, and this fix can be considered separately.
There's an existing issue for that https://github.com/prestodb/presto/issues/21065
@tdcmeehan
We are changing the behavior of min/max here. Would it not be wise to hold on for a while, discuss the min/max over double issue as a whole and then roll out a change? To, potentially, disturb users less?
@spershin if we were to document this function better, it would be written something along these lines:
Returns the maximum value of all input values. For double and float types, NaN values are ignored, except if the first value is NaN, then the function will always return NaN.
This behavior seems almost ridiculous to document, and very counterintuitive. When I look at it through a risk/reward lense, it seems there is more harm to leave this behavior as it is because it is so unintuitive and difficult to document, and my worry is if we tie it to the larger question of how do we consistently handle NaN values in our aggregation functions, then it's going to hold up a relatively straightforward bug fix. If we merge this change, we have a relatively easy way to document this function, and can still consider larger consistency issues with NaN. But I am curious if you think differently on the risk/reward for this smaller change.
@tdcmeehan
Looking from the position of "how will it affect existing users and their workloads" when we make a change like this. Pipelines, dashboards and UIs can potentially start producing very different results with each such change and it can have disruptions for users/clients even if previously it wasn't quite right.
I don't really think much about the documentation, probably because I believe that not many users went and studied the exact behavior of min/max in the manual in the 1st place. Because the function is pretty simple and unless you scrutinize something you won't need that info.
I don't have a strong opinion on this topic. Simply tried to project a set of opinions I'm encountering all the time when it comes to "should we change that part of Presto, because it seems wrong" question.
I'm seeing that there are 2 distinct issues with handling NaNs. (1) different functions handle NaNs differently (for example, min vs. array_min); (2) some functions handle NaNs inconsistently (min and max return different results depending on the order of inputs).
It might be hard to fix (1), but we should be able to fix (2). My impression is that this PR fixes (2) for min and max. BTW, there is a similar issue in min_by and max_by.
@hainenber It looks like you are fixing min and max to always ignore NaN values in the input. If that's the case, let's state this clearly in the PR description and commit message and update documentation at presto-docs/src/main/sphinx/functions/aggregate.rst
I couldn't find anything in the SQL standard that talks about treatment of NaN in aggregates but I agree with comments above that at the very least the behavior should be consistent across various aggregates and built-in functions, including array aggs and least/greatest
I couldn't find anything in the SQL standard that talks about treatment of NaN in aggregates but I agree with comments above that at the very least the behavior should be consistent across various aggregates and built-in functions, including array aggs and least/greatest
Also, most systems I checked with the exception of BigQuery don't specify what behavior is followed wrt NaN (they only do so for NULLs). If we ignore NaN, it makes it consistent with NULL-value behavior. For Bigquery, they return NaN if any of the values is NaN: https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#min
@hainenber , have you checked what other systems do with NaNs? Postgres, MySQL, SqlServer etc?
I couldn't find anything in the SQL standard that talks about treatment of NaN in aggregates but I agree with comments above that at the very least the behavior should be consistent across various aggregates and built-in functions, including array aggs and least/greatest
Also, most systems I checked with the exception of BigQuery don't specify what behavior is followed wrt NaN (they only do so for NULLs). If we ignore NaN, it makes it consistent with NULL-value behavior. For Bigquery, they return NaN if any of the values is NaN: https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#min
@hainenber , have you checked what other systems do with NaNs? Postgres, MySQL, SqlServer etc?
This is what I've got after few rounds of searching and LLM'ing
- PostgreSQL treats NaN values as equal, and greater than all non-NaN values. (https://www.postgresql.org/docs/current/datatype-numeric.html)
- MySQL and SQL Server converts NaN as Null and perform comparison accordingly.
Codenotify: Notifying subscribers in CODENOTIFY files for diff 75ebaf15f176a0b1b4a71074d249ae5f350e859e...9d97932826ccbba45ae86202943a4a82cde92387.
Notify | File(s) |
---|---|
@steveburnett | presto-docs/src/main/sphinx/functions/aggregate.rst |
Done updating aggregate.rst
for min
and max
function. I also clarified the solution in the PR message/description.
I also added tests for positive/negative infinity and null as requested.
Hi there, this PR is free for other folks taking over. Thanks for your effort reviewing previously!
I see the logic behind this change, as it seems to achieve what the original function was trying to do. However, I agree with @spershin that we should wait to fix the behavior for min and max until we have a plan for how we want to handle NaNs across the board. This is a long standing issue, and given that we expect to have a plan for NaNs imminently, it doesn't seem worth changing the behavior for these functions 2x. Also, the behavior here almost definitely will not be the final behavior we want for these function once we have consistent handling for NaNs. (we would either consider nans as largest or would return nan if there are any nans in the input).
@rschlussel while I agree with your sentiment, I still find this bug fix to be a relatively minor change in scope and an obvious bug fix. Just in case it wasn't clear, consider once again what it would take to document the existing behavior:
Returns the maximum value of all input values. For double and float types, NaN values are ignored, except if the first value is NaN, then the function will always return NaN.
After this change, the documentation would look like this:
Returns the maximum value of all input values. When x is a REAL or DOUBLE, returns NaN if all input values are NaN. Ignores NaN values otherwise.
Don't you find this to be a small, sensible change that may improve the experience of Presto users, that doesn't hold up our larger fixes related to NaN handling?
That's fair. Yes, I agree the old behavior is very clearly a bug. And given that someone has taken the time to fix it, you're right, we should merge it. However, i think it's also true that in practice we haven't been providing any guarantees to our users regarding treatment of NaNs, and I don't think it's worth investing in fixing other similar bugs until we move toward a more cohesive solution.
@rschlussel should this be closed in favor of #22386?
yes, let's close this one