presto
presto copied to clipboard
Clarifications on comparisons with NaN
Current Behavior
Currently in Presto NaN() seems to behave like its > infinity,
select ARRAY_SORT(a) from (values array[NAN(), INFINITY(), -INFINITY(), 1]) as T(a);
— [ "-Infinity", 1, "Infinity", "NaN" ]
select ARRAY_SORT_Desc(a) from (values array[NAN(), INFINITY(), -INFINITY(), 1]) as T(a);
— ["NaN", "Infinity", 1, "-Infinity"]
This is not just limited to array_sort and you can see this behavior in order by too:
SELECT x, y from (values (1, 1.0), (2, NAN()), (3, -NAN()), (4, INFINITY()), (5, -INFINITY())) as T(x, y) order by y;
—
5 -Infinity
1 1
4 Infinity
2 NaN
3 NaN
However running
SELECT NAN() > INFINITY()
-- False
gives a contradictory answer.
Expected Behavior
Comparison with NaN is not defined so we should confirm to https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN.
Context
Velox is trying to ensure correctness with Presto Java but we are seeing inconsistencies in behavior and wanted clarifications on these.
cc: @mbasmanova @spershin @tdcmeehan @Yuhta @kagamiori
CC: @kaikalur @feilong-liu
As discussed in #21877, I am not sure if it's most appropriate to follow the IEEE standard, or to follow the behavior of most engines. It appears that behavior of NaN is not defined in the spec, so we are actually free to follow whatever behavior is most intuitive to users--in my opinion, this should be consistent with that of other engines. The real requirement is to make them consistent, and document the behavior well for functions.
I've been looking into the NaN inconsistencies, and wanted to give an update/summary of my thinking. If we can get alignment on the solution we can start tackling the problem
reference document of presto’s treatment of NaN for comparison and ordering across all functions and operators: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498
What does IEEE-754 say about NaN
IEEE specifies that operations on NaN return false, and that comparison operations return false except for inequality which returns true. (I found a section online here after seeing references to total order: https://gist.github.com/leto/176888/e1bf1cf76d56a3da34b6be5a0f6b95139a747299) It also specifies a total order of doubles as follows
- -quietNan
- -signalingNan
- regular expected ordering of doubles
- +signalingNaN
- +quietNan
what people usually expect should be true about numbers
There are a couple concepts that people intuitively expect to be consistent:
- equality/inequality and distinctness, groupings, etc.
- <, >, <=, >= and sort order
Presto current behavior
- comparison - basic operations follow IEEE754 (nan <> nan). joins do not match NaN keys, but distinct and group bys do consider NaN to be equal (deduplicates, considered a single group, etc.). Many map and array function do not consider NaN to be equal to itself, (will show up multiple times in maps, array distincts, unions, etc.), but the exact behavior is inconsistent.
- ordering - basic operations follow IEEE754 (NaN < or > anything is false). Order by puts NaNs as biggest. min and max have inconsistent behavior. greatest and least throw an error. topn functions on arrays and maps are inconsistent, but generally sort NaN as biggest. File format min/max stats differ by file format and may also differ based on who wrote the file.
A lot of the Presto inconsistency comes from the following: In Java, basic comparison operations on double type (<, >, =, !=, >-, <=) behave according to the IEEE754 definition. However, Double the object class has equals, compare() and hashcode all defined so that NaN are equal to each other and greater than anything else. Some of our functions and operators use the Object comparison or structures while others use the primitive definition (and some probably use a combo). There are still other operators that have special handling for nulls. Additionally, the primitive definition itself can lead to bugs when used for sorting if nan is not handled separately (e.g. https://github.com/prestodb/presto/issues/22040).
what to do:
How should we decide what the behavior of NaN should be?
- useful and reasonably intuitive behavior for users
- reasonably standard
- reasonable to implement and maintain as new functions and features are added.
Options
- IEEE definition without total order - Any basic comparisons with NaN return false, NaN<> NaN is true. NaN are not de-distincted. min, max etc. return nan. Would throw an error for sorting. Pros: a) IEEE754 compliant cons: a) not useful behavior for users if we throw an error for order by queries when input contains nan. (and possibly not sql compliant?) b) implementation might be tricky. and keeping consistency might be tricky.
- IEEE compliant with total order - Any basic comparisons with NaN return false, NaN<> NaN is true. NaN are not de-distincted. unclear what min/max should do. For sorting uses total order Pros: a) IEEE754 compliant cons: a) not intuitive that basic comparison operations don’t match sort order. operations like min/max or topn don't obviously fall into distinctions between sort and comparison b) might be surprising to users that nans wouldn’t be deduplicated for distinct, group by, etc. c) having comparison operations behave differently than sort makes this prone to inconsistency/bugs.
- NaN=NaN and NaN is biggest. Pros: a) many other DBs use this definition b) easy to maintain consistency across different kinds of functions and operations because of consistency with expectations that <> and distinct are the same and > is the same as > for the purposes of sorting. c) behavior is reasonably useful for users Cons: a) not IEEE compliant
Conclusion
I think option 3 (nan=nan and sorts largest) will be simplest to implement and maintain consistently, is reasonable behavior for users, and is aligned with many other dbs, and therefore reasonably standard even though it's not IEEE compliant.
@rschlussel Not sure what users we have in mind but from a numerical computing background, treating NaN as a normal number is very counter intuitive (making it larger than Inf is even insane, how could something be larger than Inf?!). And it definitely will introduce more bugs and harder to maintain than option 1 (throwing is always the safest way; not alway the most friendly though).
Assume we want to go with option 3, do we want to fix the comparison operations to reflect this new definition of NaN?
@rschlussel Not sure what users we have in mind but from a numerical computing background, treating NaN as a normal number is very counter intuitive (making it larger than Inf is even insane, how could something be larger than Inf?!). And it definitely will introduce more bugs and harder to maintain than option 1 (throwing is always the safest way; not alway the most friendly though).
Assume we want to go with option 3, do we want to fix the comparison operations to reflect this new definition of NaN?
That's a great point. I guess I was assuming that most users are not using the db for very mathematical use cases, but mostly nan are bad data and they are doing basic things like getting the top x by price. But it would be a good idea to ask some customers their preferences as well.
I am very hesitant about option 1. I think throwing an exception would likely be a bad user experience in most cases, especially given that there is nothing mandating we do that, and it seems like most dbs don't. I tried to see what the sql spec had to say about it. It says
If PVi and QVi are not the null value and the result of “PVi
QVi” is Unknown, then the relative ordering of PVi and QVi is implementation-dependent."
That to me suggests that for NaN values (where by IEEE comparisons are considered unordered), the sort order would be implementation dependent, but it doesn't seem like it should throw an exception.
For option 3, yes, we would redefine comparison operations to reflect the new definition of NaN (what makes it simple is that notions of distinctnesss, equality/inequality, ordering comparisons, and sorting are all behaving the same. The definition isn't standard to what you'd expect in mathematical computing, but it is internally consistent and easy to understand).
@rschlussel OK then at least we can have some consistent design in this space. The implementation would be messy though, because you will need to normalize all the different NaNs to the same place as +NaN on every newly generated floating point blocks if you want it to be efficient, and need to be careful whenever doing floating number comparison, we need to use the custom comparator instead of the built in operators.
@czentgr proposed the following behavior for Velox: https://github.com/facebookincubator/velox/pull/7237
@czentgr proposed the following behavior for Velox: facebookincubator/velox#7237
Thank you! It sounds like we are in agreement then. Let me collect approvals so I can move forward with this as well.
I think option 3 (nan=nan and sorts largest) will be simplest to implement and maintain consistently, is reasonable behavior for users, and is aligned with many other dbs, and therefore reasonably standard even though it's not IEEE compliant.
Agreed on this--this also seems most consistent with other engines, and it's easy to document as we see above.
Thanks for the comments on PR https://github.com/facebookincubator/velox/pull/7237. I will add the requested information and check.
Hi @rschlussel, not sure if you're already aware of this, but I found another "inconsistent" handling of NaN in Presto: distinct aggregation considers two NaN to be duplicates, while set_agg() function considers two NaN to be distinct.
SELECT
SET_AGG(c0)
FROM (
VALUES
(1.1),
(NAN()),
(NAN())
) t(c0); --- result is [1.1, NaN, NaN]
SELECT DISTINCT
c0
FROM (
VALUES
(1.1),
(NAN()),
(NAN())
) t(c0); --- result are two rows: 1.1 and NaN
Since the Presto documentation of set_agg says "Returns an array created from the distinct input x elements", I expect these two queries to treat NaN in the same way.
Thank you, yes, i did know about this difference. I made a wiki about how we handle nans across all of our functions that do comparison or ordering on doubles https://github.com/prestodb/presto/wiki/Presto-NaN-behavior