fix: use total ordering in the min & max accumulator for floats
Which issue does this PR close?
Closes #8031
Rationale for this change
See the motivating issue.
What changes are included in this PR?
The MinAccumulator and MaxAccumulator now use ScalarValue::partial_cmp to compare two float scalars instead of f32::min and f32::max (or their f64 equivalents). This approach was already being taken for intervals.
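For illustration, here is a minimal standalone sketch (my own, not code from this PR) of why the comparison matters, using std's f64::total_cmp to stand in for the total ordering the accumulator now relies on:

```rust
use std::cmp::Ordering;

fn main() {
    let a = f64::NAN;
    let b = 1.0_f64;

    // IEEE-style max: NaN is ignored, so max(NaN, 1.0) == 1.0.
    assert_eq!(f64::max(a, b), 1.0);

    // Total ordering: NaN compares greater than every non-NaN value,
    // so the max of {NaN, 1.0} is deterministically NaN.
    let max = if a.total_cmp(&b) == Ordering::Greater { a } else { b };
    assert!(max.is_nan());
}
```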
Are these changes tested?
Yes, I added a unit test. I did not see any existing benchmarks for the accumulators, so I'm not entirely certain whether this approach is slower. However, I think a performance regression is unlikely because this only changes how we compare the intermediate results: given two arrays, we still use the arrow kernels to find the min/max of each array, and this path is only used to compare the results of those two calculations. In other words, we are in per-batch space, not per-row space (see the sketch below).
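To make the per-batch claim concrete, here is a hedged sketch of that flow; the struct and names below are illustrative only, and the real accumulator in DataFusion looks different:

```rust
use arrow::array::Float64Array;
use arrow::compute::min; // arrow aggregate kernel

struct SketchMinAccumulator {
    current: Option<f64>,
}

impl SketchMinAccumulator {
    fn update_batch(&mut self, values: &Float64Array) {
        // Per-row space: one vectorized kernel call over the whole batch.
        if let Some(batch_min) = min(values) {
            // Per-batch space: a single comparison against the running
            // minimum, which is the only place this PR changes behavior.
            self.current = Some(match self.current {
                Some(cur) if cur.total_cmp(&batch_min).is_le() => cur,
                _ => batch_min,
            });
        }
    }
}
```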
Are there any user-facing changes?
No.
I had to change the describe test because the float / double columns contain NaNs (or maybe -inf?). This means that the min, which used to be a concrete value, is now NaN (which doesn't seem to display).
I wonder if describe should filter out NaN values when calculating min/max?
FWIW, I checked what postgres does:
```
postgres=# create table foo(x float);
CREATE TABLE
postgres=# insert into foo values (1), (2), ('NaN');
INSERT 0 3
postgres=# select * from foo;
  x
-----
   1
   2
 NaN
(3 rows)

postgres=# select min(x) from foo;
 min
-----
   1
(1 row)

postgres=# select max(x) from foo;
 max
-----
 NaN
(1 row)
```
So that suggests to me it treats NaN as the largest floating point value.
BTW, here is a test that shows inconsistent NaN handling for f32 and f64 (just to help make the current behavior clearer): https://github.com/apache/datafusion/pull/10634
> So that suggests to me it treats NaN as the largest floating point value.
If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?
If you want to follow the postgres behavior then the change will be more complex: you would need a custom max function for arrays (instead of using arrow_arith::aggregate::max) in addition to the changes in the accumulator.
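To illustrate what that custom kernel could look like, here is a hypothetical sketch (the function max_nan_greatest is my own invention, not code from this PR), implementing the postgres rule of NaN sorting above every non-NaN value:

```rust
use arrow::array::Float64Array;

fn max_nan_greatest(values: &Float64Array) -> Option<f64> {
    let mut result: Option<f64> = None;
    // iter() yields Option<f64>; flatten() skips nulls, which SQL max ignores.
    for v in values.iter().flatten() {
        result = Some(match result {
            // total_cmp sorts NaN above all non-NaN values.
            Some(cur) if cur.total_cmp(&v).is_ge() => cur,
            _ => v,
        });
    }
    result
}
```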
> So that suggests to me it treats NaN as the largest floating point value.
This is confirmed in the latest version of the postgres docs:
> IEEE 754 specifies that NaN should not compare equal to any other floating-point value (including NaN). In order to allow floating-point values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.
> If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?
I would personally suggest we do whatever is consistent with arrow (and easier) unless there is a compelling reason to do something different
@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?
> @westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?
Thanks for the push. I have now discovered sqllogictests :)
Turns out my approach didn't work because partial_cmp propagates nulls (while the previous implementation ignored them). I think I've got it fixed now. The earlier describe test failure had the same cause (it was propagating a null, not a NaN), so I was able to revert that change.
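For clarity, an illustrative sketch of the null handling (not the exact patch): SQL min/max must skip NULLs rather than propagate them, so a missing value on either side short-circuits instead of flowing through the comparison.

```rust
fn min_ignoring_null(cur: Option<f64>, new: Option<f64>) -> Option<f64> {
    match (cur, new) {
        (Some(a), Some(b)) => Some(if a.total_cmp(&b).is_le() { a } else { b }),
        // A null on either side is skipped, not propagated.
        (Some(v), None) | (None, Some(v)) => Some(v),
        (None, None) => None,
    }
}
```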
I suspect this means that the min/max function for intervals is also incorrectly propagating nulls, but when I tried to write a unit test for intervals I got an error that there is no accumulator for intervals, so I guess we can cross that bridge when we come to it.
@alamb (See above reply, forgot to ping-reply you)