datafusion fix: use total ordering in the min & max accumulator for floats

Which issue does this PR close?

Closes #8031

Rationale for this change

See the motivating issue.

What changes are included in this PR?

The MinAccumulator and MaxAccumlator now use ScalarValue::partial_cmp to compare two float scalars instead of f32::min and f32::max (or f64). This is an approach that was already being taken for intervals.

Are these changes tested?

Yes, I added a unit test. I did not see any existing benchmarks for the accumulators. I'm not entirely certain if this approach is slower or not. However, I think a performance regression is unlikely since this only changes how we compare the intermediate results. I.e. given two arrays we will use the arrow kernels to find the min/max of the two arrays. Then we only use this path to compare the result of those two calculations. (e.g. we are in per-batch space and not per-row space)

Are there any user-facing changes?

No.

May 22 '24 20:05 westonpace

I had to change the describe test because the float / double columns have nans (or maybe -inf?). This means that the min, which used to be a value, is now nan (which doesn't seem to display).

I wonder if describe should filter out nan values when calculating min/max?

May 22 '24 21:05 westonpace

I had to change the describe test because the float / double columns have nans (or maybe -inf?). This means that the min, which used to be a value, is now nan (which doesn't seem to display).

I wonder if describe should filter out nan values when calculating min/max?

FWI I checed what postgres does:


postgres=# create table foo(x float);
CREATE TABLE

postgres=# insert into foo values (1), (2), ('NaN');
INSERT 0 3
postgres=# select * from foo;
  x
-----
   1
   2
 NaN
(3 rows)

postgres=# select min(x) from foo;
 min
-----
   1
(1 row)

postgres=# select max(x) from foo;
 max
-----
 NaN
(1 row)

So that suggests to me it treats NaN as the largest floating point value

May 23 '24 10:05 alamb

BTW here is a test that shows inconsistent Nan handling for f32 and f64 (just to help make the current behavior clearer): https://github.com/apache/datafusion/pull/10634

May 23 '24 10:05 alamb

So that suggests to me it treats NaN as the largest floating point value

If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?

If you want to follow the postgres behavior then the change will be more complex. You will need a custom max function for arrays (instead of using arrow_arith::aggregate::max) in addition to changes in the accumulator.

May 24 '24 13:05 westonpace

So that suggests to me it treats NaN as the largest floating point value

This is confirmed in the latest version of the postgres docs:

IEEE 754 specifies that NaN should not compare equal to any other floating-point value (including NaN). In order to allow floating-point values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.

May 24 '24 13:05 westonpace

If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?

I would personally suggest we do whatever is consistent with arrow (and easier) unless there is a compelling reason to do something different

May 25 '24 09:05 alamb

@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?

Jun 03 '24 13:06 alamb

@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?

Thanks for the push. I have now discovered sqllogictests :)

Turns out my approach didn't work because partial_cmp propagates nulls (and the previous impl ignored nulls). I think I've got it fixed now. The earlier describe test failure was also for this same reason (it was propagating a null, not a nan) and so I was able to revert that change.

Jun 05 '24 18:06 westonpace

I suspect this means that the min/max function for intervals is also incorrectly propagating nulls but I tried to make a unit test for intervals and get an error that there is no accumulator for intervals and so I guess we can cross that bridge when we come to it.

Jun 05 '24 18:06 westonpace

@alamb (See above reply, forgot to ping-reply you)

Jun 06 '24 16:06 westonpace