
BaseVector::hashValueAt is not consistent with BaseVector::equalValueAt for NaNs

Open · mbasmanova opened this issue 10 months ago

Bug description

Given two equal values, we expect their hashes to be the same. The converse does not hold: unequal values may share a hash (hash collisions).

However, given two different representations of NaN (for example, std::nan("1") and std::nan("2")), BaseVector::equalValueAt returns true, while BaseVector::hashValueAt returns two different hashes.

A simple repro:

TEST_F(AggregationTest, xxx) {
  auto v = makeFlatVector<double>({std::nan("1"), std::nan("2")});

  auto h1 = v->hashValueAt(0);
  auto h2 = v->hashValueAt(1);

  EXPECT_EQ(h1, h2);
  EXPECT_TRUE(v->equalValueAt(v.get(), 0, 1));

...
Expected equality of these values:
  h1
    Which is: 12470945178970240804
  h2
    Which is: 8927572481385630149

Similarly, when aggregating with grouping keys that contain different NaNs, Velox puts each NaN into a separate group.

  auto data = makeRowVector({v});

  auto plan =
      PlanBuilder().values({data}).singleAggregation({"c0"}, {}).planNode();

  auto r = AssertQueryBuilder(plan).copyResults(pool());
  LOG(ERROR) << r->toString();
  LOG(ERROR) << r->toString(0, 10);

...
[ROW ROW<c0:DOUBLE>: 2 elements, no nulls]
0: {NaN}
1: {NaN}

This behavior seems to be due to the usage of folly::hasher<float> and folly::hasher<double>. From Folly's perspective it makes sense for different NaNs to hash differently, because C++ implements the IEEE 754 standard, in which NaN is never equal to NaN.

However, SQL implementations do not follow the IEEE 754 standard and expect all NaNs to be equal. This behavior is followed in BaseVector::compare and equalValueAt, but not in hash computation.

  FOLLY_ALWAYS_INLINE static int comparePrimitiveAsc(
      const T& left,
      const T& right) {
    if constexpr (std::is_floating_point<T>::value) {
      bool isLeftNan = std::isnan(left);
      bool isRightNan = std::isnan(right);
      if (isLeftNan) {
        return isRightNan ? 0 : 1;
      }
      if (isRightNan) {
        return -1;
      }
    }
    return left < right ? -1 : left == right ? 0 : 1;
  }

Hash computation needs to be fixed in multiple places.

  • BaseVector::hashValueAt
  • VectorHasher::hashValues
  • RowContainer::hashTyped
  • any other place that uses folly::hasher

CC: @rschlussel @kagamiori @kgpai @rui-mo

This issue was discovered during code review: https://github.com/facebookincubator/velox/pull/9272#issuecomment-2031169744

System information

n/a

Relevant logs

No response

mbasmanova · Apr 02 '24

Branching every time we compare or hash is expensive. The most efficient approach would be to normalize NaNs to the same value beforehand.

Yuhta · Apr 02 '24

Hash computation needs to be fixed in multiple places.

  • BaseVector::hashValueAt
  • VectorHasher::hashValues
  • RowContainer::hashTyped
  • any other place that uses folly::hasher

Spark performs NaN normalization on keys of aggregation, join and window operations. I wonder if there are other scenarios where this inconsistency has caused issues? Thanks.

zhli1142015 · Apr 03 '24

@zhli1142015 Any code that relies on BaseVector::hashValueAt is at risk of handling NaN values incorrectly.

mbasmanova · Apr 08 '24

Thanks. I checked: the Spark `in` function also produces wrong results for NaN values.

zhli1142015 · Apr 08 '24

Hello @mbasmanova and @Yuhta, may I ask what the next step is for this issue? It seems the correct behavior for Spark and Presto is completely different: Spark expects different NaNs to be treated as the same, and -0.0 and 0.0 to be treated as the same, while Presto treats them as different. Change for Presto: https://github.com/facebookincubator/velox/pull/9511. Change for Spark: https://github.com/facebookincubator/velox/pull/9272. cc @FelixYBW, @rui-mo also.

zhli1142015 · Apr 25 '24

For Presto, NaNs should be treated as the same. We haven't decided on +/-0. Treating them as different is easiest from a code perspective because it aligns with the built-in Java implementations of compare/hashCode/equals. But if it's preferred, we could use our own implementations with extra logic for handling +/-0.

rschlussel · Apr 25 '24

Thanks, got it. So the difference is only about +/-0.0.

zhli1142015 · Apr 25 '24

@rschlussel We will need to either normalize NaNs or use a custom comparator anyway. I think the better way is to also normalize -0.0 to 0.0, or handle it in the custom comparator as well.

Yuhta · Apr 25 '24

Normalizing seems tricky (we would need to normalize after every operation on a double), but we can do a custom comparator.

rschlussel · Apr 25 '24