
BaseVector::hashValueAt is not consistent with BaseVector::equalValueAt for NaNs

Open · mbasmanova opened this issue 10 months ago

Bug description

Given two equal values, we expect their hashes to be the same. The converse does not hold: unequal values may share a hash (hash collisions).

However, given two different representations of NaN (for example, std::nan("1") and std::nan("2")), BaseVector::equalValueAt returns true, while BaseVector::hashValueAt returns two different hashes.

A simple repro:

TEST_F(AggregationTest, xxx) {
  auto v = makeFlatVector<double>({std::nan("1"), std::nan("2")});

  auto h1 = v->hashValueAt(0);
  auto h2 = v->hashValueAt(1);

  EXPECT_EQ(h1, h2);
  EXPECT_TRUE(v->equalValueAt(v.get(), 0, 1));

...
Expected equality of these values:
  h1
    Which is: 12470945178970240804
  h2
    Which is: 8927572481385630149

Similarly, when aggregating with grouping keys that contain different NaNs, Velox puts each NaN into a separate group.

  auto data = makeRowVector({v});

  auto plan =
      PlanBuilder().values({data}).singleAggregation({"c0"}, {}).planNode();

  auto r = AssertQueryBuilder(plan).copyResults(pool());
  LOG(ERROR) << r->toString();
  LOG(ERROR) << r->toString(0, 10);

...
[ROW ROW<c0:DOUBLE>: 2 elements, no nulls]
0: {NaN}
1: {NaN}

This behavior seems to be due to the usage of folly::hasher<float> and folly::hasher<double>. From Folly's perspective it makes sense for different NaNs to hash differently, because C++ implements the IEEE 754 standard, in which NaN is never equal to NaN.

However, SQL implementations do not follow the IEEE 754 standard and expect all NaNs to be equal. This behavior is followed in BaseVector::compare and equalValueAt, but not in hash computation.

  FOLLY_ALWAYS_INLINE static int comparePrimitiveAsc(
      const T& left,
      const T& right) {
    if constexpr (std::is_floating_point<T>::value) {
      bool isLeftNan = std::isnan(left);
      bool isRightNan = std::isnan(right);
      if (isLeftNan) {
        return isRightNan ? 0 : 1;
      }
      if (isRightNan) {
        return -1;
      }
    }
    return left < right ? -1 : left == right ? 0 : 1;
  }

Hash computation needs to be fixed in multiple places.

  • BaseVector::hashValueAt
  • VectorHasher::hashValues
  • RowContainer::hashTyped
  • any other place that uses folly::hasher

CC: @rschlussel @kagamiori @kgpai @rui-mo

This issue was discovered during code review: https://github.com/facebookincubator/velox/pull/9272#issuecomment-2031169744

System information

n/a

Relevant logs

No response

mbasmanova · Apr 02 '24

Branching every time we compare or hash is expensive. The most efficient approach would be to normalize NaNs to the same value beforehand.

Yuhta · Apr 02 '24

Hash computation needs to be fixed in multiple places.

  • BaseVector::hashValueAt
  • VectorHasher::hashValues
  • RowContainer::hashTyped
  • any other place that uses folly::hasher

Spark performs NaN normalization on keys of aggregation, join and window operations. I wonder if there are other scenarios where this inconsistency has caused issues? Thanks.

zhli1142015 · Apr 03 '24

@zhli1142015 Any code that relies on BaseVector::hashValueAt is at risk of handling NaN values incorrectly.

mbasmanova · Apr 08 '24

Thanks. I checked: the Spark `in` function also produces wrong results for NaN values.

zhli1142015 · Apr 08 '24

Hello @mbasmanova and @Yuhta, may I ask what the next step is for this issue? It seems the correct behavior for Spark and Presto is completely different: Spark expects different NaNs to be treated as the same, and -0.0 and 0.0 to be treated as the same, while Presto treats them as different. Change for Presto: https://github.com/facebookincubator/velox/pull/9511. Change for Spark: https://github.com/facebookincubator/velox/pull/9272. cc @FelixYBW, @rui-mo also.

zhli1142015 · Apr 25 '24

For Presto, NaNs should be treated as the same. We haven't decided on +/-0. Treating them as different is easiest from a code perspective because it aligns with the built-in Java implementations of compare/hashCode/equals. But if it's preferred, we could use our own implementations with extra logic for handling +/-0.

rschlussel · Apr 25 '24

Thanks, got it. So the difference is only about +/-0.0.

zhli1142015 · Apr 25 '24

@rschlussel We will need to either normalize NaNs or use a custom comparator anyway. I think the better way is to also normalize -0.0 to 0.0, or handle it in the custom comparator as well.

Yuhta · Apr 25 '24

Normalizing seems tricky (we would need to normalize after every operation on a double), but we can do a custom comparator.

rschlussel · Apr 25 '24