BaseVector::hashValueAt is not consistent with BaseVector::equalValueAt for NaNs
Bug description
Given two equal values, we expect their hashes to be the same. (The converse need not hold; hash collisions are allowed.)
However, given two different representations of NaN (for example, std::nan("1") and std::nan("2")), BaseVector::equalValueAt returns true, while BaseVector::hashValueAt returns two different hashes.
A simple repro:
TEST_F(AggregationTest, xxx) {
  auto v = makeFlatVector<double>({std::nan("1"), std::nan("2")});
  auto h1 = v->hashValueAt(0);
  auto h2 = v->hashValueAt(1);
  EXPECT_EQ(h1, h2);
  EXPECT_TRUE(v->equalValueAt(v.get(), 0, 1));
...
Expected equality of these values:
  h1
    Which is: 12470945178970240804
  h2
    Which is: 8927572481385630149
Similarly, when aggregating with grouping keys that contain different NaNs, Velox puts each NaN into a separate group.
auto data = makeRowVector({v});
auto plan =
    PlanBuilder().values({data}).singleAggregation({"c0"}, {}).planNode();
auto r = AssertQueryBuilder(plan).copyResults(pool());
LOG(ERROR) << r->toString();
LOG(ERROR) << r->toString(0, 10);
...
[ROW ROW<c0:DOUBLE>: 2 elements, no nulls]
0: {NaN}
1: {NaN}
This behavior seems to be due to the use of folly::hasher<float> and folly::hasher<double>. In Folly it makes sense that different NaNs hash differently, because C++ follows the IEEE 754 standard, where NaN is never equal to NaN.
However, SQL implementations do not follow IEEE 754 here and expect all NaNs to be equal. Velox follows this convention in BaseVector::compare and equalValueAt, but not in hash computation.
FOLLY_ALWAYS_INLINE static int comparePrimitiveAsc(
    const T& left,
    const T& right) {
  if constexpr (std::is_floating_point<T>::value) {
    bool isLeftNan = std::isnan(left);
    bool isRightNan = std::isnan(right);
    if (isLeftNan) {
      return isRightNan ? 0 : 1;
    }
    if (isRightNan) {
      return -1;
    }
  }
  return left < right ? -1 : left == right ? 0 : 1;
}
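To illustrate the hashing side of the inconsistency, here is a small standalone example (not Velox code; the exact hash values are implementation-dependent). Different NaN payloads typically have different bit patterns, and the bit-pattern-based hash then differs as well:

#include <cmath>
#include <cstdint>
#include <cstring>
#include <iostream>

#include <folly/hash/Hash.h>

int main() {
  double a = std::nan("1");
  double b = std::nan("2");

  // Both values are NaN, but their payload bits typically differ.
  uint64_t bitsA;
  uint64_t bitsB;
  std::memcpy(&bitsA, &a, sizeof(a));
  std::memcpy(&bitsB, &b, sizeof(b));

  std::cout << std::isnan(a) << " " << std::isnan(b) << std::endl; // 1 1
  std::cout << (bitsA == bitsB) << std::endl;                      // usually 0
  std::cout << (folly::hasher<double>{}(a) == folly::hasher<double>{}(b))
            << std::endl; // 0 in the behavior observed above
  return 0;
}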
Hash computation needs to be fixed in multiple places.
- BaseVector::hashValueAt
- VectorHasher::hashValues
- RowContainer::hashTyped
- any other place that uses folly::hasher
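One possible direction, sketched below as a hypothetical helper (names are illustrative and not Velox APIs), is to canonicalize NaN to a single bit pattern before handing the value to folly::hasher, so all NaNs hash to the same value:

#include <cmath>
#include <cstdint>
#include <limits>
#include <type_traits>

#include <folly/hash/Hash.h>

// Hypothetical helper: map every NaN payload to the canonical quiet NaN
// before hashing, so that hashing stays consistent with equality.
template <typename T>
uint64_t hashFloatingPointNaNSafe(T value) {
  static_assert(std::is_floating_point_v<T>, "only for float/double");
  if (std::isnan(value)) {
    value = std::numeric_limits<T>::quiet_NaN();
  }
  return folly::hasher<T>{}(value);
}

With such a helper, hashValueAt(0) and hashValueAt(1) in the repro above would agree, matching equalValueAt.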
CC: @rschlussel @kagamiori @kgpai @rui-mo
This issue was discovered during code review: https://github.com/facebookincubator/velox/pull/9272#issuecomment-2031169744
System information
n/a
Relevant logs
No response
Adding a branch every time we do a comparison or hash is expensive. The most efficient approach would be to normalize NaNs to the same value beforehand.
Spark performs NaN normalization on keys of aggregation, join and window operations. I wonder if there are other scenarios where this inconsistency has caused issues? Thanks.
@zhli1142015 Any code that relies on BaseVector::hashValueAt is at risk of handling NaN values incorrectly.
Thanks. I checked Spark; the corresponding function also produces wrong results for NaN values.
Hello @mbasmanova and @Yuhta, may I ask what the next step for this issue is? It seems like the correct behavior for Spark and Presto is completely different: Spark expects different NaNs to be treated as the same, and -0.0 and 0.0 to be treated as the same, while Presto thinks they should be treated as different. Change for Presto: https://github.com/facebookincubator/velox/pull/9511 Change for Spark: https://github.com/facebookincubator/velox/pull/9272 cc @FelixYBW, @rui-mo also.
For Presto, NaNs should be treated the same. We haven't decided on +/-0. Treating them as different is easiest from a code perspective because it aligns with the built-in Java implementations of compare/hashCode/equals, but if it's preferred we could use our own implementations with extra logic for handling +/-0.
Thanks, got it. So the difference is only related to +/-0.0.
@rschlussel We will need to either normalize NaNs or use a custom comparator anyway; I think a better way is to also normalize -0.0 to 0.0, or handle it in the custom comparator as well.
Normalizing seems tricky (we would need to normalize after every operation on a double), but we can do a custom comparator.
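For reference, a minimal sketch of the normalization idea discussed above (a hypothetical helper, not the approach ultimately adopted):

#include <cmath>
#include <limits>
#include <type_traits>

// Hypothetical normalization: map every NaN to the canonical quiet NaN and
// -0.0 to +0.0, so that hashing and equality on the normalized value agree.
// Collapsing -0.0 matches Spark semantics; Presto may decide to keep -0.0
// and 0.0 distinct, per the discussion above.
template <typename T>
T normalizeFloatingPoint(T value) {
  static_assert(std::is_floating_point_v<T>, "only for float/double");
  if (std::isnan(value)) {
    return std::numeric_limits<T>::quiet_NaN();
  }
  if (value == T{0}) {
    return T{0}; // -0.0 == 0.0 is true, so this collapses -0.0 to +0.0.
  }
  return value;
}

Whether the -0.0 branch belongs in a normalizer like this or in a custom comparator is exactly the open question in the comments above.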