
[BUG] Failed test_hash_reduction_sum [DATAGEN_SEED=1700579573, INJECT_OOM, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT] on CI

Open cindyyuanjiang opened this issue 2 years ago • 4 comments

Describe the bug
Failed integration test with mismatched CPU and GPU outputs. Pipeline: rapids_integration-pre_release

Details:

FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_reduction_sum[{'spark.rapids.sql.variableFloatAgg.enabled': 'true', 'spark.rapids.sql.castStringToFloat.enabled': 'true', 'spark.rapids.sql.batchSizeBytes': '250'}-Double][DATAGEN_SEED=1700579573, INJECT_OOM, IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT] - AssertionError: GPU and CPU float values are different [0, 'sum(a)']

--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1 +1 @@
-Row(sum(a)=1.5640545749306894e+305)
+Row(sum(a)=inf)

Steps/Code to reproduce bug
Integration testing

Expected behavior
The test should pass.

cindyyuanjiang · Nov 21 '23 19:11

I think this case might even change from run to run. Our aggregations do not guarantee the order in which the sum happens, and floating-point addition is not associative. My guess is that the GPU got unlucky in this case and overflowed, while the CPU, which did the SUM in a strict order, didn't hit the same problem.
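
As a minimal illustration (a standalone sketch, not code from the test), the same doubles can sum to a finite value or overflow to infinity depending only on the order of the additions:

import math

# Each value fits in a double; the exact mathematical sum is 1.6e308.
values = [1.6e308, 1.6e308, -1.6e308]

forward = 0.0
for v in values:
    forward += v          # 1.6e308 + 1.6e308 overflows to inf before the negative term

backward = 0.0
for v in reversed(values):
    backward += v         # the negative term cancels first, so nothing overflows

print(forward)   # inf, like the GPU result in the failing test
print(backward)  # 1.6e308
assert math.isinf(forward) and not math.isinf(backward)

This is the same shape as the failure above: an unlucky intermediate ordering pushes a partial sum past the double range, and once the running total hits inf, later negative terms cannot bring it back.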

revans2 · Nov 21 '23 20:11

See comments in https://github.com/NVIDIA/spark-rapids/issues/10026, which I believe is another instance of this issue.

jbrennan333 · Jan 03 '24 19:01

Here is a smaller version of the test case from #10026 that still shows a difference:

+-------------+
|b            |
+-------------+
|8.2052361E17 |
|-9.615448E24 |
|3.4028235E38 |
|-8.9130315E23|
|-3.4028235E38|
|4.86208753E14|
|2.825404E25  |
|9.7615254E-4 |
+-------------+

SUM CPU: 1.7713718887409737e+25
SUM GPU: 1.7718319043726909e+25
APPROX EQ: False

SORTED SUM CPU: 1.7718319043726909e+25
SORTED SUM GPU: 1.7718319043726909e+25
APPROX EQ: True

REVERSE SORTED SUM CPU: 1.7718319043726909e+25
REVERSE SORTED SUM GPU: 1.7718319043726909e+25
APPROX EQ: True

Note that in this case I got the same result on the GPU every time; it's the CPU that gave a different result in the unsorted case. Once again, if you add an accumulation column, the GPU matches the CPU for the unsorted case, even though I get a different value on the GPU when I do a SUM.

+----------------------+----------------------+----------------------+
|b                     |cpu_accum             |gpu_accum             |
+----------------------+----------------------+----------------------+
|8.2052360892841984E17 |8.2052360892841984E17 |8.2052360892841984E17 |
|-9.615447782308683E24 |-9.615446961785074E24 |-9.615446961785074E24 |
|3.4028234663852886E38 |3.4028234663851923E38 |3.4028234663851923E38 |
|-8.913031508548466E23 |3.4028234663851832E38 |3.4028234663851832E38 |
|-3.4028234663852886E38|-1.0540321989765048E25|-1.0540321989765048E25|
|4.862087528448E14     |-1.0540321989278838E25|-1.0540321989278838E25|
|2.8254040876688576E25 |1.7713718887409737E25 |1.7713718887409737E25 |
|9.761525434441864E-4  |1.7713718887409737E25 |1.7713718887409737E25 |
+----------------------+----------------------+----------------------+

It appears that for the SUM, the order of operations on the CPU matches the input data order, but on the GPU it does not, so we sometimes get different results.
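
The order sensitivity is easy to reproduce outside of Spark. Here is a minimal Python sketch using the double values from the accumulation table above; plain sequential IEEE 754 addition gives different results for input order versus sorted order, matching the sums reported earlier:

vals = [
    8.2052360892841984e17,
    -9.615447782308683e24,
    3.4028234663852886e38,
    -8.913031508548466e23,
    -3.4028234663852886e38,
    4.862087528448e14,
    2.8254040876688576e25,
    9.761525434441864e-4,
]

def seq_sum(xs):
    # Sequential left-to-right addition, like the cpu_accum column above.
    total = 0.0
    for x in xs:
        total += x
    return total

print(seq_sum(vals))          # 1.7713718887409737e+25, matches the final cpu_accum value
print(seq_sum(sorted(vals)))  # 1.7718319043726909e+25, matches the sorted sums above

The large +/-3.4028e38 terms dominate the rounding: while the running total sits near the top of the float range, the spacing between adjacent doubles is on the order of 10^22, so the smaller terms are rounded away differently depending on where they land in the sequence.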

jbrennan333 · Jan 05 '24 15:01

We might want to split the float and double datagens for this test out into a separate test, so that all of the deterministic datagens are in one test and the order-sensitive ones are in another.

One observation I had when looking into this for floats/doubles is that order really matters when the magnitudes of the values are very different. Adding very small numbers to very large numbers can effectively discard the small numbers, and the order matters when enough small numbers add together to affect the final result. One way to make the test more resilient would be to ensure that all of the random numbers fall within a smaller range, for example by selecting an initial random starting value and then generating values in a restricted range around it.
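
A minimal sketch of that absorption effect (the 1e16 magnitude and the count of 1000 are illustrative, not values from the test):

big = 1e16  # adjacent doubles near 1e16 are 2.0 apart, so 1e16 + 1.0 rounds back to 1e16

large_first = big
for _ in range(1000):
    large_first += 1.0       # each individual 1.0 is lost to rounding

small_first = sum([1.0] * 1000) + big  # the small values accumulate to 1000.0 first

print(large_first)  # 1e+16
print(small_first)  # 1.0000000000001e+16

Keeping the generated values within a narrower magnitude range means no single large term can swamp the others, which makes the result far less sensitive to ordering.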

jbrennan333 · Apr 01 '24 14:04