spark-rapids
[BUG] hash_aggregate_test.py::test_exact_percentile_reduction failed with DATAGEN_SEED=1705866905
Describe the bug
[2024-01-21T17:44:19.558Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_reduction[[('val', RepeatSeq(Double)), ('freq', Long(not_null))]][DATAGEN_SEED=1705857175] - AssertionError: GPU and CPU float values are different [0, 'percentile(val,...
Summary:
09:44:19 --- CPU OUTPUT
09:44:19 +++ GPU OUTPUT
09:44:19 @@ -1 +1 @@
09:44:19 -Row(percentile(val, CAST(0.1 AS DOUBLE), 1)=-3.0600528894266366e+181, percentile(val, CAST(0 AS DOUBLE), 1)=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), 1)=nan, percentile(val, array(0.1), 1)=[-3.0600528894266366e+181], percentile(val, array(), 1)=None, percentile(val, array(0.1, 0.5, 0.9), 1)=[-3.0600528894266366e+181, -4.9069119243789216e-275, 1.7532295949136916e+204], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), 1)=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.9069119243789216e-275, nan, nan], percentile(val, CAST(0.1 AS DOUBLE), abs(freq))=-1.3398677426484608e+183, percentile(val, CAST(0 AS DOUBLE), abs(freq))=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[-1.3398677426484608e+183], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[-1.3398677426484608e+183, -4.302064318624199e-276, 5.054511151289938e+220], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), abs(freq))=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.302064318624199e-276, nan, nan])
09:44:19 +Row(percentile(val, CAST(0.1 AS DOUBLE), 1)=-3.0600528894266366e+181, percentile(val, CAST(0 AS DOUBLE), 1)=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), 1)=nan, percentile(val, array(0.1), 1)=[-3.0600528894266366e+181], percentile(val, array(), 1)=None, percentile(val, array(0.1, 0.5, 0.9), 1)=[-3.0600528894266366e+181, -4.302064318624199e-276, 1.7532295949136916e+204], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), 1)=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.302064318624199e-276, nan, nan], percentile(val, CAST(0.1 AS DOUBLE), abs(freq))=-1.3398677426484608e+183, percentile(val, CAST(0 AS DOUBLE), abs(freq))=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[-1.3398677426484608e+183], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[-1.3398677426484608e+183, -4.302064318624199e-276, 5.054511151289938e+220], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), abs(freq))=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.302064318624199e-276, nan, nan])
Detailed output
_ test_exact_percentile_reduction[[('val', RepeatSeq(Double)), ('freq', Long(not_null))]] _
09:44:19
09:44:19 data_gen = [('val', RepeatSeq(Double)), ('freq', Long(not_null))]
09:44:19
09:44:19     @pytest.mark.parametrize('data_gen', exact_percentile_reduction_data_gen, ids=idfn)
09:44:19     def test_exact_percentile_reduction(data_gen):
09:44:19 >       assert_gpu_and_cpu_are_equal_collect(
09:44:19             lambda spark: exact_percentile_reduction(gen_df(spark, data_gen))
09:44:19         )
09:44:19
09:44:19 ../../src/main/python/hash_aggregate_test.py:922:
09:44:19 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
09:44:19 ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
09:44:19     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
09:44:19 ../../src/main/python/asserts.py:517: in _assert_gpu_and_cpu_are_equal
09:44:19     assert_equal(from_cpu, from_gpu)
09:44:19 ../../src/main/python/asserts.py:107: in assert_equal
09:44:19     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
09:44:19 ../../src/main/python/asserts.py:43: in _assert_equal
09:44:19     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
09:44:19 ../../src/main/python/asserts.py:36: in _assert_equal
09:44:19     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
09:44:19 ../../src/main/python/asserts.py:43: in _assert_equal
09:44:19     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
09:44:19 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
09:44:19
09:44:19 cpu = -4.9069119243789216e-275, gpu = -4.302064318624199e-276
09:44:19 float_check = ... at 0x7f1a9a9a4b80>
09:44:19 path = [0, 'percentile(val, array(0.1, 0.5, 0.9), 1)', 1]
09:44:19
09:44:19     def _assert_equal(cpu, gpu, float_check, path):
09:44:19         t = type(cpu)
09:44:19         if (t is Row):
09:44:19             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
09:44:19             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
09:44:19                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
09:44:19                 for field in cpu.__fields__:
09:44:19                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
09:44:19             else:
09:44:19                 for index in range(len(cpu)):
09:44:19                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
09:44:19         elif (t is list):
09:44:19             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
09:44:19             for index in range(len(cpu)):
09:44:19                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
09:44:19         elif (t is tuple):
09:44:19             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
09:44:19             for index in range(len(cpu)):
09:44:19                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
09:44:19         elif (t is pytypes.GeneratorType):
09:44:19             index = 0
09:44:19             # generator has no zip :( so we have to do this the hard way
09:44:19             done = False
09:44:19             while not done:
09:44:19                 sub_cpu = None
09:44:19                 sub_gpu = None
09:44:19                 try:
09:44:19                     sub_cpu = next(cpu)
09:44:19                 except StopIteration:
09:44:19                     done = True
09:44:19
09:44:19                 try:
09:44:19                     sub_gpu = next(gpu)
09:44:19                 except StopIteration:
09:44:19                     done = True
09:44:19
09:44:19                 if done:
09:44:19                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
09:44:19                 else:
09:44:19                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
09:44:19
09:44:19                 index = index + 1
09:44:19         elif (t is dict):
09:44:19             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
09:44:19             # so sort the items to do our best with ignoring the order of dicts
09:44:19             cpu_items = list(cpu.items()).sort(key=_RowCmp)
09:44:19             gpu_items = list(gpu.items()).sort(key=_RowCmp)
09:44:19             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
09:44:19         elif (t is int):
09:44:19             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
09:44:19         elif (t is float):
09:44:19             if (math.isnan(cpu)):
09:44:19                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
09:44:19             else:
09:44:19 >               assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
09:44:19 E               AssertionError: GPU and CPU float values are different [0, 'percentile(val, array(0.1, 0.5, 0.9), 1)', 1]
09:44:19
09:44:19 ../../src/main/python/asserts.py:83: AssertionError
09:44:19 ----------------------------- Captured stdout call -----------------------------
09:44:19 ### CPU RUN ###
09:44:19 ### GPU RUN ###
09:44:19 ### COLLECT: GPU TOOK 0.26613664627075195 CPU TOOK 0.17803049087524414 ###
09:44:19 --- CPU OUTPUT
09:44:19 +++ GPU OUTPUT
09:44:19 @@ -1 +1 @@
09:44:19 -Row(percentile(val, CAST(0.1 AS DOUBLE), 1)=-3.0600528894266366e+181, percentile(val, CAST(0 AS DOUBLE), 1)=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), 1)=nan, percentile(val, array(0.1), 1)=[-3.0600528894266366e+181], percentile(val, array(), 1)=None, percentile(val, array(0.1, 0.5, 0.9), 1)=[-3.0600528894266366e+181, -4.9069119243789216e-275, 1.7532295949136916e+204], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), 1)=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.9069119243789216e-275, nan, nan], percentile(val, CAST(0.1 AS DOUBLE), abs(freq))=-1.3398677426484608e+183, percentile(val, CAST(0 AS DOUBLE), abs(freq))=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[-1.3398677426484608e+183], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[-1.3398677426484608e+183, -4.302064318624199e-276, 5.054511151289938e+220], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), abs(freq))=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.302064318624199e-276, nan, nan])
09:44:19 +Row(percentile(val, CAST(0.1 AS DOUBLE), 1)=-3.0600528894266366e+181, percentile(val, CAST(0 AS DOUBLE), 1)=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), 1)=nan, percentile(val, array(0.1), 1)=[-3.0600528894266366e+181], percentile(val, array(), 1)=None, percentile(val, array(0.1, 0.5, 0.9), 1)=[-3.0600528894266366e+181, -4.302064318624199e-276, 1.7532295949136916e+204], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), 1)=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.302064318624199e-276, nan, nan], percentile(val, CAST(0.1 AS DOUBLE), abs(freq))=-1.3398677426484608e+183, percentile(val, CAST(0 AS DOUBLE), abs(freq))=-2.4711378026196358e+293, percentile(val, CAST(1 AS DOUBLE), abs(freq))=nan, percentile(val, array(0.1), abs(freq))=[-1.3398677426484608e+183], percentile(val, array(), abs(freq))=None, percentile(val, array(0.1, 0.5, 0.9), abs(freq))=[-1.3398677426484608e+183, -4.302064318624199e-276, 5.054511151289938e+220], percentile(val, array(CAST(0 AS DECIMAL(14,4)), CAST(0.0001 AS DECIMAL(14,4)), CAST(0.5 AS DECIMAL(14,4)), CAST(0.9999 AS DECIMAL(14,4)), CAST(1 AS DECIMAL(14,4))), abs(freq))=[-2.4711378026196358e+293, -2.4711378026196358e+293, -4.302064318624199e-276, nan, nan])
Steps/Code to reproduce bug
Expected behavior
Environment details (please complete the following information)
- Environment location: Dataproc 2.0 Ubuntu 18.04
Additional context
This is 100% repeatable, and the CPU produces a different result for the 0.5 percentile (the median) every time. I think this is a bug in Spark that I found a while ago:
https://issues.apache.org/jira/browse/SPARK-45599
Not sure if we want to avoid -0.0 in our test cases until this is fixed, or what. (This run had 42 out of 2048 values that were -0.0 and 42 that were 0.0, which is exactly what is needed to trigger the error in Spark.)
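The core ambiguity can be seen in plain Python. The exact role these zeros play inside Spark's percentile implementation is detailed in SPARK-45599, but the underlying issue is that -0.0 and 0.0 are equal under `==` (so they collapse when used as keys) while remaining distinct, sign-aware values that some comparators order differently:

```python
import math

# -0.0 and 0.0 compare equal under IEEE 754 semantics...
assert -0.0 == 0.0

# ...so they collapse into a single dict key (counts merge together):
counts = {}
for v in [0.0, -0.0, -0.0]:
    counts[v] = counts.get(v, 0) + 1
assert counts[0.0] == 3

# ...yet they are distinct bit patterns, and a sign-aware comparator
# (such as Java's Double.compare, used in parts of Spark) orders -0.0 before 0.0:
assert math.copysign(1.0, -0.0) == -1.0
assert math.copysign(1.0, 0.0) == 1.0
```

Mixing those two conventions when counting and ordering values is what lets a percentile index land on a different element depending on how many -0.0s vs 0.0s are present.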
I think the solution here is to update FloatGen and DoubleGen so that we can replace -0.0 with 0.0. We would enable it for these tests but keep other tests still generating -0.0s. We should also file a follow-on issue so that when SPARK-45599 is fixed we can come back and turn -0.0 testing back on for the versions of Spark that get the right answer.
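As a rough illustration of that idea (the names `gen_double` and `normalize_zero` below are hypothetical stand-ins, not the real spark-rapids DataGen API), the generator's output could be post-processed so that no -0.0 survives while every other value, including NaN, passes through untouched:

```python
import math
import random

def normalize_zero(value):
    """Replace -0.0 with 0.0; leave every other value (including NaN) alone."""
    if value == 0.0 and math.copysign(1.0, value) == -1.0:
        return 0.0
    return value

def gen_double(rand):
    # Stand-in for DoubleGen: occasionally emits -0.0 alongside other doubles.
    return rand.choice([-0.0, 0.0, 1.5, -2.5, float('nan')])

rand = random.Random(0)
values = [normalize_zero(gen_double(rand)) for _ in range(1000)]

# No -0.0 remains in the generated data.
assert all(not (v == 0.0 and math.copysign(1.0, v) == -1.0) for v in values)
```

Making the normalization opt-in per generator instance would let these percentile tests avoid the Spark bug while other tests keep exercising -0.0.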
The underlying issue SPARK-45599 has been resolved, so we should follow up to turn on -0.0 testing for Spark 4.0.0+ and 3.5.2+.
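A minimal sketch of what that version gate might look like, assuming a plain version string is available (the helper name and parsing below are illustrative, not spark-rapids' actual version utilities):

```python
def has_spark_45599_fix(version):
    """Return True for Spark versions carrying the SPARK-45599 fix: 4.0.0+ or 3.5.2+."""
    parts = tuple(int(p) for p in version.split('.')[:3])
    return parts >= (4, 0, 0) or ((3, 5, 2) <= parts < (4, 0, 0))

assert has_spark_45599_fix('4.0.0')
assert has_spark_45599_fix('3.5.2')
assert not has_spark_45599_fix('3.5.1')
assert not has_spark_45599_fix('3.4.3')
```

A check like this could drive whether -0.0 values are kept in the generated data for the percentile tests on a given Spark build.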