[BUG] test_exact_percentile_groupby fails intermittently: GPU and CPU float values are different
Describe the bug
First seen in rapids_it-matrix-dev-github, run:251 (25.08).
Failed on spark321 with DATAGEN_SEED=1749332659, INJECT_OOM:
AssertionError: GPU and CPU float values are different [0, 'percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))', 1]
SPARK_VER = '3.2.1' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]1][DATAGEN_SEED=1749332659, TZ=UTC, INJECT_OOM, IGNORE_ORDER]
...
AssertionError: GPU and CPU float values are different [0, 'percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))', 1]
SPARK_VER = '3.2.1' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]1][DATAGEN_SEED=1749332659, TZ=UTC, INJECT_OOM, IGNORE_ORDER]
cpu = -3.3551839378558946e+38, gpu = -3.354843655509256e+38
Steps/Code to reproduce bug
Cannot reproduce locally with DATAGEN_SEED=1749332659 SPARK_RAPIDS_TEST_INJECT_OOM_SEED=1749332659 on spark321. Some other issue may be causing the non-deterministic results.
This has not appeared again, even with the same datagen seed. We usually run in a setup where the shuffle is done on a single node. The results depend on the shuffle and can therefore be non-deterministic. Since this has not reproduced, it will be closed for now.
A repro in rapids-it-azure-databricks-13.3, run: 272
[2025-08-21T08:19:59.023Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1755755413, TZ=UTC, IGNORE_ORDER] - AssertionError: GPU and CPU float values are different [0, 'percentile(val, 0.1, 1)']
...
cpu = -6.399965793919102e+28, gpu = -4.8128458175264785e+28
[2025-08-21T08:19:59.023Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1755755413, TZ=UTC, INJECT_OOM, IGNORE_ORDER] - AssertionError: GPU and CPU float values are different [0, 'percentile(val, 0.1, 1)']
...
cpu = -6.399965793919102e+28, gpu = -4.8128458175264785e+28
Removed the wontfix label for now. Feel free to close if we still allow the non-deterministic case cc @sameerz
cc-ing @mattahrens.
Underlying Spark bug is fixed in Spark 3.5.2: https://issues.apache.org/jira/browse/SPARK-45599.
Remediation for this issue is updating the test case to not use -0.0 for versions older than Spark 3.5.2.
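A minimal sketch of that remediation, not the actual test change: it replaces -0.0 with 0.0 in generated values only when the Spark version is older than 3.5.2. The helper names (`parse_version`, `is_neg_zero`, `normalize_neg_zero`) are hypothetical and are not part of the integration test framework; the real fix would more likely live in the data generator configuration.

```python
import math

# Assumption taken from the comment above: SPARK-45599 is fixed in 3.5.2.
FIXED_IN = (3, 5, 2)

def parse_version(version):
    """Turn '3.2.1' into (3, 2, 1), ignoring suffixes such as '-SNAPSHOT'."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split(".")[:3])

def is_neg_zero(x):
    """True only for -0.0; plain equality cannot tell it apart from 0.0."""
    return x == 0.0 and math.copysign(1.0, x) < 0

def normalize_neg_zero(values, spark_version):
    """Replace -0.0 with 0.0 on Spark versions that still have the bug."""
    if parse_version(spark_version) >= FIXED_IN:
        return list(values)
    return [0.0 if (v is not None and is_neg_zero(v)) else v for v in values]
```

For example, `normalize_neg_zero([-0.0, 1.5, None], "3.2.1")` returns a list whose first element is positive zero, while the same call with `"3.5.2"` leaves the input values unchanged.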
Another repro in rapids_it-matrix-pre_release-github/175
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]0][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]0][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-false-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-false-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-true-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-true-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-false-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-false-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-true-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-true-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
test_exact_percentile_groupby and test_exact_percentile_groupby_partial_fallback_to_cpu can intermittently produce mismatched CPU and GPU output
Another repro in rapids-it-azure-databricks-14.3/181
[2025-12-07T10:11:20.116Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1765092256, TZ=UTC, IGNORE_ORDER] - AssertionError: GPU (-3.1334851557073905e+21) and CPU (-3.848356931796457e+21) float values are different at [0, 'percentile(val, 0.1, 1)']
[2025-12-07T10:11:20.116Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1765092256, TZ=UTC, IGNORE_ORDER] - AssertionError: GPU (-3.1334851557073905e+21) and CPU (-3.848356931796457e+21) float values are different at [0, 'percentile(val, 0.1, 1)']