[BUG] test_exact_percentile_groupby failed GPU and CPU float values are different intermittently

Open pxLi opened this issue 7 months ago • 6 comments

Describe the bug: first seen in rapids_it-matrix-dev-github, run:251 (25.08)

failed spark321 DATAGEN_SEED=1749332659, INJECT_OOM

AssertionError: GPU and CPU float values are different [0, 'percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))', 1]
SPARK_VER = '3.2.1' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]1][DATAGEN_SEED=1749332659, TZ=UTC, INJECT_OOM, IGNORE_ORDER]
...
AssertionError: GPU and CPU float values are different [0, 'percentile(val, array(0, 0.0001, 0.5, 0.9999, 1))', 1]
SPARK_VER = '3.2.1' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]1][DATAGEN_SEED=1749332659, TZ=UTC, INJECT_OOM, IGNORE_ORDER]
cpu = -3.3551839378558946e+38, gpu = -3.354843655509256e+38

Steps/Code to reproduce bug: cannot reproduce locally with DATAGEN_SEED=1749332659 SPARK_RAPIDS_TEST_INJECT_OOM_SEED=1749332659 on spark321; some other issue may be causing the non-deterministic results.

pxLi avatar Jun 09 '25 00:06 pxLi

This has not appeared again, even with the same datagen seed. We usually run with the shuffle done on a single node. The results depend on the shuffle and can therefore be non-deterministic. Since this has not reproduced, it will be closed for now.

sameerz avatar Jun 17 '25 20:06 sameerz

a repro in rapids-it-azure-databricks-13.3 run: 272

[2025-08-21T08:19:59.023Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1755755413, TZ=UTC, IGNORE_ORDER] - AssertionError: GPU and CPU float values are different [0, 'percentile(val, 0.1, 1)']
...
cpu = -6.399965793919102e+28, gpu = -4.8128458175264785e+28


[2025-08-21T08:19:59.023Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1755755413, TZ=UTC, INJECT_OOM, IGNORE_ORDER] - AssertionError: GPU and CPU float values are different [0, 'percentile(val, 0.1, 1)']
...
cpu = -6.399965793919102e+28, gpu = -4.8128458175264785e+28

Removed the wontfix label for now. Feel free to close if we still allow the non-deterministic case cc @sameerz

pxLi avatar Aug 21 '25 08:08 pxLi

> Removed the wontfix label for now. Feel free to close if we still allow the non-deterministic case cc @sameerz

cc-ing @mattahrens.

mythrocks avatar Aug 21 '25 18:08 mythrocks

Underlying Spark bug is fixed in Spark 3.5.2: https://issues.apache.org/jira/browse/SPARK-45599.

Remediation for this issue is updating the test case to not use -0.0 for versions older than Spark 3.5.2.
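As a rough illustration of why signed zeros are awkward for this kind of test, and of what the suggested remediation might look like, here is a minimal Python sketch. The helper names (`float_bits`, `normalize_zero`, `spark_version_before`) are hypothetical and are not part of the spark-rapids test framework. The sketch does not claim to reproduce the mechanism of the Spark bug itself. It only shows that `-0.0` and `0.0` compare equal while having different bit patterns, and how generated values could be normalized when the Spark version is older than 3.5.2.

```python
import math
import struct


def float_bits(x: float) -> int:
    """Return the IEEE 754 bit pattern of a double as an unsigned 64-bit int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def normalize_zero(x):
    """Map -0.0 to 0.0; leave other values (including None and NaN) untouched."""
    if x is not None and x == 0.0:
        return 0.0
    return x


def spark_version_before(version: str, minimum: str) -> bool:
    """Numerically compare plain dotted versions, e.g. '3.2.1' < '3.5.2'.

    Assumes purely numeric components (no '-SNAPSHOT' style suffixes).
    """
    def parse(v: str):
        return tuple(int(part) for part in v.split(".")[:3])

    return parse(version) < parse(minimum)


def prepare_values(values, spark_version: str):
    """Hypothetical pre-processing step: drop the sign of zero on older Spark."""
    if spark_version_before(spark_version, "3.5.2"):
        return [normalize_zero(v) for v in values]
    return list(values)


if __name__ == "__main__":
    # Equal under ==, yet distinguishable by bit pattern and by copysign.
    print(-0.0 == 0.0)
    print(float_bits(-0.0) != float_bits(0.0))
    cleaned = prepare_values([-0.0, 1.5, None, 0.0], "3.2.1")
    print([math.copysign(1.0, v) if v is not None else None for v in cleaned])
```

Any code that groups or counts values by bit pattern rather than by numeric equality can treat the two zeros as separate entries, which is why removing `-0.0` from the generated data is a plausible way to sidestep the problem on affected versions.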

mattahrens avatar Sep 16 '25 20:09 mattahrens

another repro in rapids_it-matrix-pre_release-github/175

Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]0][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]0][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-false-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-false-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-true-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[true-true-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-false-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-false-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-true-partial-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]
Matrix - SPARK_VER = '3.2.4' / Regular Test / src.main.python.hash_aggregate_test.test_exact_percentile_groupby_partial_fallback_to_cpu[false-true-final|complete-[('key', RepeatSeq(Integer)), ('val', Double), ('freq', Long(not_null))]][DATAGEN_SEED=1759822834, TZ=UTC, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(ObjectHashAggregateExec,SortAggregateExec,ShuffleExchangeExec,HashPartitioning,AggregateExpression,Alias,Cast,Literal,ProjectExec,Percentile)]]

Both test_exact_percentile_groupby and test_exact_percentile_groupby_partial_fallback_to_cpu can intermittently produce mismatched CPU and GPU output.

pxLi avatar Oct 08 '25 06:10 pxLi

another repro in rapids-it-azure-databricks-14.3/181

[2025-12-07T10:11:20.116Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[true-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1765092256, TZ=UTC, IGNORE_ORDER] - AssertionError: GPU (-3.1334851557073905e+21) and CPU (-3.848356931796457e+21) float values are different at [0, 'percentile(val, 0.1, 1)']

[2025-12-07T10:11:20.116Z] FAILED ../../src/main/python/hash_aggregate_test.py::test_exact_percentile_groupby[false-[('key', RepeatSeq(Integer)), ('val', Float), ('freq', Long(not_null))]0][DATAGEN_SEED=1765092256, TZ=UTC, IGNORE_ORDER] - AssertionError: GPU (-3.1334851557073905e+21) and CPU (-3.848356931796457e+21) float values are different at [0, 'percentile(val, 0.1, 1)']


pxLi avatar Dec 09 '25 08:12 pxLi