spark-rapids [BUG] `test_range_running_window_float_decimal_sum_runs_batched` fails intermittently

On certain CI runs, one sees failures in the window function test named test_range_running_window_float_decimal_sum_runs_batched:

[2024-02-04T23:11:11.547Z] FAILED ../../src/main/python/window_function_test.py::test_range_running_window_float_decimal_sum_runs_batched[1000][DATAGEN_SEED=1707064878, INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT]

The reported diffs are as follows:

-Row(... double_sum=4.797632847838746e-69...)
+Row(... double_sum=4.797632847838745e-69...)
...
-Row(... double_sum=3.1738776725114095e+88...)
+Row(... double_sum=3.173877672511409e+88...)

This is a little strange, given that the test is marked as @approximate_float. So far, it has also proven impossible to reproduce locally.

Feb 05 '24 22:02 mythrocks

Here is a sampling of the most egregious diffs in the output:

-Row(p=None, oby=None, short_double_sum=None, double_sum=3.320125371694111e-34, short_float_sum=None, float_sum=1.425792694093375e+33, dec_sum=None)
+Row(p=None, oby=None, short_double_sum=None, double_sum=3.3201253716941104e-34, short_float_sum=None, float_sum=1.425792694093375e+33, dec_sum=None) 
-Row(p=None, oby=None, short_double_sum=None, double_sum=1.423710382667759e+126, short_float_sum=None, float_sum=3.2242434807184336e-06, dec_sum=None) 
+Row(p=None, oby=None, short_double_sum=None, double_sum=1.423710382667759e+126, short_float_sum=None, float_sum=3.224243480718433e-06, dec_sum=None)
-Row(p=None, oby=None, short_double_sum=None, double_sum=1.9011342475589104e+267, short_float_sum=None, float_sum=1.3461754566575804e+19, dec_sum=None)
+Row(p=None, oby=None, short_double_sum=None, double_sum=1.9011342475589104e+267, short_float_sum=None, float_sum=1.3461754566575806e+19, dec_sum=None)
-Row(p=None, oby=None, short_double_sum=None, double_sum=3.536650741791709e+271, short_float_sum=None, float_sum=1.5652249836245965e+32, dec_sum=None)
+Row(p=None, oby=None, short_double_sum=None, double_sum=3.536650741791709e+271, short_float_sum=None, float_sum=1.5652249836245963e+32, dec_sum=None)
-Row(p=None, oby=-32768, short_double_sum=-360448.0, double_sum=4.977188684129671e-210, short_float_sum=-360448.0, float_sum=1.2578720332658084e+29, dec_sum=Decimal('-360448.0')) 
+Row(p=None, oby=-32768, short_double_sum=-360448.0, double_sum=4.977188684129671e-210, short_float_sum=-360448.0, float_sum=1.2578720332658082e+29, dec_sum=Decimal('-360448.0'))
-Row(p=None, oby=-32436, short_double_sum=-556519.0, double_sum=1.5976044957509647e-78, short_float_sum=-556519.0, float_sum=1857.6042175782272, dec_sum=Decimal('-556519.0'))
+Row(p=None, oby=-32436, short_double_sum=-556519.0, double_sum=1.5976044957509644e-78, short_float_sum=-556519.0, float_sum=1857.6042175782272, dec_sum=Decimal('-556519.0'))
-Row(p=None, oby=-30594, short_double_sum=-964181.0, double_sum=1.4251673808780018e+16, short_float_sum=-964181.0, float_sum=1.6525291491337176e-27, dec_sum=Decimal('-964181.0'))
+Row(p=None, oby=-30594, short_double_sum=-964181.0, double_sum=1.4251673808780016e+16, short_float_sum=-964181.0, float_sum=1.6525291491337176e-27, dec_sum=Decimal('-964181.0'))
-Row(p=None, oby=-30508, short_double_sum=-994689.0, double_sum=2.3460384328273062e+165, short_float_sum=-994689.0, float_sum=2.6449957901576276e-12, dec_sum=Decimal('-994689.0'))
+Row(p=None, oby=-30508, short_double_sum=-994689.0, double_sum=2.3460384328273062e+165, short_float_sum=-994689.0, float_sum=2.6449957901576272e-12, dec_sum=Decimal('-994689.0'))
-Row(p=None, oby=-28407, short_double_sum=-1997254.0, double_sum=1.575301213703218e+61, short_float_sum=-1997254.0, float_sum=3.940248080377917e+32, dec_sum=Decimal('-1997254.0'))
+Row(p=None, oby=-28407, short_double_sum=-1997254.0, double_sum=1.5753012137032178e+61, short_float_sum=-1997254.0, float_sum=3.940248080377916e+32, dec_sum=Decimal('-1997254.0'))
-Row(p=None, oby=-27989, short_double_sum=-2166308.0, double_sum=5.7222003320949304e-111, short_float_sum=-2166308.0, float_sum=1.8995236069283106e+32, dec_sum=Decimal('-2166308.0'))
+Row(p=None, oby=-27989, short_double_sum=-2166308.0, double_sum=5.72220033209493e-111, short_float_sum=-2166308.0, float_sum=1.8995236069283103e+32, dec_sum=Decimal('-2166308.0'))
-Row(p=None, oby=-22422, short_double_sum=-3762895.0, double_sum=1.3124717977250846e-206, short_float_sum=-3762895.0, float_sum=5.110702103270278e-35, dec_sum=Decimal('-3762895.0'))
+Row(p=None, oby=-22422, short_double_sum=-3762895.0, double_sum=1.3124717977250843e-206, short_float_sum=-3762895.0, float_sum=5.110702103270278e-35, dec_sum=Decimal('-3762895.0'))
-Row(p=None, oby=-19922, short_double_sum=-4435771.0, double_sum=2.9331653242274835e-241, short_float_sum=-4435771.0, float_sum=1.5316559583604917e+27, dec_sum=Decimal('-4435771.0'))
+Row(p=None, oby=-19922, short_double_sum=-4435771.0, double_sum=2.9331653242274832e-241, short_float_sum=-4435771.0, float_sum=1.5316559583604914e+27, dec_sum=Decimal('-4435771.0'))
-Row(p=None, oby=-14911, short_double_sum=-5405015.0, double_sum=2.9797860062816927e-167, short_float_sum=-5405015.0, float_sum=1.4775188260736515e-29, dec_sum=Decimal('-5405015.0'))
+Row(p=None, oby=-14911, short_double_sum=-5405015.0, double_sum=2.9797860062816923e-167, short_float_sum=-5405015.0, float_sum=1.4775188260736515e-29, dec_sum=Decimal('-5405015.0'))
-Row(p=None, oby=-14022, short_double_sum=-5591232.0, double_sum=2.3870085730304104e-255, short_float_sum=-5591232.0, float_sum=1.6656914299845682e+31, dec_sum=Decimal('-5591232.0'))
+Row(p=None, oby=-14022, short_double_sum=-5591232.0, double_sum=2.3870085730304104e-255, short_float_sum=-5591232.0, float_sum=1.6656914299845685e+31, dec_sum=Decimal('-5591232.0'))
-Row(p=None, oby=-12091, short_double_sum=-5889954.0, double_sum=7.01490562779971e+222, short_float_sum=-5889954.0, float_sum=1.2687089330344853e+31, dec_sum=Decimal('-5889954.0'))
+Row(p=None, oby=-12091, short_double_sum=-5889954.0, double_sum=7.01490562779971e+222, short_float_sum=-5889954.0, float_sum=1.2687089330344856e+31, dec_sum=Decimal('-5889954.0'))
-Row(p=None, oby=-10653, short_double_sum=-6057595.0, double_sum=1347267998.1720455, short_float_sum=-6057595.0, float_sum=1.495041434408932e+17, dec_sum=Decimal('-6057595.0'))
+Row(p=None, oby=-10653, short_double_sum=-6057595.0, double_sum=1347267998.1720452, short_float_sum=-6057595.0, float_sum=1.495041434408932e+17, dec_sum=Decimal('-6057595.0'))
-Row(p=None, oby=4621, short_double_sum=-6544450.0, double_sum=9.01343733242911e+182, short_float_sum=-6544450.0, float_sum=2.1577671275371775e-08, dec_sum=Decimal('-6544450.0'))
+Row(p=None, oby=4621, short_double_sum=-6544450.0, double_sum=9.01343733242911e+182, short_float_sum=-6544450.0, float_sum=2.1577671275371772e-08, dec_sum=Decimal('-6544450.0'))

I'm about halfway through eyeballing the output. I'll post here if I find any deviations that are worse. I think the above should have passed the @approximate_float test.
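As a rough sanity check on that claim, here is a minimal sketch that computes the relative difference for a few of the pairs listed above. The tolerance `ASSUMED_REL_TOL` is a hypothetical value chosen for illustration; it is not taken from the spark-rapids test harness, and `relative_difference` is just a helper name for this sketch.

```python
import math

# A few ('-' side, '+' side) value pairs copied verbatim from the diffs above.
pairs = [
    (3.320125371694111e-34, 3.3201253716941104e-34),   # double_sum
    (1.5976044957509647e-78, 1.5976044957509644e-78),  # double_sum
    (1347267998.1720455, 1347267998.1720452),          # double_sum
    (1.3461754566575804e+19, 1.3461754566575806e+19),  # float_sum
]

# Hypothetical tolerance for illustration only; not the harness's real setting.
ASSUMED_REL_TOL = 1e-9


def relative_difference(a, b):
    """Relative difference of two non-zero finite floats."""
    return abs(a - b) / max(abs(a), abs(b))


rel_diffs = [relative_difference(a, b) for a, b in pairs]

for (a, b), rel in zip(pairs, rel_diffs):
    close = math.isclose(a, b, rel_tol=ASSUMED_REL_TOL)
    print(f"{a!r} vs {b!r}: rel diff = {rel:.3e}, within tolerance = {close}")
```

Each of these pairs differs only in the last printed digit, so under any tolerance of this order they would compare as equal. That suggests the failing row is something other than these last-digit differences.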

It appears that this error didn't occur on the last run, although a different test did fail: https://github.com/NVIDIA/spark-rapids/issues/10388.

Feb 07 '24 00:02 mythrocks

Ah, shoot. Here it is:

-Row(p=-1537828595, oby=26650, short_double_sum=32330.0, double_sum=inf, short_float_sum=32330.0, float_sum=7.066224196393988e+23, dec_sum=Decimal('32330.0'))
+Row(p=-1537828595, oby=26650, short_double_sum=32330.0, double_sum=1.7976931348623157e+308, short_float_sum=32330.0, float_sum=7.066224196393988e+23, dec_sum=Decimal('32330.0'))

Looks like the GPU produces a very large finite number (1.7976931348623157e+308) where the CPU produces inf. I'll try to repro this.

Feb 07 '24 19:02 mythrocks

I should mention here that this test failed once a couple of weeks ago, and hasn't been reproducible since. :/

Feb 21 '24 19:02 mythrocks

Closing this as not reproducible. We'll reopen if this occurs again.

Mar 14 '24 18:03 mythrocks

Saw this fail again in the nightly build on 24.10:

[2024-10-10T15:25:18.293Z] FAILED ../../src/main/python/window_function_test.py::test_range_running_window_float_decimal_sum_runs_batched[1000][DATAGEN_SEED=1728564930, TZ=UTC, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] - AssertionError: GPU and CPU float values are different [6904, 'double_sum']

Oct 10 '24 18:10 jlowe

Looks like the most recent failure was again an issue with infinity:

[2024-10-10T15:25:18.289Z] -Row(p=-81560160, oby=-14318, short_double_sum=-161930.0, double_sum=inf, short_float_sum=-161930.0, float_sum=2.215917728557696e-10, dec_sum=Decimal('-161930.0'))
[2024-10-10T15:25:18.289Z] +Row(p=-81560160, oby=-14318, short_double_sum=-161930.0, double_sum=1.7976931348623157e+308, short_float_sum=-161930.0, float_sum=2.215917728557696e-10, dec_sum=Decimal('-161930.0'))

I think the issue here is that the value the GPU comes up with is just an epsilon away from infinity (it's the largest finite double value), so maybe we need to update the approx float logic to account for this.
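One possible shape for such a check, as a minimal sketch only: `approx_equal_with_overflow` and its `rel_tol` default are hypothetical names and values for illustration, not the existing spark-rapids comparison code. Python's `math.isclose` treats infinity as close only to itself, so the inf-versus-largest-finite case needs an explicit branch.

```python
import math
import sys


def approx_equal_with_overflow(cpu, gpu, rel_tol=1e-9):
    """Approximate float comparison that tolerates overflow at the edge.

    Hypothetical helper: treats +/-inf as close to a same-signed finite
    value that lies within rel_tol of the largest finite double.
    """
    # NaN only matches NaN.
    if math.isnan(cpu) or math.isnan(gpu):
        return math.isnan(cpu) and math.isnan(gpu)

    if math.isinf(cpu) or math.isinf(gpu):
        if cpu == gpu:
            return True  # same-signed infinities
        inf_val, other = (cpu, gpu) if math.isinf(cpu) else (gpu, cpu)
        if math.isinf(other):
            return False  # +inf versus -inf
        same_sign = math.copysign(1.0, inf_val) == math.copysign(1.0, other)
        near_max = abs(other) >= sys.float_info.max * (1.0 - rel_tol)
        return same_sign and near_max

    # Ordinary finite values: plain relative comparison.
    return math.isclose(cpu, gpu, rel_tol=rel_tol)


# The pair from the failing row above.
print(approx_equal_with_overflow(float("inf"), 1.7976931348623157e+308))
```

Whether accepting this case is the right call depends on whether the CPU and GPU results are expected to overflow identically; the sketch only shows how the comparison could be relaxed, not that it should be.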

Oct 10 '24 18:10 jlowe

Thanks for recording the datagen seed value, @jlowe. I'll try to repro this with the failing seed next week.

Oct 11 '24 23:10 mythrocks

Another repro in rapids_databricks_nightly-pre_release-github, run:701 (branch-25.04), DATAGEN_SEED=1742879618:

FAILED ../../src/main/python/window_function_test.py::test_range_running_window_float_decimal_sum_runs_batched[1000][DATAGEN_SEED=1742879618, TZ=UTC, INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] - AssertionError: GPU and CPU float values are different [827, 'double_sum']

[2025-03-25T06:47:50.105Z] 
[2025-03-25T06:47:50.105Z] cpu = inf, gpu = 1.7976931348623157e+308
[2025-03-25T06:47:50.105Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f4784535ea0>
[2025-03-25T06:47:50.105Z] path = [827, 'double_sum']
[2025-03-25T06:47:50.105Z]

Mar 25 '25 07:03 pxLi
