[BUG] `test_range_running_window_float_decimal_sum_runs_batched` fails intermittently

Open mythrocks opened this issue 1 year ago • 7 comments

On certain CI runs, the window function test `test_range_running_window_float_decimal_sum_runs_batched` fails:

[2024-02-04T23:11:11.547Z] FAILED ../../src/main/python/window_function_test.py::test_range_running_window_float_decimal_sum_runs_batched[1000][DATAGEN_SEED=1707064878, INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT]

The reported diffs are as follows:

-Row(... double_sum=4.797632847838746e-69...)
+Row(... double_sum=4.797632847838745e-69...)
...
-Row(... double_sum=3.1738776725114095e+88...)
+Row(... double_sum=3.173877672511409e+88...)

This is a little strange, given that the test is marked as @approximate_float. So far, it has also proven impossible to reproduce locally.
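
As a quick sanity check on why these diffs look harmless, here is a minimal sketch that measures the relative difference for the two pairs quoted above. The tolerance `REL_TOL` is a hypothetical value picked for illustration; it is not taken from the test framework, and `relative_diff` is an illustrative helper, not part of the test suite.

```python
import math

# Hypothetical tolerance for illustration only; the real approximate-float
# check may use a different value or a different comparison altogether.
REL_TOL = 1e-6

# The two pairs of double_sum values quoted in the diff above.
pairs = [
    (4.797632847838746e-69, 4.797632847838745e-69),
    (3.1738776725114095e+88, 3.173877672511409e+88),
]

def relative_diff(a, b):
    """Relative difference, scaled by the larger magnitude."""
    return abs(a - b) / max(abs(a), abs(b))

for a, b in pairs:
    print(f"{a!r} vs {b!r}: "
          f"relative diff = {relative_diff(a, b):.3e}, "
          f"close = {math.isclose(a, b, rel_tol=REL_TOL)}")
```

Both pairs differ only in the last printed digit, so the relative difference is around 1e-16, far below any reasonable approximate-float tolerance. These rows alone should not cause the failure.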

mythrocks avatar Feb 05 '24 22:02 mythrocks

Here is a sampling of the most egregious diffs in the output:

-Row(p=None, oby=None, short_double_sum=None, double_sum=3.320125371694111e-34, short_float_sum=None, float_sum=1.425792694093375e+33, dec_sum=None)
+Row(p=None, oby=None, short_double_sum=None, double_sum=3.3201253716941104e-34, short_float_sum=None, float_sum=1.425792694093375e+33, dec_sum=None) 
-Row(p=None, oby=None, short_double_sum=None, double_sum=1.423710382667759e+126, short_float_sum=None, float_sum=3.2242434807184336e-06, dec_sum=None) 
+Row(p=None, oby=None, short_double_sum=None, double_sum=1.423710382667759e+126, short_float_sum=None, float_sum=3.224243480718433e-06, dec_sum=None)
-Row(p=None, oby=None, short_double_sum=None, double_sum=1.9011342475589104e+267, short_float_sum=None, float_sum=1.3461754566575804e+19, dec_sum=None)
+Row(p=None, oby=None, short_double_sum=None, double_sum=1.9011342475589104e+267, short_float_sum=None, float_sum=1.3461754566575806e+19, dec_sum=None)
-Row(p=None, oby=None, short_double_sum=None, double_sum=3.536650741791709e+271, short_float_sum=None, float_sum=1.5652249836245965e+32, dec_sum=None)
+Row(p=None, oby=None, short_double_sum=None, double_sum=3.536650741791709e+271, short_float_sum=None, float_sum=1.5652249836245963e+32, dec_sum=None)
-Row(p=None, oby=-32768, short_double_sum=-360448.0, double_sum=4.977188684129671e-210, short_float_sum=-360448.0, float_sum=1.2578720332658084e+29, dec_sum=Decimal('-360448.0')) 
+Row(p=None, oby=-32768, short_double_sum=-360448.0, double_sum=4.977188684129671e-210, short_float_sum=-360448.0, float_sum=1.2578720332658082e+29, dec_sum=Decimal('-360448.0'))
-Row(p=None, oby=-32436, short_double_sum=-556519.0, double_sum=1.5976044957509647e-78, short_float_sum=-556519.0, float_sum=1857.6042175782272, dec_sum=Decimal('-556519.0'))
+Row(p=None, oby=-32436, short_double_sum=-556519.0, double_sum=1.5976044957509644e-78, short_float_sum=-556519.0, float_sum=1857.6042175782272, dec_sum=Decimal('-556519.0'))
-Row(p=None, oby=-30594, short_double_sum=-964181.0, double_sum=1.4251673808780018e+16, short_float_sum=-964181.0, float_sum=1.6525291491337176e-27, dec_sum=Decimal('-964181.0'))
+Row(p=None, oby=-30594, short_double_sum=-964181.0, double_sum=1.4251673808780016e+16, short_float_sum=-964181.0, float_sum=1.6525291491337176e-27, dec_sum=Decimal('-964181.0'))
-Row(p=None, oby=-30508, short_double_sum=-994689.0, double_sum=2.3460384328273062e+165, short_float_sum=-994689.0, float_sum=2.6449957901576276e-12, dec_sum=Decimal('-994689.0'))
+Row(p=None, oby=-30508, short_double_sum=-994689.0, double_sum=2.3460384328273062e+165, short_float_sum=-994689.0, float_sum=2.6449957901576272e-12, dec_sum=Decimal('-994689.0'))
-Row(p=None, oby=-28407, short_double_sum=-1997254.0, double_sum=1.575301213703218e+61, short_float_sum=-1997254.0, float_sum=3.940248080377917e+32, dec_sum=Decimal('-1997254.0'))
+Row(p=None, oby=-28407, short_double_sum=-1997254.0, double_sum=1.5753012137032178e+61, short_float_sum=-1997254.0, float_sum=3.940248080377916e+32, dec_sum=Decimal('-1997254.0'))
-Row(p=None, oby=-27989, short_double_sum=-2166308.0, double_sum=5.7222003320949304e-111, short_float_sum=-2166308.0, float_sum=1.8995236069283106e+32, dec_sum=Decimal('-2166308.0'))
+Row(p=None, oby=-27989, short_double_sum=-2166308.0, double_sum=5.72220033209493e-111, short_float_sum=-2166308.0, float_sum=1.8995236069283103e+32, dec_sum=Decimal('-2166308.0'))
-Row(p=None, oby=-22422, short_double_sum=-3762895.0, double_sum=1.3124717977250846e-206, short_float_sum=-3762895.0, float_sum=5.110702103270278e-35, dec_sum=Decimal('-3762895.0'))
+Row(p=None, oby=-22422, short_double_sum=-3762895.0, double_sum=1.3124717977250843e-206, short_float_sum=-3762895.0, float_sum=5.110702103270278e-35, dec_sum=Decimal('-3762895.0'))
-Row(p=None, oby=-19922, short_double_sum=-4435771.0, double_sum=2.9331653242274835e-241, short_float_sum=-4435771.0, float_sum=1.5316559583604917e+27, dec_sum=Decimal('-4435771.0'))
+Row(p=None, oby=-19922, short_double_sum=-4435771.0, double_sum=2.9331653242274832e-241, short_float_sum=-4435771.0, float_sum=1.5316559583604914e+27, dec_sum=Decimal('-4435771.0'))
-Row(p=None, oby=-14911, short_double_sum=-5405015.0, double_sum=2.9797860062816927e-167, short_float_sum=-5405015.0, float_sum=1.4775188260736515e-29, dec_sum=Decimal('-5405015.0'))
+Row(p=None, oby=-14911, short_double_sum=-5405015.0, double_sum=2.9797860062816923e-167, short_float_sum=-5405015.0, float_sum=1.4775188260736515e-29, dec_sum=Decimal('-5405015.0'))
-Row(p=None, oby=-14022, short_double_sum=-5591232.0, double_sum=2.3870085730304104e-255, short_float_sum=-5591232.0, float_sum=1.6656914299845682e+31, dec_sum=Decimal('-5591232.0'))
+Row(p=None, oby=-14022, short_double_sum=-5591232.0, double_sum=2.3870085730304104e-255, short_float_sum=-5591232.0, float_sum=1.6656914299845685e+31, dec_sum=Decimal('-5591232.0'))
-Row(p=None, oby=-12091, short_double_sum=-5889954.0, double_sum=7.01490562779971e+222, short_float_sum=-5889954.0, float_sum=1.2687089330344853e+31, dec_sum=Decimal('-5889954.0'))
+Row(p=None, oby=-12091, short_double_sum=-5889954.0, double_sum=7.01490562779971e+222, short_float_sum=-5889954.0, float_sum=1.2687089330344856e+31, dec_sum=Decimal('-5889954.0'))
-Row(p=None, oby=-10653, short_double_sum=-6057595.0, double_sum=1347267998.1720455, short_float_sum=-6057595.0, float_sum=1.495041434408932e+17, dec_sum=Decimal('-6057595.0'))
+Row(p=None, oby=-10653, short_double_sum=-6057595.0, double_sum=1347267998.1720452, short_float_sum=-6057595.0, float_sum=1.495041434408932e+17, dec_sum=Decimal('-6057595.0'))
-Row(p=None, oby=4621, short_double_sum=-6544450.0, double_sum=9.01343733242911e+182, short_float_sum=-6544450.0, float_sum=2.1577671275371775e-08, dec_sum=Decimal('-6544450.0'))
+Row(p=None, oby=4621, short_double_sum=-6544450.0, double_sum=9.01343733242911e+182, short_float_sum=-6544450.0, float_sum=2.1577671275371772e-08, dec_sum=Decimal('-6544450.0'))

I'm about halfway through eyeballing the output. I'll post here if I find any worse deviations. I think the above should have passed the @approximate_float test.

It appears that this error didn't occur on the last run, although a different test did fail: https://github.com/NVIDIA/spark-rapids/issues/10388.

mythrocks avatar Feb 07 '24 00:02 mythrocks

Ah, shoot. Here it is:

-Row(p=-1537828595, oby=26650, short_double_sum=32330.0, double_sum=inf, short_float_sum=32330.0, float_sum=7.066224196393988e+23, dec_sum=Decimal('32330.0'))
+Row(p=-1537828595, oby=26650, short_double_sum=32330.0, double_sum=1.7976931348623157e+308, short_float_sum=32330.0, float_sum=7.066224196393988e+23, dec_sum=Decimal('32330.0'))

Looks like the GPU produces the largest finite double (1.7976931348623157e+308) where the CPU produces inf. I'll try to repro this.
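
To make the mismatch concrete, the sketch below checks that the finite value in the diff is the largest finite double, and that a relative-tolerance comparison never accepts it against inf. It uses `math.isclose` as a stand-in; this assumes the approximate-float check is tolerance-based, since the thread does not show its implementation. The tolerances are hypothetical.

```python
import math
import sys

cpu = float("inf")               # value on the "-" side of the diff
gpu = 1.7976931348623157e+308    # value on the "+" side of the diff

# The finite value is exactly the largest finite double.
is_largest_finite = (gpu == sys.float_info.max)

# math.isclose treats an infinity as close only to itself, so no relative
# tolerance, however loose, makes these two values compare as close.
close_tight = math.isclose(cpu, gpu, rel_tol=1e-6)   # hypothetical tolerance
close_loose = math.isclose(cpu, gpu, rel_tol=0.5)    # deliberately very loose

print(is_largest_finite, close_tight, close_loose)   # True False False
```

`pytest.approx` documents the same rule: infinity is only considered equal to itself, regardless of the relative tolerance.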

mythrocks avatar Feb 07 '24 19:02 mythrocks

I should mention here that this test failed once a couple of weeks ago, and hasn't been reproducible since. :/

mythrocks avatar Feb 21 '24 19:02 mythrocks

Closing this as not reproducible. We'll reopen if this occurs again.

mythrocks avatar Mar 14 '24 18:03 mythrocks

Saw this fail again in the nightly build on 24.10:

[2024-10-10T15:25:18.293Z] FAILED ../../src/main/python/window_function_test.py::test_range_running_window_float_decimal_sum_runs_batched[1000][DATAGEN_SEED=1728564930, TZ=UTC, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] - AssertionError: GPU and CPU float values are different [6904, 'double_sum']

jlowe avatar Oct 10 '24 18:10 jlowe

Looks like the most recent failure was again an issue with infinity:

[2024-10-10T15:25:18.289Z] -Row(p=-81560160, oby=-14318, short_double_sum=-161930.0, double_sum=inf, short_float_sum=-161930.0, float_sum=2.215917728557696e-10, dec_sum=Decimal('-161930.0'))
[2024-10-10T15:25:18.289Z] +Row(p=-81560160, oby=-14318, short_double_sum=-161930.0, double_sum=1.7976931348623157e+308, short_float_sum=-161930.0, float_sum=2.215917728557696e-10, dec_sum=Decimal('-161930.0'))

I think the issue here is that the value the GPU comes up with is just an epsilon away from infinity (it's the largest possible double value), so maybe we need to update the approx float logic to account for this.
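
A rough sketch of what "accounting for this" could look like, written as a standalone helper rather than a patch to the real check. The name `approx_equal_allowing_overflow` and the default tolerance are hypothetical. The idea is to clamp an infinity to the largest finite double of the same sign before applying the usual relative comparison.

```python
import math
import sys

def approx_equal_allowing_overflow(cpu, gpu, rel_tol=1e-6):
    """Hypothetical approximate comparison for doubles.

    An infinite value is clamped to +/- the largest finite double before the
    relative comparison, so "just overflowed" and "just below overflow"
    compare as close. NaN is never considered close to anything here,
    because math.isclose returns False whenever either argument is NaN.
    """
    def clamp(x):
        if math.isinf(x):
            # copysign keeps the sign of the infinity.
            return math.copysign(sys.float_info.max, x)
        return x

    return math.isclose(clamp(cpu), clamp(gpu), rel_tol=rel_tol)

# The failing pair from the nightly run now compares as close...
print(approx_equal_allowing_overflow(float("inf"), 1.7976931348623157e+308))
# ...while an infinity against a clearly smaller value still does not,
print(approx_equal_allowing_overflow(float("inf"), 1e300))
# and infinities of opposite sign remain different.
print(approx_equal_allowing_overflow(float("inf"), float("-inf")))
```

The trade-off is that this also accepts inf against a value slightly below the maximum (within `rel_tol`), which may or may not be desirable for other tests.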

jlowe avatar Oct 10 '24 18:10 jlowe

Thanks for recording the datagen seed value, @jlowe. I'll try to repro this with the failing seed next week.

mythrocks avatar Oct 11 '24 23:10 mythrocks

Another repro in rapids_databricks_nightly-pre_release-github, run 701 (branch-25.04), with DATAGEN_SEED=1742879618:

FAILED ../../src/main/python/window_function_test.py::test_range_running_window_float_decimal_sum_runs_batched[1000][DATAGEN_SEED=1742879618, TZ=UTC, INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] - AssertionError: GPU and CPU float values are different [827, 'double_sum']
[2025-03-25T06:47:50.105Z] 
[2025-03-25T06:47:50.105Z] cpu = inf, gpu = 1.7976931348623157e+308
[2025-03-25T06:47:50.105Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f4784535ea0>
[2025-03-25T06:47:50.105Z] path = [827, 'double_sum']
[2025-03-25T06:47:50.105Z] 

pxLi avatar Mar 25 '25 07:03 pxLi