Q28 fails in automated nightly runs
This is the same error we saw in https://github.com/rapidsai/gpu-bdb/issues/140, which was in theory resolved by https://github.com/rapidsai/cuml/pull/3152. cc @dantegd @VibhuJawa
28
Encountered Exception while running query
Traceback (most recent call last):
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
config=config,
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
result = func(*args, **kwargs)
File "queries/q28/tpcx_bb_query_28.py", line 341, in main
client=client, train_data=train_data, test_data=test_data
File "queries/q28/tpcx_bb_query_28.py", line 285, in post_etl_processing
model.fit(X_train, y_train)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
return func(*args, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/naive_bayes/naive_bay$s.py", line 190, in fit
client=self.client)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/common/func.py", line 63, in reduce
workers = [(first(who_has[m.key]), m) for m in futures]
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/common/func.py", line 63, in <listcomp>
workers = [(first(who_has[m.key]), m) for m in futures]
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/toolz/itertoolz.py", line 376, in first
return next(iter(seq))
StopIteration
I cannot consistently reproduce this (though others have seen it as well). There may be something subtle happening with the Naive Bayes classifier.
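For context on the last frames of the traceback: `first(seq)` is just `next(iter(seq))`, so the `StopIteration` means `who_has[m.key]` was empty for at least one future, i.e. no worker was reported as holding that key at that moment. Why it was empty is not known. The sketch below reproduces only that mechanism; the `who_has` contents, the key names, and the `pick_workers` helper are hypothetical and are not cuML's code.

```python
def first(seq):
    # Same body as the toolz frame shown in the traceback.
    return next(iter(seq))


# Hypothetical scheduler view: key -> workers currently holding the result.
who_has = {
    "model-part-0": ("tcp://10.0.0.1:40001",),
    "model-part-1": (),  # no worker holds this key
}


def pick_workers(who_has, keys):
    """Pair each key with one holder, collecting keys that have no holder
    instead of letting first() raise StopIteration (hypothetical helper)."""
    placed, missing = [], []
    for key in keys:
        holders = who_has.get(key, ())
        if holders:
            placed.append((first(holders), key))
        else:
            missing.append(key)
    return placed, missing


placed, missing = pick_workers(who_has, ["model-part-0", "model-part-1"])

# Calling first() directly on the empty entry reproduces the exception type.
try:
    first(who_has["model-part-1"])
    raised = False
except StopIteration:
    raised = True
```

Reporting the keys with no holder, rather than failing inside the list comprehension, would at least tell us which partitions went missing when this happens again.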
@beckernick, could you fetch the logs from the workers when you see this error again? I suspect they might have some more context.
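One way to capture the logs at failure time is a small helper around `Client.get_worker_logs()` from `dask.distributed`. This is a rough sketch, not something that exists in the benchmark repo: the `dump_worker_logs` name and the output directory layout are assumptions, and `client` is assumed to be the `dask.distributed.Client` already used by the query.

```python
import os
import re


def dump_worker_logs(client, out_dir):
    """Write each worker's recent log entries to its own file in out_dir.

    `client` is assumed to be a dask.distributed.Client; only its
    get_worker_logs() method is used. Returns the list of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for address, entries in client.get_worker_logs().items():
        # Worker addresses look like tcp://host:port; make them filename-safe.
        name = re.sub(r"[^A-Za-z0-9_.-]+", "_", address).strip("_") + ".log"
        path = os.path.join(out_dir, name)
        with open(path, "w") as f:
            for entry in entries:
                # Entries are usually (level, message) pairs; fall back to str().
                if isinstance(entry, (tuple, list)):
                    f.write(" ".join(str(part) for part in entry) + "\n")
                else:
                    f.write(str(entry) + "\n")
        written.append(path)
    return written
```

Calling this from the `except` branch around the query run, before the cluster is torn down, should keep the worker-side context for the nightly failures.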
Unfortunately, I lost the logs from the failure in the automated nightly run. I could not reproduce this with 100 consecutive runs of Q28. I will trigger a few long-running tests to see if I can grab the worker logs.
~I believe I saw a CUSPARSE_STATUS_NOT_INITIALIZED in the past causing the StopIteration, which on its own might make me wonder if there's some odd behavior going on with the CUDA runtime.~
However, this query clearly succeeds repeatedly on its own. Perhaps there's some unexpected interaction occurring somewhere during the full sweep.