
Q28 fails in automated nightly runs

Open beckernick opened this issue 5 years ago • 3 comments

This is the same error as we had in https://github.com/rapidsai/gpu-bdb/issues/140 that was in theory resolved by https://github.com/rapidsai/cuml/pull/3152 . cc @dantegd @VibhuJawa

28
Encountered Exception while running query
Traceback (most recent call last):
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
    config=config,
  File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
    result = func(*args, **kwargs)
  File "queries/q28/tpcx_bb_query_28.py", line 341, in main
    client=client, train_data=train_data, test_data=test_data
  File "queries/q28/tpcx_bb_query_28.py", line 285, in post_etl_processing
    model.fit(X_train, y_train)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/naive_bayes/naive_bay$
s.py", line 190, in fit
    client=self.client)
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/common/func.py", line 63, in reduce
    workers = [(first(who_has[m.key]), m) for m in futures]
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cuml/dask/common/func.py", line 63, in <listcomp>
    workers = [(first(who_has[m.key]), m) for m in futures]
  File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/toolz/itertoolz.py", line 376, in first
    return next(iter(seq))
StopIteration
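
The last frames of the traceback can be reproduced in isolation, which may help when reasoning about the failure. This is a minimal sketch, assuming the `who_has` entry for one of the futures comes back empty; the `who_has` dict, the keys, the worker address, and the `locate` helper are made up for illustration and are not cuML code. Only `first` mirrors the toolz line shown above.

```python
def first(seq):
    # Same body as the toolz frame in the traceback: raises StopIteration
    # when seq is empty.
    return next(iter(seq))


# Hypothetical scheduler response: one key held by a worker, one held by none.
who_has = {
    "model-0": ("tcp://10.0.0.1:40001",),
    "model-1": (),
}


def locate(keys, who_has):
    """Return (worker, key) pairs plus the keys that no worker holds."""
    located, missing = [], []
    for key in keys:
        holders = who_has.get(key, ())
        if holders:
            located.append((first(holders), key))
        else:
            # Calling first(holders) here would raise StopIteration.
            missing.append(key)
    return located, missing


located, missing = locate(["model-0", "model-1"], who_has)
```

Under that assumption, the list comprehension in `func.py` would fail on the first key with no holder, which would match an intermittent failure if a future's data is occasionally gone by the time `reduce` runs.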

beckernick avatar Dec 08 '20 23:12 beckernick

I cannot consistently reproduce this (though others have seen it as well). There may be something subtle happening with the Naive Bayes classifier.

beckernick avatar Dec 09 '20 15:12 beckernick

@beckernick, could you fetch the logs from the workers when you see this error again? I suspect they might have some more context.
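
One way to capture those logs is to call `Client.get_worker_logs()` from `dask.distributed` in the exception handler, before the cluster is torn down. The sketch below is an assumption about how that could be wired up: `dump_worker_logs` and the output directory are hypothetical names, and the code assumes each worker's records are `(level, message)` pairs.

```python
from pathlib import Path


def dump_worker_logs(client, out_dir):
    """Write each worker's recent log records to its own file under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Assumed shape: {worker_address: [(level, message), ...]}
    for address, records in client.get_worker_logs().items():
        # Turn "tcp://host:port" into a filesystem-safe file name.
        safe_name = address.replace("://", "_").replace(":", "_").replace("/", "_")
        path = out / f"{safe_name}.log"
        with path.open("w") as fh:
            for level, message in records:
                fh.write(f"{level}: {message}\n")
        written.append(path)
    return written
```

Calling something like `dump_worker_logs(client, "q28_worker_logs")` inside the `except` block around the query run would keep the logs even when the nightly job cleans up afterwards.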

VibhuJawa avatar Dec 09 '20 19:12 VibhuJawa

Lost the logs from the failure in the automated nightly run, unfortunately. Could not reproduce this with 100 consecutive runs of Q28. Will be triggering a few long-running tests to see if I can grab them.

~I believe I saw a CUSPARSE_STATUS_NOT_INITIALIZED in the past causing the StopIteration, which on its own might make me wonder if there's some odd behavior going on with the CUDA runtime.~

However, this query clearly succeeds repeatedly on its own. Perhaps there's some unexpected interaction occurring somewhere during the full sweep.

beckernick avatar Dec 14 '20 15:12 beckernick