knn predict wrong and varying predictions, cudaErrorIllegalAddress, or core dump

Open pseudotensor opened this issue 2 years ago • 7 comments

Same as this issue, which was closed by its author even though it was not fixed:

import pickle
func, X = pickle.load(open("foo_df323a9b-bbb7-49a3-b06e-a9699702c09f.pkl", "rb"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 300, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
  File "cuml/raft/common/handle.pyx", line 86, in cuml.raft.common.handle.Handle.sync
cuml.raft.common.cuda.CudaRuntimeError: Error! cudaErrorIllegalAddress reason='an illegal memory access was encountered' extraMsg='Stream sync'

On exit, the Python interpreter also dumps core.

It seems possible that this is caused by constant features, i.e. columns that are all 0, all 1, etc.
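To test the constant-feature hypothesis, the input can be scanned for columns that hold a single value. This is only a sketch: `constant_columns` and `X_demo` are hypothetical names, and it assumes the data has first been converted to plain Python rows (for example via `.tolist()` on a host array).

```python
def constant_columns(rows):
    """Return indices of columns whose value is identical in every row."""
    rows = [list(r) for r in rows]
    if not rows:
        return []
    n_cols = len(rows[0])
    # A column is constant when the set of its values has exactly one element.
    # Note: NaN values never compare equal, so an all-NaN column is not
    # guaranteed to be reported by this check.
    return [j for j in range(n_cols) if len({r[j] for r in rows}) == 1]


# Hypothetical toy data: columns 0 and 1 are constant, column 2 varies.
X_demo = [
    [0.0, 1.0, 3.2],
    [0.0, 1.0, 4.1],
    [0.0, 1.0, 5.0],
]
print(constant_columns(X_demo))  # [0, 1]
```

If dropping the reported columns makes the errors disappear, that would support the hypothesis; it would not by itself prove that constant features are the cause.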

This is with RAPIDS 21.08; other details about the system are here:

What makes this worse is that sometimes predictions are returned but are wrong or keep changing: repeated calls to predict_proba(X) give different results, and in the multiclass case a row can have probability 0 for every class.

E.g. for this file:

This is what the sequence looks like:

>>> import pickle
>>> func, X = pickle.load(open("KNNCUML_predict_b30d7318-b285-475d-943e-c48ebd2235df.pkl", "rb"))
>>> func(X)[0:5]
2022-03-10 15:11:25,215 C:  3% D:43.6GB  M:46.6GB  NODE:SERVER      26864  INFO   | init
array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)
>>> func(X)[0:5]
array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]], dtype=float32)
>>> func(X)[0:5]
array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]], dtype=float32)
>>> func(X)[0:5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 300, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
  File "cuml/raft/common/handle.pyx", line 86, in cuml.raft.common.handle.Handle.sync
cuml.raft.common.cuda.CudaRuntimeError: Error! cudaErrorIllegalAddress reason='an illegal memory access was encountered' extraMsg='Stream sync'
>>> func(X)[0:5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 256, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/nearest_neighbors.pyx", line 488, in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors
  File "cuml/neighbors/nearest_neighbors.pyx", line 573, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors
  File "cuml/neighbors/nearest_neighbors.pyx", line 635, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_dense
  File "/home/jon/minicondadai_py38/lib/python3.8/", line 75, in inner
    return func(*args, **kwds)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/", line 360, in inner
    return func(*args, **kwargs)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/", line 306, in input_to_cuml_array
    X = convert_dtype(X,
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/", line 360, in inner
    return func(*args, **kwargs)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/", line 560, in convert_dtype
    would_lose_info = _typecast_will_lose_information(X, to_dtype)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/", line 612, in _typecast_will_lose_information
    X_m = X.values
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cudf/core/", line 994, in values
    return cupy.asarray(self.as_gpu_matrix())
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cudf/core/", line 3577, in as_gpu_matrix
    matrix = cupy.empty(shape=(nrow, ncol), dtype=cupy_dtype, order=order)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cupy/_creation/", line 22, in empty
    return cupy.ndarray(shape, dtype, order=order)
  File "cupy/_core/core.pyx", line 164, in cupy._core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 735, in cupy.cuda.memory.alloc
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/rmm/", line 212, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
  File "rmm/_lib/device_buffer.pyx", line 84, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: /home/jon/minicondadai_py38/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorIllegalAddress an illegal memory access was encountered

Sometimes a repeated call causes a full core dump.

So even when a prediction call does not crash or raise an error, the answers are wrong or vary between calls, and the probabilities do not always add up to 1 (in some rows every class label has probability 0).
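Both symptoms can be checked mechanically. The sketch below uses hypothetical helpers (`invalid_proba_rows`, `calls_agree`) that are not part of cuML, and assumes the output of predict_proba has been converted to nested lists (e.g. with `.tolist()`).

```python
import math


def invalid_proba_rows(proba, tol=1e-5):
    """Return indices of rows that are not a valid probability distribution."""
    bad = []
    for i, row in enumerate(proba):
        in_range = all(0.0 <= p <= 1.0 for p in row)
        sums_to_one = math.isclose(sum(row), 1.0, abs_tol=tol)
        if not (in_range and sums_to_one):
            bad.append(i)
    return bad


def calls_agree(predict, X, repeats=3, tol=1e-6):
    """Return True if repeated predict(X) calls give the same values."""
    first = [list(r) for r in predict(X)]
    for _ in range(repeats - 1):
        again = [list(r) for r in predict(X)]
        if len(again) != len(first):
            return False
        for a, b in zip(first, again):
            if len(a) != len(b) or any(abs(x - y) > tol for x, y in zip(a, b)):
                return False
    return True


# First two rows of the third call in the session above.
third_call = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(invalid_proba_rows(third_call))  # [0]: row 0 sums to 0
```

With the pickled function from this report, `calls_agree(lambda X: func(X).tolist(), X)` would be expected to return False whenever the outputs drift as shown above; if the call crashes instead, the exception simply propagates.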

The actual GPU usage is minimal:

Thu Mar 10 15:13:19 2022       
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 2080    On   | 00000000:01:00.0  On |                  N/A |
| 45%   53C    P0    50W / 215W |   2021MiB /  7979MiB |      2%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      2530      G   /usr/lib/xorg/Xorg                 55MiB |
|    0   N/A  N/A      2621      G   /usr/bin/gnome-shell              143MiB |
|    0   N/A  N/A      3966      G   /usr/lib/xorg/Xorg                738MiB |
|    0   N/A  N/A      4097      G   /usr/bin/gnome-shell              100MiB |
|    0   N/A  N/A      4903      G   ...AAAAAAAAA= --shared-files      132MiB |
|    0   N/A  N/A     26864      C   python                            843MiB |

pseudotensor, Mar 10 '22 19:03

Is this reproducible in RAPIDS 22.02 / 22.04?

viclafargue, Mar 18 '22 17:03

I provided an MRE so you can check.

pseudotensor, Mar 27 '22 19:03

This shouldn't have been closed.

pseudotensor, Apr 17 '22 15:04

Sorry for not replying earlier. It turns out that the pickle files cannot be loaded in 22.04. Since I can't see the code that was used, I am not able to reproduce the issue.

For both examples, I get:

Exception ignored in: <bound method NearestNeighbors.__del__ of KNeighborsClassifier()>
Traceback (most recent call last):
  File "cuml/neighbors/nearest_neighbors.pyx", line 889, in cuml.neighbors.nearest_neighbors.NearestNeighbors.__del__
  File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: knn_index
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cuml.raft'
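The ModuleNotFoundError suggests that the pickles reference module paths (here `cuml.raft`) that exist in 21.08 but not in 22.04. One way to make a reproducer survive a version change is to ship raw data and constructor parameters instead of pickled objects. The following is a minimal sketch under that assumption; `save_repro`, `load_repro`, the file name, and the toy data are all hypothetical.

```python
import json
import os
import tempfile


def save_repro(path, X_train, y_train, X_test, params):
    """Write raw data and estimator parameters as JSON (no pickled objects)."""
    bundle = {
        "X_train": X_train,
        "y_train": y_train,
        "X_test": X_test,
        "params": params,
    }
    with open(path, "w") as f:
        json.dump(bundle, f)


def load_repro(path):
    """Read the bundle back as plain lists and dicts."""
    with open(path) as f:
        return json.load(f)


# Hypothetical toy data standing in for the real training and test sets.
path = os.path.join(tempfile.mkdtemp(), "knn_repro.json")
save_repro(path, [[0.0, 1.0], [0.0, 2.0]], [0, 1], [[0.0, 1.5]], {"n_neighbors": 1})
bundle = load_repro(path)

# In the environment under test, the estimator would then be rebuilt from
# the bundle, for example (not executed here, requires a GPU and cuML):
#   import numpy as np
#   from cuml.neighbors import KNeighborsClassifier
#   X_tr = np.asarray(bundle["X_train"], dtype=np.float32)
#   y_tr = np.asarray(bundle["y_train"], dtype=np.int32)
#   model = KNeighborsClassifier(**bundle["params"]).fit(X_tr, y_tr)
#   model.predict_proba(np.asarray(bundle["X_test"], dtype=np.float32))
```

Because the bundle holds only lists, numbers, and a parameter dict, it does not depend on the internal module layout of any particular cuML release.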

viclafargue, Apr 20 '22 09:04

Regarding the cudaErrorIllegalAddress/core dump, there may be a link to an RMM issue that has since been fixed:

viclafargue, Apr 20 '22 09:04

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot], May 20 '22 10:05

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot], Aug 18 '22 11:08