KNN predict gives wrong and varying predictions, cudaErrorIllegalAddress, or core dump
Same as this issue, which was closed by the author even though it was not fixed: https://github.com/rapidsai/cuml/issues/1685
foo_df323a9b-bbb7-49a3-b06e-a9699702c09f.pkl.zip
import pickle
# Load the pickled predict wrapper and its input, then call it once.
func, X = pickle.load(open("foo_df323a9b-bbb7-49a3-b06e-a9699702c09f.pkl", "rb"))
func(X)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
ret_val = func(*args, **kwargs)
File "cuml/neighbors/kneighbors_classifier.pyx", line 300, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
File "cuml/raft/common/handle.pyx", line 86, in cuml.raft.common.handle.Handle.sync
cuml.raft.common.cuda.CudaRuntimeError: Error! cudaErrorIllegalAddress reason='an illegal memory access was encountered' extraMsg='Stream sync'
Upon exit of the Python interpreter there is a core dump.
It seems possible that this is due to constant features, i.e. columns that are all 0, all 1, etc.; a standalone sketch of that hypothesis is below.
This is using RAPIDS 21.08; other details about the system are here: https://github.com/rapidsai/cuml/issues/4610
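Since I can't share the actual pickled pipeline, here is a hedged standalone sketch of the constant-features hypothesis; the column names, shapes, and n_neighbors are made up and only illustrate the shape of the suspected trigger:

import cudf
import cupy as cp
from cuml.neighbors import KNeighborsClassifier

# Hypothetical data with constant columns; the real data lives inside the pickled func.
n = 1000
X_toy = cudf.DataFrame({
    "const0": cp.zeros(n, dtype="float32"),          # all-zero feature
    "const1": cp.ones(n, dtype="float32"),           # all-one feature
    "noise": cp.random.random(n).astype("float32"),  # one non-constant feature
})
y_toy = cudf.Series((cp.random.random(n) > 0.5).astype("int32"))

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_toy, y_toy)

# Repeated calls on identical input should give identical, valid probabilities.
p1 = clf.predict_proba(X_toy)
p2 = clf.predict_proba(X_toy)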
However, what is also really bad about this situation is that sometimes predictions are generated but are wrong, or keep changing (e.g. repeated calls to predict_proba(X) give different results), or, in the multiclass case, one row has 0 for every class probability.
E.g. for this file: KNNCUML_predict_b30d7318-b285-475d-943e-c48ebd2235df.pkl.zip
This is what the sequence looks like:
>>> import pickle
>>> func, X = pickle.load(open("KNNCUML_predict_b30d7318-b285-475d-943e-c48ebd2235df.pkl", "rb"))
>>> func(X)[0:5]
2022-03-10 15:11:25,215 C: 3% D:43.6GB M:46.6GB NODE:SERVER 26864 INFO | init
array([[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.]], dtype=float32)
>>> func(X)[0:5]
array([[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.]], dtype=float32)
>>> func(X)[0:5]
array([[0., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.]], dtype=float32)
>>> func(X)[0:5]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
ret_val = func(*args, **kwargs)
File "cuml/neighbors/kneighbors_classifier.pyx", line 300, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
File "cuml/raft/common/handle.pyx", line 86, in cuml.raft.common.handle.Handle.sync
cuml.raft.common.cuda.CudaRuntimeError: Error! cudaErrorIllegalAddress reason='an illegal memory access was encountered' extraMsg='Stream sync'
>>> func(X)[0:5]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
ret_val = func(*args, **kwargs)
File "cuml/neighbors/kneighbors_classifier.pyx", line 256, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
ret_val = func(*args, **kwargs)
File "cuml/neighbors/nearest_neighbors.pyx", line 488, in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors
File "cuml/neighbors/nearest_neighbors.pyx", line 573, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors
File "cuml/neighbors/nearest_neighbors.pyx", line 635, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_dense
File "/home/jon/minicondadai_py38/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 360, in inner
return func(*args, **kwargs)
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/input_utils.py", line 306, in input_to_cuml_array
X = convert_dtype(X,
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 360, in inner
return func(*args, **kwargs)
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/input_utils.py", line 560, in convert_dtype
would_lose_info = _typecast_will_lose_information(X, to_dtype)
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/input_utils.py", line 612, in _typecast_will_lose_information
X_m = X.values
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cudf/core/dataframe.py", line 994, in values
return cupy.asarray(self.as_gpu_matrix())
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cudf/core/dataframe.py", line 3577, in as_gpu_matrix
matrix = cupy.empty(shape=(nrow, ncol), dtype=cupy_dtype, order=order)
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cupy/_creation/basic.py", line 22, in empty
return cupy.ndarray(shape, dtype, order=order)
File "cupy/_core/core.pyx", line 164, in cupy._core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 735, in cupy.cuda.memory.alloc
File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/rmm/rmm.py", line 212, in rmm_cupy_allocator
buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
File "rmm/_lib/device_buffer.pyx", line 84, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: /home/jon/minicondadai_py38/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorIllegalAddress an illegal memory access was encountered
Sometimes repeating the call produces a full core dump.
So even when predictions don't crash or raise an error, they still give wrong/varying answers, and the probabilities don't even add up to 1 (some rows have a probability of 0 for every class label).
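A quick sanity check along these lines makes the instability easy to see (assuming func and X are the pickled objects from above and that func returns the probabilities as a host array):

import numpy as np

# Three calls on the identical input; func wraps predict_proba per the traceback above.
probs = [np.asarray(func(X)) for _ in range(3)]

for i, p in enumerate(probs):
    # Every row of a predict_proba result should sum to 1.
    print(i, "rows sum to 1:", np.allclose(p.sum(axis=1), 1.0))

# All calls on the same input should agree exactly; here they don't.
print("calls agree:", all(np.array_equal(probs[0], p) for p in probs[1:]))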
The actual GPU usage is minimal:
Thu Mar 10 15:13:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 On | 00000000:01:00.0 On | N/A |
| 45% 53C P0 50W / 215W | 2021MiB / 7979MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2530 G /usr/lib/xorg/Xorg 55MiB |
| 0 N/A N/A 2621 G /usr/bin/gnome-shell 143MiB |
| 0 N/A N/A 3966 G /usr/lib/xorg/Xorg 738MiB |
| 0 N/A N/A 4097 G /usr/bin/gnome-shell 100MiB |
| 0 N/A N/A 4903 G ...AAAAAAAAA= --shared-files 132MiB |
| 0 N/A N/A 26864 C python 843MiB |
+-----------------------------------------------------------------------------+
Is this reproducible in RAPIDS 22.02 / 22.04?
I gave an MRE so you can check it.
This shouldn't have been closed.
Sorry for not replying earlier. It turns out that the pickle files could not be imported in 22.04. Since I can't see the code used, it's not possible for me to reproduce the issue.
For both examples, I get:
Exception ignored in: <bound method NearestNeighbors.__del__ of KNeighborsClassifier()>
Traceback (most recent call last):
File "cuml/neighbors/nearest_neighbors.pyx", line 889, in cuml.neighbors.nearest_neighbors.NearestNeighbors.__del__
File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: knn_index
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cuml.raft'
Regarding the cudaErrorIllegalAddress / core dump, there might possibly be a link to an issue in RMM that has since been solved: https://github.com/rapidsai/rmm/pull/931.
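One hedged way to probe that on the original 21.08 environment would be to route CuPy allocations through an RMM pool (the allocator path visible in the traceback) before loading the pickle, and see whether the failure mode changes; the pool size below is arbitrary:

import rmm
import cupy

# Sketch only: swap in a pooled RMM allocator for CuPy before any cuML work.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2 << 30)  # 2 GiB pool, arbitrary
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)

import pickle
func, X = pickle.load(open("foo_df323a9b-bbb7-49a3-b06e-a9699702c09f.pkl", "rb"))
func(X)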
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.