[BUG] Calling `load_from_sklearn` from a `ForestInference` instance causes a segmentation fault when predicting
**Describe the bug**

For cuML 23.08, calling `load_from_sklearn` on a `ForestInference` instance causes the subsequent `predict` to abort with a silent `Segmentation fault (core dumped)`.
**Steps/Code to reproduce bug**

```python
import cuml
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target
skl_model = RandomForestClassifier(n_estimators=10)
skl_model.fit(X, y)

fil_model = cuml.ForestInference()
fil_model.load_from_sklearn(skl_model, output_class=True)
fil_preds = fil_model.predict(X)
# Segmentation fault (core dumped)
```
**Expected behavior**

It should either just work, or raise a more informative error message, e.g. one suggesting `cuml.ForestInference.load_from_sklearn` instead.
**Environment details (please complete the following information):**

- Environment location: Bare-metal
- Linux Distro/Architecture: Ubuntu 20.04.6 LTS
- GPU Model/Driver: V100 and driver 525.105.17
- CUDA: 11.8
- Method of cuDF & cuML install: conda

**Additional context**
I was able to reproduce the error using the latest Docker image (`rapidsai/rapidsai-core-nightly:23.08-cuda11.8-runtime-ubuntu22.04-py3.10`).
Error:

```
[W] [18:38:04.294878] Treelite currently does not support float64 model parameters. Accuracy may degrade slightly relative to native sklearn invocation.
Segmentation fault (core dumped)
```
Furthermore, when I tried using the experimental version of FIL, I got a different error:
```
Traceback (most recent call last):
  File "/workspace/test.py", line 15, in <module>
    fil_preds = fil_model.predict(X)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "fil.pyx", line 1215, in cuml.experimental.fil.fil.ForestInference.predict
  File "base.pyx", line 315, in cuml.internals.base.Base.__getattr__
AttributeError: forest
```
Script using the experimental FIL:

```python
import cuml
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from cuml.experimental import ForestInference

iris = load_iris()
X, y = iris.data, iris.target
skl_model = RandomForestClassifier(n_estimators=10)
skl_model.fit(X, y)

fil_model = ForestInference()
fil_model.load_from_sklearn(skl_model, output_class=True)
fil_preds = fil_model.predict(X)
```
Loading a model into an existing instance is not yet supported in experimental FIL. Currently, it must be loaded as:

```python
fil_model = ForestInference.load_from_sklearn(skl_model, output_class=True)
```
Indeed, using `ForestInference.load_from_sklearn` with experimental FIL works. On the other hand, `ForestInference.load_from_sklearn` from the current FIL fails with this error:
```
[W] [18:50:58.669347] Treelite currently does not support float64 model parameters. Accuracy may degrade slightly relative to native sklearn invocation.
Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/exceptiongroup/_formatting.py", line 71, in exceptiongroup_excepthook
TypeError: 'NoneType' object is not callable

Original exception was:
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'
Exception ignored in: 'cuml.fil.fil.ForestInference_impl.__dealloc__'
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'
```
We also need to throw an informative error when the user attempts to call `load_from_sklearn` on an existing object.
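As a minimal sketch of such a guard (a toy stand-in class, not cuML's actual implementation): `predict` checks that a forest was actually loaded and raises an informative error pointing at the supported class-level loader, instead of segfaulting or raising a bare `AttributeError: forest`.

```python
class ForestInferenceSketch:
    """Toy stand-in illustrating the proposed guard; not cuML's real class."""

    def __init__(self):
        self._forest = None  # populated only by the class-level loader

    @classmethod
    def load_from_sklearn(cls, skl_model, output_class=False):
        # Class-level entry point: returns a fresh, fully initialized model.
        model = cls()
        model._forest = ("converted", skl_model, output_class)
        return model

    def predict(self, X):
        if self._forest is None:
            # Fail loudly with a hint, rather than crashing deeper in FIL.
            raise RuntimeError(
                "No forest loaded. load_from_sklearn returns a new model; "
                "use `model = ForestInference.load_from_sklearn(...)` "
                "instead of calling it on an existing instance."
            )
        return [0 for _ in X]  # placeholder inference
```

This mirrors the experimental-FIL failure mode above, where calling `load_from_sklearn` on an instance leaves the object without a loaded forest and the error only surfaces at `predict` time.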
Update: I ran more experiments and here's what I found:

| FIL | Install method | `ForestInference().load_from_sklearn(...)` | `ForestInference.load_from_sklearn(...)` |
|---|---|---|---|
| Current FIL | Build from source | ✔️ | ✔️ |
| Current FIL | Docker nightly (\*\*) | ❌ (segfault) | ❌ (segfault) |
| Current FIL | Conda nightly | ❌ (segfault) | ❌ (segfault) |
| Experimental FIL (\*) | Build from source | ✔️ | ✔️ |
| Experimental FIL (\*) | Docker nightly (\*\*) | ✔️ | ✔️ |
| Experimental FIL (\*) | Conda nightly | ✔️ | ✔️ |

(\*) `cuml.experimental.ForestInference`
(\*\*) `rapidsai/base:23.08a-cuda11.8-py3.10`
- Commit ID of the source build: `07176ea74486ac68bf2731fdf54ecdf6afbc04e0` of branch `branch-23.08`
- Output of `conda list | grep cuml` in the Docker container:

```
cuml 23.08.00a cuda11_py310_230802_g14d931a6e_55 rapidsai-nightly
libcuml 23.08.00a cuda11_230810_g07176ea74_59 rapidsai-nightly
```

- Output of `conda list | grep cuml` after a local Conda install:

```
cuml 23.08.00a cuda11_py310_230810_g07176ea74_59 rapidsai-nightly
libcuml 23.08.00a cuda11_230810_g07176ea74_59 rapidsai-nightly
```
- All segfaults are accompanied by the following message:

```
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'
Exception ignored in: 'cuml.fil.fil.ForestInference_impl.__dealloc__'
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'
Segmentation fault (core dumped)
```
Perhaps the NumPy module is not being loaded correctly?
@wphicks I think there is something wrong with this import: https://github.com/rapidsai/cuml/blob/91d30fc305f399362c248f182a79fcc93c21a051/python/cuml/fil/fil.pyx#L20
NumPy is needed for the following lines, so we may want to import NumPy unconditionally: https://github.com/rapidsai/cuml/blob/91d30fc305f399362c248f182a79fcc93c21a051/python/cuml/fil/fil.pyx#L286-L288
Interesting! That import should (generally speaking) be fine, because it will load NumPy so long as it is available. If it is not available, we should be getting an `UnavailableError`. If something has changed in terms of how we interact with Cython that has compromised the safe import setup, we definitely need to get to the bottom of that. Let's see if we can find the root cause rather than just switching to a traditional import.
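For context, the safe-import pattern described here can be sketched generically as follows (an illustrative toy, not cuML's actual `cpu_only_import`/`UnavailableError` implementation): return the real module when importable, otherwise a placeholder whose attribute access raises a clear error instead of returning `None` and crashing later, as seen in the `get_dtype` traceback above.

```python
class UnavailableModule:
    """Placeholder for a module that failed to import; illustrative only."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, attr):
        # Any attribute access (e.g. .float32) fails with a clear message,
        # rather than the opaque 'NoneType' has no attribute error.
        raise ImportError(
            f"Attribute {attr!r} requires {self._name}, which is not installed"
        )


def safe_import(name):
    """Import a module if available, else return an error-raising placeholder."""
    try:
        return __import__(name)
    except ImportError:
        return UnavailableModule(name)
```

If the placeholder mechanism were broken (e.g. a plain `None` reaching `get_dtype`), the symptom would look exactly like the reported `AttributeError: 'NoneType' object has no attribute 'float32'`.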
You're right, though, that we should probably be using the `host_xpy` setup we use elsewhere:

```python
cp = gpu_only_import('cupy')
np = cpu_only_import('numpy')
host_xpy = cp if is_unavailable(np) else np
```
We should probably wrap that as a helper function for anywhere we need access to numpy/cupy and don't really care which.
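Such a helper could look roughly like this (a hypothetical sketch with an invented name, not cuML's actual API): prefer NumPy on the host, and fall back to CuPy only when NumPy is unavailable.

```python
def get_host_xpy():
    """Return an array module usable on the host: NumPy if available, else CuPy.

    Hypothetical helper name; cuML's real implementation would build on its
    gpu_only_import/cpu_only_import machinery rather than raw imports.
    """
    try:
        import numpy
        return numpy
    except ImportError:
        import cupy  # NumPy missing; fall back to CuPy
        return cupy
```

Call sites that just need *some* array namespace (e.g. to look up `float32` in `get_dtype`) could then use `xp = get_host_xpy()` without caring which library backs it.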
It's concerning that the import fails only with the Docker or Conda install. I could not reproduce it when building cuML from source.
@wphicks Given the lack of bandwidth on our part, can we switch back to a traditional import to unblock users of `load_from_sklearn`?
It's not clear to me that that will actually solve this issue or, if it does, that we will not see it elsewhere. Does the `host_xpy` solution above not work for us?
> It's not clear to me that that will actually solve this issue
The experimental FIL uses traditional imports, and its imports work.
> Does the `host_xpy` solution above not work for us?
This is a bit difficult to verify, since the bug only occurs when using the Docker container or the Conda nightly. When my bandwidth allows, I can try building the container from source.
The same issue still persists in the 24.08 nightly Docker image.