
[BUG] Calling `load_from_sklearn` on a ForestInference instance causes a segmentation fault when predicting

Open daxiongshu opened this issue 1 year ago • 11 comments

Describe the bug
For cuML 23.08, calling load_from_sklearn on a ForestInference instance causes the subsequent predict call to abort with a silent Segmentation fault (core dumped).

Steps/Code to reproduce bug

import cuml
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

skl_model = RandomForestClassifier(n_estimators=10)
skl_model.fit(X, y)

fil_model = cuml.ForestInference()
fil_model.load_from_sklearn(skl_model, output_class=True)
fil_preds = fil_model.predict(X)
# Segmentation fault (core dumped)

Expected behavior
It should either just work or raise a more informative error message, e.g. suggesting the use of cuml.ForestInference.load_from_sklearn instead.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 20.04.6 LTS
  • GPU Model/Driver: V100 and driver 525.105.17
  • CUDA: 11.8
  • Method of cuDF & cuML install: conda

daxiongshu avatar Aug 09 '23 18:08 daxiongshu

I was able to reproduce the error using the latest Docker image (rapidsai/rapidsai-core-nightly:23.08-cuda11.8-runtime-ubuntu22.04-py3.10).

Error:

[W] [18:38:04.294878] Treelite currently does not support float64 model parameters. Accuracy may degrade slightly relative to native sklearn invocation.
Segmentation fault (core dumped)

Furthermore, when I tried the experimental version of FIL, I got a different error:

Traceback (most recent call last):
  File "/workspace/test.py", line 15, in <module>
    fil_preds = fil_model.predict(X)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "fil.pyx", line 1215, in cuml.experimental.fil.fil.ForestInference.predict
  File "base.pyx", line 315, in cuml.internals.base.Base.__getattr__
AttributeError: forest

Script using the experimental FIL:

import cuml
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from cuml.experimental import ForestInference

iris = load_iris()
X, y = iris.data, iris.target

skl_model = RandomForestClassifier(n_estimators=10)
skl_model.fit(X, y)

fil_model = ForestInference()
fil_model.load_from_sklearn(skl_model, output_class=True)
fil_preds = fil_model.predict(X)

hcho3 avatar Aug 09 '23 18:08 hcho3

Loading a model into an existing instance is not yet supported in experimental FIL. Currently, the model must be loaded as:

fil_model = ForestInference.load_from_sklearn(skl_model, output_class=True)
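
For reference, a short sketch contrasting the two call patterns (reusing skl_model and X from the reproduction script above):

# Not yet supported in experimental FIL: loading into an existing instance
# fil_model = ForestInference()
# fil_model.load_from_sklearn(skl_model, output_class=True)

# Supported: construct the model directly via the classmethod
fil_model = ForestInference.load_from_sklearn(skl_model, output_class=True)
fil_preds = fil_model.predict(X)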

wphicks avatar Aug 09 '23 18:08 wphicks

Indeed, using ForestInference.load_from_sklearn with experimental FIL works.

On the other hand, ForestInference.load_from_sklearn from the current FIL fails with this error:

[W] [18:50:58.669347] Treelite currently does not support float64 model parameters. Accuracy may degrade slightly relative to native sklearn invocation.
Error in sys.excepthook:
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/exceptiongroup/_formatting.py", line 71, in exceptiongroup_excepthook
TypeError: 'NoneType' object is not callable

Original exception was:
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'
Exception ignored in: 'cuml.fil.fil.ForestInference_impl.__dealloc__'
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'

hcho3 avatar Aug 09 '23 18:08 hcho3

We also need to throw an informative error when the user attempts to call load_from_sklearn on an existing object.
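
A minimal sketch of what such a guard might look like (illustrative only, not the actual cuML code; the instance-level method would raise instead of proceeding):

# Hypothetical guard for the instance-level load_from_sklearn path
def load_from_sklearn(self, skl_model, **kwargs):
    raise NotImplementedError(
        "Loading a model into an existing ForestInference instance is not "
        "supported. Use the classmethod instead: "
        "ForestInference.load_from_sklearn(skl_model, ...)"
    )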

hcho3 avatar Aug 09 '23 18:08 hcho3

Update: I ran more experiments and here's what I found:

| FIL | Install method | ForestInference().load_from_sklearn(...) | ForestInference.load_from_sklearn(...) |
| --- | --- | --- | --- |
| Current FIL | Build from source | ✔️ | ✔️ |
| Current FIL | Docker nightly (**) | ❌ (segfault) | ❌ (segfault) |
| Current FIL | Conda nightly | ❌ (segfault) | ❌ (segfault) |
| Experimental FIL (*) | Build from source | ✔️ | ✔️ |
| Experimental FIL (*) | Docker nightly (**) | ✔️ | ✔️ |
| Experimental FIL (*) | Conda nightly | ✔️ | ✔️ |

(*) cuml.experimental.ForestInference
(**) rapidsai/base:23.08a-cuda11.8-py3.10

  • Commit ID of the source build: 07176ea74486ac68bf2731fdf54ecdf6afbc04e0 of branch-23.08
  • Output of conda list | grep cuml in the Docker container:
cuml                      23.08.00a       cuda11_py310_230802_g14d931a6e_55    rapidsai-nightly
libcuml                   23.08.00a       cuda11_230810_g07176ea74_59    rapidsai-nightly
  • Output of conda list | grep cuml after a local Conda install:
cuml                      23.08.00a       cuda11_py310_230810_g07176ea74_59    rapidsai-nightly
libcuml                   23.08.00a       cuda11_230810_g07176ea74_59    rapidsai-nightly
  • All segfaults are accompanied by the following message:
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'
Exception ignored in: 'cuml.fil.fil.ForestInference_impl.__dealloc__'
Traceback (most recent call last):
  File "fil.pyx", line 287, in cuml.fil.fil.ForestInference_impl.get_dtype
AttributeError: 'NoneType' object has no attribute 'float32'
Segmentation fault (core dumped)

Perhaps the NumPy module is not being loaded correctly?

hcho3 avatar Aug 10 '23 21:08 hcho3

@wphicks I think there is something wrong with this import: https://github.com/rapidsai/cuml/blob/91d30fc305f399362c248f182a79fcc93c21a051/python/cuml/fil/fil.pyx#L20

NumPy is needed for the following lines, so we may want to import NumPy unconditionally: https://github.com/rapidsai/cuml/blob/91d30fc305f399362c248f182a79fcc93c21a051/python/cuml/fil/fil.pyx#L286-L288
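
To illustrate the two patterns under discussion (a sketch only; the helper is assumed to come from cuml.internals.safe_imports, as used elsewhere in the codebase):

# Safe-import pattern: intended to return a placeholder that raises
# UnavailableError on attribute access if numpy is not installed.
from cuml.internals.safe_imports import cpu_only_import
np = cpu_only_import('numpy')

# Traditional, unconditional import suggested above: fails eagerly at
# import time if numpy is missing.
import numpy as np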

hcho3 avatar Aug 10 '23 21:08 hcho3

Interesting! That import should (generally speaking) be fine because it will load numpy so long as it is available. If it is not available, we should be getting an UnavailableError. If something has changed in terms of how we interact with Cython that has compromised the safe import setup, we definitely need to get to the bottom of that. Let's see if we can find the root cause rather than just switching to a traditional import.

You're right though that we should probably be using the host_xpy setup we use elsewhere:

cp = gpu_only_import('cupy')
np = cpu_only_import('numpy')
host_xpy = cp if is_unavailable(np) else np

We should probably wrap that as a helper function for anywhere we need access to numpy/cupy and don't really care which.
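
A minimal sketch of what that helper might look like, assuming the existing safe-import utilities (gpu_only_import, cpu_only_import, is_unavailable) come from cuml.internals.safe_imports; the name host_xpy_import is purely illustrative:

# Assumed location of the safe-import helpers
from cuml.internals.safe_imports import (
    cpu_only_import,
    gpu_only_import,
    is_unavailable,
)

def host_xpy_import():
    """Illustrative helper: return numpy if available, otherwise cupy.

    Useful anywhere we need an array module and do not care whether it
    is the CPU or GPU implementation.
    """
    cp = gpu_only_import('cupy')
    np = cpu_only_import('numpy')
    return cp if is_unavailable(np) else np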

wphicks avatar Aug 11 '23 17:08 wphicks

It's concerning that the import fails only with the Docker or Conda install. I could not reproduce it when building cuML from source.

hcho3 avatar Aug 11 '23 17:08 hcho3

@wphicks Given the lack of bandwidth on our part, can we switch back to a traditional import to unblock users of load_from_sklearn?

hcho3 avatar Sep 11 '23 17:09 hcho3

It's not clear to me that this will actually solve the issue, or, if it does, that we won't see it elsewhere. Does the host_xpy solution above not work for us?

wphicks avatar Sep 11 '23 21:09 wphicks

It's not clear to me that this will actually solve the issue

The experimental FIL uses traditional imports, and importing works there.

Does the host_xpy solution above not work for us?

This is a bit difficult to verify, since the bug only occurs when using the Docker container or Conda nightly. When my bandwidth allows, I can learn how to build the container from source.

hcho3 avatar Sep 11 '23 21:09 hcho3

The same issue still persists in the 24.08 nightly Docker image.

hcho3 avatar Jul 23 '24 21:07 hcho3