
Test XGBoost in WSL2 + CUDA

startwarfields opened this issue 4 years ago • 26 comments

Hello, I've run into an error trying to get XGBoost GPU working on WSL2. My usage of XGBoost GPU works on a native Linux install with CUDA, and I'm fairly confident this is a CUDA driver issue. Someone recommended opening this issue here so it's kept track of.

Environment:

  • Windows Build 20161
  • Ubuntu on Windows Subsystem for Linux 2 (Linux Version 4.19.121-microsoft-standard)
  • NVIDIA Driver 455.41 / CUDA 11.0
  • GPU: GTX 1070

Running Python XGBoost (XGBRegressor) with a tree method of gpu_hist and gpu_id=0 causes the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  device free failed: unknown error
Aborted
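
For context, here is a minimal sketch of the kind of call that triggers this (hypothetical data; the exact script is not shown in this thread):

import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# gpu_hist with gpu_id=0 is the configuration that aborts under WSL2 here
model = XGBRegressor(tree_method="gpu_hist", gpu_id=0)
model.fit(X, y)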

startwarfields avatar Jul 07 '20 22:07 startwarfields

Does WSL2 support GPUs? Microsoft has not released such a version; GPU support in WSL2 still needs development from Microsoft.

graceyangfan avatar Jul 08 '20 04:07 graceyangfan

https://docs.microsoft.com/en-us/windows/wsl/install-win10

startwarfields avatar Jul 08 '20 07:07 startwarfields

https://docs.nvidia.com/cuda/wsl-user-guide/index.html#installing-wsl2

startwarfields avatar Jul 08 '20 07:07 startwarfields

I am out of time

graceyangfan avatar Jul 09 '20 10:07 graceyangfan

It would be great if we can test XGBoost in WSL2 + CUDA and see whether various functionalities will work.

Note. We cannot use multi-GPU training because NCCL does not support WSL2 yet.

hcho3 avatar Sep 27 '20 07:09 hcho3

To new contributors: Post a comment here if you'd like to test XGBoost with WSL2. I am available for help and guidance. After testing, submit a pull request to update the docs to record which features of XGBoost are currently functional with WSL2. This way you can claim credit for Hacktoberfest 2020.

hcho3 avatar Sep 27 '20 07:09 hcho3

Hi, I might be able to do some tests on a GTX 1050 Ti. Can you point out which tests to begin with?

*edit: on a GTX 1650

otivedani avatar Oct 15 '20 08:10 otivedani

Hi @otivedani, you can start with the Google Test suite: https://xgboost.readthedocs.io/en/latest/contrib/unit_tests.html#running-gtest

trivialfis avatar Oct 15 '20 10:10 trivialfis

I have run the Google Test suite, and here are my logs: https://github.com/otivedani/xgboost/tree/wsl-test-logs/build/wsl2-ubuntu2004/logs

I encountered this error at make test (https://github.com/otivedani/xgboost/blob/wsl-test-logs/build/wsl2-ubuntu2004/logs/02_maketest_gtest.log):

83% tests passed, 1 tests failed out of 6


Total Test time (real) =  70.39 sec


The following tests FAILED:
	  1 - TestXGBoostLib (Child aborted)
Errors while running CTest
make: *** [Makefile:130: test] Error 8

The detailed ctest -VV output is here: https://github.com/otivedani/xgboost/blob/wsl-test-logs/build/wsl2-ubuntu2004/logs/03_ctest_gtest.log

System info:

  • Windows Version 2004 Build 20236.1005
  • WSL 2 Ubuntu 20.04 (Linux PC 4.19.128-microsoft-standard)
  • NVIDIA GTX 1650, CUDA Toolkit 11.1
  • NCCL 2.7.8

Please let me know what you think.

otivedani avatar Oct 20 '20 10:10 otivedani

Great! Can you also try running the unit tests for Python? We'd like to document how well XGBoost works in WSL2.

hcho3 avatar Oct 21 '20 08:10 hcho3

Sure! After running pytest with and without the GPU, here are my logs: https://github.com/otivedani/xgboost/tree/wsl-test-logs/build/wsl2-ubuntu2004/logs/pytest

        if ret != 0:
>           raise XGBoostError(py_str(_LIB.XGBGetLastError()))
E           xgboost.core.XGBoostError: [22:19:17] /home/otivedani/xgboost/src/gbm/../common/common.h:156: XGBoost version not compiled with GPU support.
E           Stack trace:
E             [bt] (0) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x79) [0x7fdf55f41f79]
E             [bt] (1) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::ConfigureUpdaters()+0x105) [0x7fdf560328c5]
E             [bt] (2) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::Configure(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x238) [0x7fdf56037658]
E             [bt] (3) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerConfiguration::Configure()+0x87f) [0x7fdf56074cff]
E             [bt] (4) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x7e) [0x7fdf56062a0e]
E             [bt] (5) /home/otivedani/xgboost/venv/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x69) [0x7fdf55f38549]
E             [bt] (6) /lib/x86_64-linux-gnu/libffi.so.7(+0x6ff5) [0x7fdf7795aff5]
E             [bt] (7) /lib/x86_64-linux-gnu/libffi.so.7(+0x640a) [0x7fdf7795a40a]
E             [bt] (8) /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x58c) [0x7fdf7797328c]

venv/lib/python3.8/site-packages/xgboost/core.py:186: XGBoostError
=============================== warnings summary ===============================

https://raw.githubusercontent.com/otivedani/xgboost/wsl-test-logs/build/wsl2-ubuntu2004/logs/pytest/00-python-gpu-test.log

Steps to reproduce:

# using virtualenv
python3 -m venv venv
source venv/bin/activate
# using last build
python setup.py develop --use-cuda --use-nccl
# install dependencies (latest)
pip install -r ./doc/requirements.txt
pip install numpy scikit-learn
sudo apt install graphviz
# tests
export PYTHONPATH=./venv/lib/python3.8/site-packages:./python-package
pytest -v -s --fulltrace tests/python
pytest -v -s --fulltrace tests/python-gpu

Without using pip / setup.py develop:

E           xgboost.core.XGBoostError: [05:18:19] /home/otivedani/xgboost/src/tree/updater_gpu_hist.cu:786: Exception in gpu_hist: NCCL failure :unhandled system error /home/otivedani/xgboost/src/common/device_helpers.cu(71)

I have tried the build I made for gtest before, as well as a new build without GOOGLE_TEST=ON, but the result is the same. Is there any step I missed?

Note: there is this warning after installing graphviz (and libcuda, IIRC) from apt: /sbin/ldconfig.real: /usr/lib/wsl/lib/libcuda.so.1 is not a symbolic link. Possibly related issue: WSL/issues#5548
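
One quick way (a sketch, not taken from these logs) to check whether the installed libxgboost.so was built with CUDA, before running the full pytest suite:

import numpy as np
import xgboost as xgb

# Train one boosting round with gpu_hist; this raises XGBoostError with
# "not compiled with GPU support" if a CPU-only library was picked up.
dtrain = xgb.DMatrix(np.random.rand(100, 5), label=np.random.rand(100))
try:
    xgb.train({"tree_method": "gpu_hist"}, dtrain, num_boost_round=1)
    print("GPU build OK")
except xgb.core.XGBoostError as err:
    print("No GPU support:", err)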

otivedani avatar Oct 23 '20 09:10 otivedani

I've also run into a similar error trying to get XGBoost GPU working on WSL2 using Python / Dask / CUDA, which also resulted in a temporary black screen on the second monitor :-o

Attempting to train XGBoost with a tree method of gpu_hist causes the following error:
[23:10:50] task [xgboost.dask]:tcp://127.0.0.1:45075 got new rank 0
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorUnknown: unknown error

Note that I have been able to use Dask / CUDA on WSL2 for some other ML algorithms, e.g. k-means. I've also successfully trained XGBoost on WSL2 using Python / Pandas (i.e. on CPU), but so far not on the GPU (with or without Dask).

Environment

  • Microsoft Windows 10 Pro Insider Preview
  • Version 2004 (Build 21277)
  • Windows Subsystem Linux 2 - Ubuntu 20.04
  • NVIDIA Driver 465.21 / CUDA 11.0

Code

The code I ran was taken from https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7
Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)

import xgboost as xgb
import dask_cudf

# HIGGS dataset: one label column followed by 28 feature columns
fname = './data/HIGGS.csv'
colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]
dask_df = dask_cudf.read_csv(fname, header=None, names=colnames)

X = dask_df[dask_df.columns.difference(['label'])]
y = dask_df['label']

dtrain = xgb.dask.DaskDMatrix(client, X, y)
params = {'tree_method': 'gpu_hist'}
output = xgb.dask.train(client, params, dtrain, num_boost_round=100)

BenWynne-Morris avatar Jan 08 '21 00:01 BenWynne-Morris

As an update, if I attempt to train XGBoost with a cuDF dataframe on the GPU ('tree_method': 'gpu_hist'), it fails with the following message:

XGBoostError: [17:01:27] /opt/conda/envs/rapids/conda-bld/xgboost_1607619219243/work/src/tree/updater_gpu_hist.cu:786: Exception in gpu_hist: NCCL failure :unhandled system error /opt/conda/envs/rapids/conda-bld/xgboost_1607619219243/work/src/common/device_helpers.cu(71)

Environment

  • Microsoft Windows 10 Pro Insider Preview
  • Version 2004 (Build 21277)
  • Windows Subsystem Linux 2 - Ubuntu 20.04
  • NVIDIA Driver 465.21 / CUDA 11.0

(Note - I am NOT using Dask in any way, and have only 1 graphics card, a GeForce RTX 2070)

Code

import cudf
from cuml.preprocessing.model_selection import train_test_split
import xgboost as xgb

gdf = cudf.read_csv('./data/pop_2-08.csv',
                    usecols=['age', 'sex', 'northing', 'easting', 'infected'])
x_train, x_test, y_train, y_test = train_test_split(
    gdf[['age', 'sex', 'northing', 'easting']], gdf['infected'])
del(gdf)

params = {
    'max_depth':   8,
    'max_leaves':  2**8,
    'tree_method': 'gpu_hist',
    'objective':   'binary:logistic',
    'grow_policy': 'lossguide',
    'eval_metric': 'logloss',
    'subsample':   0.8}

dtrain = xgb.DMatrix(x_train, y_train)
%time model = xgb.train(params, dtrain, num_boost_round=100)

If I change to 'tree_method': 'hist', it trains without an error, i.e. on the CPU, with a similar wall time to training regular XGBoost with a Pandas dataframe.
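
Until the GPU path works, here is a sketch of a fallback pattern (a hypothetical helper, not from this thread): try gpu_hist first and retry on the CPU when it fails:

import xgboost as xgb

def train_with_fallback(dtrain, params, num_boost_round=100):
    # Try the GPU algorithm first; fall back to the CPU 'hist' method
    # when the GPU/NCCL path fails, as it does here under WSL2.
    try:
        return xgb.train({**params, 'tree_method': 'gpu_hist'}, dtrain,
                         num_boost_round=num_boost_round)
    except xgb.core.XGBoostError:
        return xgb.train({**params, 'tree_method': 'hist'}, dtrain,
                         num_boost_round=num_boost_round)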

BenWynne-Morris avatar Jan 09 '21 17:01 BenWynne-Morris

@hcho3 Hi! Can I still jump into this issue? I can test it in WSL2 with CUDA on a Win11 dev build. System info:

Windows 11 Pro Insider Preview. Build 22471.rs_prerelease.210929-1415
WSL 2 Ubuntu 18.04 (Linux 5.10.60.1-microsoft-standard-WSL2)
NVIDIA GTX 1070, CUDA Toolkit 11.0

hilbert-yaa avatar Nov 10 '21 15:11 hilbert-yaa

XGBoostError: [17:01:27] /opt/conda/envs/rapids/conda-bld/xgboost_1607619219243/work/src/tree/updater_gpu_hist.cu:786: Exception in gpu_hist: NCCL failure :unhandled system error /opt/conda/envs/rapids/conda-bld/xgboost_1607619219243/work/src/common/device_helpers.cu(71)

Oh my goodness, I had the same error and it took me an hour to find your results. So sad to see. WSL fails in mysterious ways like this way too often. I'm going to have to dual-boot Linux, or maybe buy a Mac. Life is pain punctuated with only brief moments of glorious unconsciousness.

KastanDay avatar Dec 03 '21 05:12 KastanDay

Now I got the same error on a Linux server with 4 GPUs, even though I'm only specifying gpu_id=0.

xgboost.core.XGBoostError: [22:52:08] /opt/anaconda/conda-bld/xgboost-base_1601008358431/work/src/tree/updater_gpu_hist.cu:1407: Exception in gpu_hist: NCCL failure :unhandled cuda error /opt/anaconda/conda-bld/xgboost-base_1601008358431/work/src/tree/../common/device_helpers.cuh(896)

KastanDay avatar Dec 03 '21 05:12 KastanDay

Life is pain punctuated with only brief moments of glorious unconsciousness.

Wow, what did Linux do to you ...

I don't have a Windows instance for testing; it would be great if someone could debug the errors and see which CUDA function is malfunctioning.

trivialfis avatar Dec 03 '21 20:12 trivialfis

Wow, what did Linux do to you ...

I love Linux; it's Windows that is killing me (always WSL problems when you try to get fancy, especially related to GPUs).

I would recommend against getting a Windows machine.

KastanDay avatar Dec 04 '21 22:12 KastanDay

I experienced this same error with my setup:

  • Python 3.9.1
  • XGBoost 1.5.1
  • scikit-learn 1.0.2
  • CUDA 11.5
  • NVIDIA Driver 469.49

XGBoostError: [19:30:57] ../src/tree/updater_gpu_hist.cu:770: Exception in gpu_hist: [19:30:57] ../src/common/device_helpers.cuh:132: NCCL failure :unhandled system error ../src/common/device_helpers.cu(67)

It is too bad, because WSL2 can run PyTorch on the GPU. I would love to use both to develop models. Is there anything I can do to help debug this bug?

mitbal avatar Jan 08 '22 12:01 mitbal

Same here:

  • Python 3.8.12
  • XGBoost 1.5.2
  • scikit-learn 1.0.2
  • CUDA 11.5
  • NVIDIA Driver 496.49

XGBoostError: [17:58:22] ../src/tree/updater_gpu_hist.cu:770: Exception in gpu_hist: [17:58:22] ../src/common/device_helpers.cuh:132: NCCL failure :unhandled system error ../src/common/device_helpers.cu(67)

Would also be keen to have this working ... TensorFlow works perfectly ;)

peter-fm avatar Mar 04 '22 18:03 peter-fm

I think the issue is that XGBoost wants to use NCCL, and the NCCL version compiled into XGBoost is too old to work with WSL2. I'm trying to build the XGBoost Python package without NCCL, and I think there are bugs in this CMake setup.

If you set the NCCL debug environment variable, NCCL_DEBUG=INFO, you get more info:

bp-trading:15738:15738 [0] NCCL INFO Bootstrap : Using [0]eth0:172.31.212.247<0>
bp-trading:15738:15738 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

bp-trading:15738:15738 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
bp-trading:15738:15738 [0] NCCL INFO NET/Socket : Using [0]eth0:172.31.212.247<0>
bp-trading:15738:15738 [0] NCCL INFO Using network Socket
NCCL version 2.7.3+cuda11.0

bp-trading:15738:15738 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:91/../../0000:91:00.0
bp-trading:15738:15738 [0] NCCL INFO graph/xml.cc:469 -> 2
bp-trading:15738:15738 [0] NCCL INFO graph/xml.cc:655 -> 2
bp-trading:15738:15738 [0] NCCL INFO graph/topo.cc:523 -> 2
bp-trading:15738:15738 [0] NCCL INFO init.cc:586 -> 2
bp-trading:15738:15738 [0] NCCL INFO init.cc:845 -> 2
bp-trading:15738:15738 [0] NCCL INFO init.cc:881 -> 2
bp-trading:15738:15738 [0] NCCL INFO init.cc:892 -> 2

I think if you have NCCL 2.12 or above, it might work.
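
For reference, a minimal sketch (an assumption, not posted in this thread) of enabling that logging from Python; NCCL reads NCCL_DEBUG when it initializes, so the variable must be set before training starts:

import os
os.environ['NCCL_DEBUG'] = 'INFO'  # must be set before NCCL initializes

import numpy as np
import xgboost as xgb

# NCCL INFO/WARN lines like the ones above are printed during training
dtrain = xgb.DMatrix(np.random.rand(100, 5), label=np.random.rand(100))
xgb.train({'tree_method': 'gpu_hist'}, dtrain, num_boost_round=1)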

polidore avatar Jun 16 '22 17:06 polidore

I was able to build the XGBoost Python package with CUDA 11.4 and NCCL 2.12, and it now uses my GPU in WSL. It would be great for the official XGBoost Python package to use NCCL 2.12 so this is easier for everyone!

polidore avatar Jul 23 '22 16:07 polidore

If I understand correctly, updating to NCCL 2.12 helped and the PR was merged, but there is no official Python package released yet to test it out? 😞

Temppus avatar Aug 08 '22 20:08 Temppus

@Temppus Feel free to try the nightly: https://xgboost.readthedocs.io/en/stable/install.html#id1

trivialfis avatar Aug 10 '22 02:08 trivialfis

Hi @trivialfis, thanks for the suggestion, I will try that.

Temppus avatar Aug 10 '22 19:08 Temppus

Wow, great! Using https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/master/xgboost-2.0.0.dev0%2B446d536c23c5451eaf2879c5b266a2a68ceb07ec-py3-none-manylinux2014_x86_64.whl it works on the GPU on Windows with WSL2 and a Linux container, and training is blazing fast.

Thank you very much ! 😄

Temppus avatar Aug 10 '22 20:08 Temppus

Excellent! I think we can conclude this issue now.

trivialfis avatar Aug 11 '22 05:08 trivialfis

Following @Temppus's solution (which works perfectly), do the following:

If you are on Linux or WSL2:

$ wget https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/master/xgboost-2.0.0.dev0%2B446d536c23c5451eaf2879c5b266a2a68ceb07ec-py3-none-manylinux2014_x86_64.whl

$ pip install --upgrade pip
$ pip install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/master/xgboost-2.0.0.dev0%2B446d536c23c5451eaf2879c5b266a2a68ceb07ec-py3-none-manylinux2014_x86_64.whl

If you are on Windows, just click the link to download the wheel file and run the pip install command.

I am assuming you are inside a virtual env, of course.

BexTuychiev avatar Aug 27 '22 11:08 BexTuychiev