BUG in GPU histogram

Open lorenzoridolfi opened this issue 7 years ago • 48 comments

Environment info

  • Operating System: Fedora 26
  • CPU: I5
  • GPU: NVidia GTX 1060
  • C++/Python/R version: Python 3.6.2
  • Cuda 9.0

Error Message:

[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048936 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.048049 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458814
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476169, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.039569 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17355, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476170, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.035209 secs. 9 sparse feature groups.
[LightGBM] [Info] Number of positive: 17356, number of negative: 458815
[LightGBM] [Warning] Only find one worker, will switch to serial tree learner.
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1357
[LightGBM] [Info] Number of data: 476171, number of used features: 57
[LightGBM] [Info] Using GPU Device: GeForce GTX 1060 6GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 43 dense feature groups (19.98 MB) transfered to GPU in 0.040315 secs. 9 sparse feature groups.
[LightGBM] [Fatal] Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960

Traceback (most recent call last):
  File "lightgbm_param.py", line 127, in <module>
    main()
  File "lightgbm_param.py", line 79, in main
    categorical_feature=cat_index_2)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 443, in cv
    cvfolds.update(fobj=fobj)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 244, in handlerFunction
    ret.append(getattr(booster, name)(*args, **kwargs))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 1436, in update
    ctypes.byref(is_finished)))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 48, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError())
lightgbm.basic.LightGBMError: b'Bug in GPU histogram! split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960\n'

Reproducible examples

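    import lightgbm as lgb

    # all_x, all_y and cat_index_2 are built earlier in lightgbm_param.py
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'binary_logloss',
        'num_leaves': 31,
        'learning_rate': 0.005,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'verbose': 1,
        'device': 'gpu',  # train with the GPU tree learner
    }

    d_train = lgb.Dataset(all_x, label=all_y)

    # The fatal error is raised from inside this cross-validation call
    cv_results = lgb.cv(params,
                        d_train,
                        num_boost_round=700,
                        categorical_feature=cat_index_2)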

lorenzoridolfi avatar Oct 20 '17 19:10 lorenzoridolfi

Thanks for reporting this problem! There might be a bug triggered by a race condition in the GPU code. I guess it is related to the feature_fraction and bagging_fraction parameters. Could you please change them to 1.0, one at a time, and see which parameter causes the problem?

I would also really appreciate it if you could reproduce the problem on a public dataset, or share your dataset with me if it is not sensitive. This would greatly help me debug this issue. Thank you!

huanzhang12 avatar Oct 24 '17 17:10 huanzhang12

Hi, with these two parameters set to 1.0 the bug still happened, but it took several iterations to occur. With the old values the bug happened within very few iterations.

The source code is: https://www.dropbox.com/s/bqj428pc5vwcpp9/lightgbm_param.py?dl=0

And the data files are: https://www.dropbox.com/s/6lbpn54sdqn98kd/train.csv?dl=0 https://www.dropbox.com/s/lv8sam3tx415x62/test.csv?dl=0

Best Regards, Lorenzo

lorenzoridolfi avatar Oct 29 '17 14:10 lorenzoridolfi

@lorenzoridolfi Thank you for the detailed information on code and data! They are really helpful. I got a little bit busy recently but I will try to catch this bug as quickly as I can.

huanzhang12 avatar Nov 02 '17 04:11 huanzhang12

Any news about this bug? It's been almost a month!

Thank you, Lorenzo

lorenzoridolfi avatar Nov 16 '17 17:11 lorenzoridolfi

ping @huanzhang12 if you have any news

Laurae2 avatar Nov 22 '17 19:11 Laurae2

Sorry, I got crazily busy recently and did not get a chance to look into this bug. I will try to work on it during the Thanksgiving holiday. Thanks for your understanding!

huanzhang12 avatar Nov 22 '17 22:11 huanzhang12

Is this bug related to the bin size error? For example, when I use the GPU version of LightGBM, I get the error "bin size 16855 cannot run on GPU".

mjaysonnn avatar Nov 23 '17 11:11 mjaysonnn

@mjaysonnn The GPU version does not support categorical features with high cardinality. You can work around it by splitting one categorical feature into multiple categorical features.
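
One way to read "splitting one categorical feature into multiple categorical features" is to write each integer category code as two digits in some base, which yields two features of much lower cardinality. The sketch below is only an illustration under that assumption; `split_categorical` and `base=128` are hypothetical names and values, not part of LightGBM, and the high/low grouping is arbitrary, so it may change what the model can learn.

```python
def split_categorical(codes, base=128):
    """Split non-negative integer category codes into (high, low) digit features.

    Each code is recovered as high * base + low, so no information is lost.
    `base` is a hypothetical choice meant to keep each new feature small.
    """
    if base < 2:
        raise ValueError("base must be at least 2")
    high, low = [], []
    for code in codes:
        if code < 0:
            raise ValueError("category codes must be non-negative")
        q, r = divmod(code, base)
        high.append(q)
        low.append(r)
    return high, low


# A feature with 16855 distinct codes, matching the size in the error message.
codes = list(range(16855))
high, low = split_categorical(codes, base=128)
print("distinct values:", len(set(high)), len(set(low)))
```

Both resulting columns would then be passed as separate categorical features in place of the original one.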

guolinke avatar Dec 14 '17 09:12 guolinke

I am also getting this error, using the latest version of LightGBM:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 165885
[LightGBM] [Info] Number of data: 4561756, number of used features: 658
[LightGBM] [Info] Using GPU Device: GeForce GTX 1080 Ti, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (2836.48 MB) transfered to GPU in 1.880998 secs. 7 sparse feature groups
[LightGBM] [Info] Start training from score 0.466854
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.932567 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.892417 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.904232 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.882715 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908154 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.907432 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875952 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.865907 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.862585 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.892193 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.891429 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.915810 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.896881 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.894748 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.934688 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908344 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.879211 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.877866 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.889270 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.850099 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.931063 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.910245 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.856059 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.905356 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881266 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.867991 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.876403 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.873414 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.899580 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.908517 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881924 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.907277 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.874490 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.889323 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.890838 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.871581 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875327 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.885586 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.895304 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.895696 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.935202 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.870557 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.865199 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.912166 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.891637 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.882426 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.894549 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.854855 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.863332 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.881228 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.875864 secs. 7 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 651 dense feature groups (1418.24 MB) transfered to GPU in 0.885681 secs. 7 sparse feature groups
[LightGBM] [Fatal] Bug in GPU histogram! split 139388: 62131, smaller_leaf: 62132, larger_leaf: 139387

mjmckp avatar Jul 23 '18 03:07 mjmckp

@mjmckp Could you please provide the dataset and the python/shell script you used to reproduce this error? This will be really helpful for me to debug this issue.

I tried to reproduce the bug with the dataset and code provided by @lorenzoridolfi but I cannot reproduce it on three different machines. I tried different feature_fraction and bagging_fraction values but still cannot make the bug appear. @lorenzoridolfi Could you please try the latest LightGBM and see if you are still encountering the same error?

huanzhang12 avatar Jul 23 '18 10:07 huanzhang12

Self-contained repro here: https://www.dropbox.com/sh/9f9u7wm5ithfjbr/AADcQ6k8yDSkA3J3vYqsg4Hta?dl=0

Unzip the file dataset.zip and run lightgbm.exe config=repro.conf, console output is in output.txt.

I am running with:

  • LightGBM built from the current master branch (8ce2a232e907d518979e7105842ae575a7427377)
  • Windows 10 Professional
  • NVidia GTX 1080 Ti

mjmckp avatar Jul 24 '18 06:07 mjmckp

@mjmckp Thank you for providing the dataset and config files! I still cannot reproduce this problem on the AMD and NVIDIA GPUs in my machines. However, I did observe a GPU hang on an Intel integrated GPU, which was not tested thoroughly before.

There might be a bug with max_bin=255. Could you please try to use max_bin=63 and see if this bug still occurs (make sure the log says Compiling OpenCL Kernel with 64 bins). If it disappears, I will investigate the OpenCL kernel for 256 bins carefully.
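
To check the log automatically rather than by eye, something like the following could be used. It is a small sketch that only relies on the "Compiling OpenCL Kernel with N bins" line format seen in the logs in this thread; `kernel_bin_counts` and `sample_log` are made-up names for illustration.

```python
import re

# Matches the kernel line as it appears in the console output quoted in this thread.
KERNEL_LINE = re.compile(r"Compiling OpenCL Kernel with (\d+) bins")


def kernel_bin_counts(log_text):
    """Return the bin count from every kernel compilation line, in order."""
    return [int(n) for n in KERNEL_LINE.findall(log_text)]


sample_log = """\
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
"""

counts = kernel_bin_counts(sample_log)
if any(n != 64 for n in counts):
    print("max_bin=63 does not seem to be in effect; kernels compiled with", counts)
```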

@mjmckp Another possibility is here: https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/gpu_tree_learner.cpp#L119 If changing to max_bin=63 does not help, could you please also try uncommenting this line (return 0;) to make GetNumWorkgroupsPerFeature return 0?

huanzhang12 avatar Jul 24 '18 09:07 huanzhang12

After setting max_bin=63 (both when creating the dataset and the trainer), I still get Compiling OpenCL Kernel with 256 bins..., how could this be?

mjmckp avatar Jul 24 '18 11:07 mjmckp

@mjmckp you need to delete the binary training file and regenerate it using save_binary=true
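
A minimal sketch of the "delete the binary training file" step, assuming the cached file sits next to the text data file with a `.bin` suffix appended (please check the actual file name in your working directory). `remove_stale_binary` and `train.txt` are hypothetical names used only for this illustration.

```python
import tempfile
from pathlib import Path


def remove_stale_binary(data_path, suffix=".bin"):
    """Delete the cached binary next to data_path; return True if one was removed."""
    binary = Path(str(data_path) + suffix)
    if binary.is_file():
        binary.unlink()
        return True
    return False


# Demonstration in a throwaway directory with a fake cached file.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.txt"
    Path(str(data) + ".bin").write_bytes(b"stale")
    first = remove_stale_binary(data)   # the fake cache exists and is removed
    second = remove_stale_binary(data)  # nothing left to remove
    print(first, second)
```

After removing the old file, rerunning with max_bin=63 and save_binary=true should regenerate it with the new bin setting.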

huanzhang12 avatar Jul 24 '18 11:07 huanzhang12

Ok, thanks. Setting max_bin=63 also fails with the same error. I have updated the dropbox directory above with two new files:

  • output2.txt: console output
  • dataset2.zip: new dataset saved with max_bin=63

Btw, when trying to debug this, I tried using LightGBM compiled with #define GPU_DEBUG_COMPARE uncommented in gpu_tree_learner.cpp, however this generates an access violation. I also tried setting #define GPU_DEBUG 4, however this generates some compile errors and also runtime errors after working around the compile errors...

mjmckp avatar Jul 24 '18 12:07 mjmckp

I also tried altering GetNumWorkgroupsPerFeature to return 0, and got the same exception.

mjmckp avatar Jul 25 '18 03:07 mjmckp

@mjmckp Thank you for providing the new dataset and trying to debug this problem! Unfortunately, I still cannot reproduce the problem with max_bin=64. However, I fixed the GPU debugging mechanism. You can apply the patch here: https://gist.github.com/huanzhang12/f4f462c56b1920c8e59f3c729e124447 and then #define GPU_DEBUG_COMPARE should work.

huanzhang12 avatar Jul 25 '18 05:07 huanzhang12

@mjmckp you can also try this branch and see if it fixes it: https://github.com/Microsoft/LightGBM/tree/gpu_fix I added a few more boundary checks in the GPU code, but I am not sure if this is the problem.

huanzhang12 avatar Jul 25 '18 10:07 huanzhang12

Thanks. Btw, I added output.zip to the Dropbox directory which contains the console output when run with the patch you gave me (using the second data set with max_bin=63). It contains several failures.

mjmckp avatar Jul 25 '18 11:07 mjmckp

@mjmckp Thank you for the very detailed debugging log! It seems some counter values are off by 1, however I still have no clue why this happens...
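
The same off-by-one pattern is visible in the fatal message quoted earlier in the thread (split 139388: 62131, smaller_leaf: 62132, larger_leaf: 139387). The sketch below parses that message and compares the counts. It assumes, from the message text alone, that the first two numbers are the two sides of the split and the last two are the leaf sizes, and it pairs them by size; `histogram_mismatch` is a made-up helper name.

```python
import re

FATAL = re.compile(
    r"Bug in GPU histogram! split (\d+): (\d+), "
    r"smaller_leaf: (\d+), larger_leaf: (\d+)"
)


def histogram_mismatch(message):
    """Compare split-side counts with leaf counts from a fatal log line."""
    m = FATAL.search(message)
    if m is None:
        return None
    side_a, side_b, smaller, larger = map(int, m.groups())
    split_counts = sorted((side_a, side_b))   # assumed: counts on each side of the split
    leaf_counts = sorted((smaller, larger))   # assumed: sizes of the two resulting leaves
    return {
        "totals_match": sum(split_counts) == sum(leaf_counts),
        "differences": [s - l for s, l in zip(split_counts, leaf_counts)],
    }


msg = ("[LightGBM] [Fatal] Bug in GPU histogram! split 139388: 62131, "
       "smaller_leaf: 62132, larger_leaf: 139387")
print(histogram_mismatch(msg))
```

Under those assumptions the totals agree and each count is off by exactly one, whereas the message from the original report (split 8211: 11359, smaller_leaf: 9610, larger_leaf: 9960) also has matching totals but a much larger per-leaf difference.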

@mjmckp Is the error deterministic (does it occur at the same iteration with the same wrong value each time), or is it random? Could you also try to reduce the dataset size and find a minimal dataset that can reproduce this error? Thanks!

huanzhang12 avatar Jul 25 '18 20:07 huanzhang12

I ran it again using a build from the gpu_fix branch, which fails almost immediately (instead of after a while like before). The output is in output3.txt in the dropbox folder.

mjmckp avatar Jul 25 '18 20:07 mjmckp

The file outputs.zip in the Dropbox directory contains the console output from 3 identical runs, using LightGBM compiled from the gpu_fix branch. A diff on the files shows that the program always fails at the same point, however there are small numerical differences in the calculations leading up to this point.

mjmckp avatar Jul 25 '18 22:07 mjmckp

@mjmckp I found that my fix actually introduces another bug, and I just fixed that in the gpu_fix branch. Could you please re-run training and collect console outputs? Thanks!

huanzhang12 avatar Jul 25 '18 23:07 huanzhang12

@mjmckp any news?

Laurae2 avatar Aug 15 '18 18:08 Laurae2

@huanzhang12 It turns out this was an issue with a faulty GPU. This issue can be closed now IMO.

mjmckp avatar Aug 28 '18 22:08 mjmckp

@mjmckp Thank you for reporting back that the issue is actually caused by a faulty GPU! LightGBM seems to be a good candidate for a GPU stability test :) @lorenzoridolfi Are you still encountering this issue? Can you try replacing the GPU and see if it still occurs?

huanzhang12 avatar Sep 16 '18 08:09 huanzhang12

@guolinke You mentioned in this issue that high cardinality variables are an issue for GPUs. Is there a way LightGBM could display which variable specifically is giving it problems? Alternatively, how does one check the cardinality of variables? I'm unsure what is meant by that... simply the number of unique categorical values?

jjdelvalle avatar Sep 19 '18 13:09 jjdelvalle

@clinchergt yeah, it is the number of unique categorical values.
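
For checking this before training, a rough sketch in plain Python is below: it counts unique values per categorical column and flags the wide ones. The `limit=256` threshold is an assumption taken from the "256 bins" kernel message in the logs above, and `cardinalities`, `too_wide`, and the sample rows are hypothetical.

```python
def cardinalities(rows, categorical_columns):
    """Return {column: number of unique values} for rows given as dicts."""
    return {col: len({row[col] for row in rows}) for col in categorical_columns}


def too_wide(rows, categorical_columns, limit=256):
    """List the columns whose cardinality exceeds `limit` (an assumed threshold)."""
    counts = cardinalities(rows, categorical_columns)
    return [col for col in categorical_columns if counts[col] > limit]


# Toy data: 'user_id' has 1000 distinct values, 'color' has 3.
rows = [{"user_id": i, "color": i % 3} for i in range(1000)]
print(cardinalities(rows, ["user_id", "color"]))
print(too_wide(rows, ["user_id", "color"]))
```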

guolinke avatar Sep 19 '18 15:09 guolinke

@guolinke How is the number of bins determined? Is it directly correlated with the unique categorical values? How can I determine how many bins a specific variable is gonna need?

jjdelvalle avatar Sep 19 '18 16:09 jjdelvalle

@huanzhang12 What is the fate of the gpu_fix branch? Can this issue be closed?

StrikerRUS avatar Dec 04 '18 12:12 StrikerRUS