bin size 257 cannot run on GPU

Open pseudotensor opened this issue 5 years ago • 26 comments

I know a couple of other issues mention this problem, but they have gotten messy, with suggestions that it's related to the categorical_feature setting and other things. Here is a clean MRE.

d9a96c90cb479cef87047ba20517d97982b563eb

lgb257.pkl.zip

import pickle
model, X, y, kwargs = pickle.load(open("lgb257.pkl", "rb"))
model.fit(X, y, **kwargs)

FYI a model.get_params() shows:

params = {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'gain',
          'learning_rate': 0.5, 'max_depth': 6, 'min_child_samples': 1, 'min_child_weight': 1.0, 'min_split_gain': 0.0,
          'n_estimators': 100, 'n_jobs': 8, 'num_leaves': 64, 'objective': 'binary', 'random_state': 1234,
          'reg_alpha': 0.0, 'reg_lambda': 1.0, 'silent': True, 'subsample': 0.7, 'subsample_for_bin': 200000,
          'subsample_freq': 1, 'pred_gap': None, 'pred_periods': None, 'max_bin': 255, 'scale_pos_weight': 1.0,
          'max_delta_step': 0.0, 'min_data_in_bin': 1, 'seed': 1234, 'early_stopping_limit': None, 'device_type': 'gpu',
          'gpu_device_id': 0, 'gpu_platform_id': 0, 'gpu_use_dp': True, 'feature_fraction_seed': 1235,
          'bagging_seed': 1236, 'num_threads': 8, 'num_class': 1, 'verbose': -1, 'categorical_feature': ''}

and FYI here is kwargs:

image

[LightGBM] [Warning] num_threads is set=8, n_jobs=8 will be ignored. Current value: num_threads=8
[LightGBM] [Warning] seed is set=1234, random_state=1234 will be ignored. Current value: seed=1234
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1586: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1590: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is []
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1108: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/home/jon/h2oai.fullcondatest/h2oaicore/lgb257.py", line 18, in <module>
    model.fit(X, y, **kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 867, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2104, in __init__
    ctypes.byref(self.handle)))
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Running

model.fit(X, y)

fails the same way, but I'm unsure whether the sklearn API then uses 'auto' for categorical_feature.

pseudotensor avatar Mar 18 '21 17:03 pseudotensor

Here is a more minimal MRE:

import pickle
import lightgbm as lgb
X, y = pickle.load(open("lgb257b.pkl", "rb"))

params = dict(categorical_feature='', device_type='gpu', gpu_device_id=0, gpu_platform_id=0, min_data_in_bin=1, max_bin=255)
model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature='')

FYI gpu_use_dp=True or False has no effect.

That is, I iterated through all the parameters: the failure requires running on GPU (of course) but also min_data_in_bin=1. min_data_in_bin=2 also fails, but 10 does not. So lgb is not respecting the max_bin of 255, even for numeric values.

lgb257b.pkl.zip

If this is a user error, I recommend that lgb listen primarily to max_bin. E.g. when doing hyperparameter search, fatal failures are not fun to handle. It would be best if lgb did the reasonable thing.
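Until that happens, one possible workaround for a hyperparameter search is to catch this specific error and retry on CPU. This is only a minimal sketch: `fit_with_cpu_fallback` and `make_model` are hypothetical names, `make_model` is assumed to be an sklearn-style constructor such as `lgb.LGBMClassifier`, and the check relies on the error text shown in the traceback above.

```python
from typing import Any, Callable, Dict


def fit_with_cpu_fallback(
    make_model: Callable[..., Any],
    X: Any,
    y: Any,
    params: Dict[str, Any],
) -> Any:
    """Fit with the requested device; retry on CPU if the GPU bin-size check fails.

    `make_model` is a hypothetical stand-in for an sklearn-style constructor
    whose instances expose fit(X, y) and return the fitted model.
    """
    try:
        return make_model(**params).fit(X, y)
    except Exception as exc:  # LightGBMError is a plain Exception subclass
        # Only handle the failure from this issue; re-raise anything else.
        if "cannot run on GPU" not in str(exc):
            raise
        # Same parameters, but with the device switched to CPU.
        cpu_params = dict(params, device_type="cpu")
        return make_model(**cpu_params).fit(X, y)
```

With LightGBM this would be called as `fit_with_cpu_fallback(lgb.LGBMClassifier, X, y, params)`. It hides the failure rather than fixing it, so the CPU run may be much slower than the GPU run that was requested.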

pseudotensor avatar Mar 18 '21 18:03 pseudotensor

Hi, any thoughts? Seems like a clear MRE, but it's been 5 days and no response. Thanks.

pseudotensor avatar Mar 23 '21 06:03 pseudotensor

@guolinke ?

pseudotensor avatar Apr 02 '21 17:04 pseudotensor

  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/sklearn.py", line 712, in fit
    self._Booster = train(params, train_set,
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/engine.py", line 235, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 2528, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 258 cannot run on GPU

Again, no categorical handling enabled etc.

This is on master as of last night.

pseudotensor avatar Jul 29 '21 21:07 pseudotensor

@guolinke reminder - still the dominant failure mode for LightGBM in Driverless AI

arnocandel avatar Oct 19 '21 15:10 arnocandel

I think the old GPU/CUDA version will be abandoned. Also cc @shiyu1994 to follow up on this issue.

guolinke avatar Oct 20 '21 13:10 guolinke

@arnocandel We are working on a brand new CUDA version. Please follow #4630 and #4528 for the latest progress.

shiyu1994 avatar Oct 20 '21 13:10 shiyu1994

@shiyu1994 and @guolinke, hi. Looking at those 2 PRs made me realize that perhaps the current CUDA mode (as opposed to OpenCL) is incomplete. E.g. you mention categorical handling as being added to the CUDA version in the PR. Is that correct?

More generally, is the CUDA version incomplete in various ways that are documented? Or does it have (or will have) full parity?

If I run with CUDA version with categorical handling it seems to run fine, but maybe it's not doing what I choose even though I pass categorical_feature?

pseudotensor avatar Oct 20 '21 17:10 pseudotensor

@pseudotensor The current CUDA version is doing the correct thing; it can handle categorical features normally. The only problem is that the current implementation only does histogram construction on the GPU, so GPU utilization can be low.

Support for categorical features has not been added yet in the first part of our new CUDA version (#4630), but will be added later.

shiyu1994 avatar Oct 21 '21 03:10 shiyu1994

Here's another minimal repro, in case it helps

lgb.bin257.pkl.zip

import pickle
import lightgbm as lgb
print(lgb.__version__)

from lightgbm.sklearn import LGBMRegressor
with open("lgb.bin257.pkl", "rb") as f:
    X, y = pickle.load(f)
    model = LGBMRegressor(max_bin=252, device_type='gpu')
    model.fit(X, y)
    print("OK1")

    model = LGBMRegressor(max_bin=253, device_type='gpu')
    model.fit(X, y)
    print("OK2")

The first one passes and the second one fails; I'm not sure where 257 comes from:

3.2.1.99
OK1
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/nfs4/lgb_prefit_1c95733f-58d6-4a61-969f-b2331e03e895.py", line 13, in <module>
    model.fit(X, y)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 851, in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 714, in fit
    self._Booster = train(params, train_set,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/engine.py", line 260, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 2537, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Process finished with exit code 1
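To narrow down where the threshold sits, the two fits above could be generalized into a scan over max_bin values. This is only a sketch: `first_failing_max_bin` and `try_fit` are hypothetical names, and `try_fit` is assumed to wrap a GPU fit like the ones above.

```python
from typing import Callable, Iterable, Optional


def first_failing_max_bin(
    try_fit: Callable[[int], None],
    candidates: Iterable[int],
) -> Optional[int]:
    """Return the first max_bin in `candidates` for which `try_fit` raises the
    GPU bin-size error, or None if none of them fails that way."""
    for max_bin in candidates:
        try:
            try_fit(max_bin)
        except Exception as exc:
            if "cannot run on GPU" in str(exc):
                return max_bin
            raise  # unrelated failure: surface it instead of hiding it
    return None


# Hypothetical usage with the data from this comment:
#
# def try_fit(max_bin):
#     LGBMRegressor(max_bin=max_bin, device_type='gpu').fit(X, y)
#
# print(first_failing_max_bin(try_fit, range(240, 256)))
```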

arnocandel avatar Nov 30 '21 19:11 arnocandel

Thanks very much @arnocandel !

But are you able to provide a reproducible example starting from raw data in a text-based format, generated from scratch with pandas / numpy / scipy code, or using a widely-distributed dataset like those available in sklearn.datasets?

I personally don't ever load pickle files whose origin I don't know, and I expect others wanting to contribute to fixing this issue might share that hesitation.

From https://docs.python.org/3/library/pickle.html

Warning: The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
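For anyone who already has the arrays in memory, one way to share them in a text-based format is to dump them to CSV. This is a minimal sketch using only the standard library; `write_xy_csv` and the `f0`, `f1`, ... column names are hypothetical, and a header row is written so the files can be read back with `pd.read_csv`.

```python
import csv
from typing import Iterable, Sequence


def write_xy_csv(
    X: Iterable[Sequence[float]],
    y: Iterable[float],
    x_path: str = "X.csv",
    y_path: str = "y.csv",
) -> None:
    """Write features and target to two CSV files, each with a header row."""
    rows = list(X)
    n_cols = len(rows[0]) if rows else 0
    with open(x_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Hypothetical column names; pd.read_csv treats this line as the header.
        writer.writerow([f"f{i}" for i in range(n_cols)])
        writer.writerows(rows)
    with open(y_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["y"])
        writer.writerows([value] for value in y)
```

Values are written with their default string representation, so the round trip through CSV is worth checking before assuming the failure still reproduces.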

jameslamb avatar Dec 28 '21 05:12 jameslamb

@jameslamb - ok use this instead: X_y.zip

import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

arnocandel avatar Jan 05 '22 18:01 arnocandel

I'm having the same issue over here!

bin size 257 cannot run on GPU

lewis-morris avatar Jan 12 '22 08:01 lewis-morris

@jameslamb - were you able to check with the above two .csv files for X and y?

Here is the full thing for simplicity: https://github.com/microsoft/LightGBM/files/7817145/X_y.zip

import lightgbm as lgb
print(lgb.__version__)
import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

from lightgbm.sklearn import LGBMRegressor
model = LGBMRegressor(max_bin=252, device_type='gpu')
model.fit(X, y)
print("OK1")

model = LGBMRegressor(max_bin=253, device_type='gpu')
model.fit(X, y)
print("OK2")

arnocandel avatar Feb 24 '22 02:02 arnocandel

were you able to check with above two .csv files for X and y

I was not. If you're subscribed to this issue, you'll be notified when someone picks this up or has new information to share.

jameslamb avatar Feb 24 '22 04:02 jameslamb

This is a bug in LightGBM on GPU; when using the CPU, it is OK.

jiluojiluo avatar Jul 11 '22 10:07 jiluojiluo

Any update so far on this issue?

ahmedshahriar avatar Feb 12 '23 04:02 ahmedshahriar

I'm having the same issue :(

lilianabs avatar Feb 12 '23 17:02 lilianabs

same issue too :(

chixujohnny avatar Mar 16 '23 14:03 chixujohnny

Still have this issue.

holma91 avatar Apr 12 '24 18:04 holma91

I have the same issue

"LightGBMError: bin size 1973 cannot run on GPU."

It runs alright using CPU.

matousfamera avatar May 08 '24 15:05 matousfamera

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.
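On the Python side, switching to that build would presumably come down to changing the device_type parameter. This is a sketch under the assumption that the installed LightGBM was compiled with -DUSE_CUDA=ON and accepts `device_type='cuda'`; `with_cuda_device` is a hypothetical helper name.

```python
from typing import Any, Dict


def with_cuda_device(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `params` that selects the CUDA backend.

    Assumes a LightGBM build compiled with -DUSE_CUDA=ON; the input
    dictionary is left unmodified.
    """
    out = dict(params)
    out["device_type"] = "cuda"
    return out


# Hypothetical usage:
# model = lgb.LGBMClassifier(**with_cuda_device(params))
```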

shiyu1994 avatar May 08 '24 15:05 shiyu1994