Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task
How are you using LightGBM?
- Python package
Environment info
- Operating System: Ubuntu 20.04.1 LTS
- Python version: 3.8.5
- GCC 7.3.0
- LightGBM version or commit hash: 3.1.1
Steps to reproduce
- In a JupyterLab notebook, prepare train and validation datasets. (They are huge and private, so I can't share a reproducible example.)
- Train LightGBM on the data with different sets of features.
- Observe an exception that looks like this: `Check failed: (best_split_info.right_count) > (0) at [...]`. Sometimes it says `left_count` instead of `right_count`. Other times the exception doesn't occur at all, depending on the features I use.
Other details
Apparently this is the piece of code that raises the exception: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L652.
I tried setting `min_data_in_leaf` to a value greater than zero. It helps sometimes, but not reliably. The same goes for `feature_fraction`. I also tried changing `min_sum_hessian_in_leaf`, to no avail, and setting `min_data_in_leaf` and `min_sum_hessian_in_leaf` simultaneously made no difference either. A sketch of the kind of training call involved is shown below.
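For reference, here is a minimal sketch of that call; the data, feature counts, and parameter values are placeholders for illustration, not the real (private) dataset:

```python
import numpy as np
import lightgbm as lgb

# Placeholder data standing in for the private dataset.
rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(1000, 20)), rng.normal(size=1000)
X_val, y_val = rng.normal(size=(200, 20)), rng.normal(size=200)

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {
    "objective": "regression",
    "metric": "l1",
    # Constraints tried while chasing the exception (values illustrative):
    "min_data_in_leaf": 50,
    "min_sum_hessian_in_leaf": 1e-2,
    "feature_fraction": 0.8,
}
booster = lgb.train(params, train_set, valid_sets=[val_set])
```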
This (or a similar) issue is mentioned a few times here:
- https://stackoverflow.com/questions/60161691/best-split-info-check-failure-encountered-while-fitting-lightgbm-classifier
- https://github.com/microsoft/LightGBM/issues/3603
- https://github.com/microsoft/LightGBM/issues/2742
None of them suggests an approach that allowed me to avoid these exceptions. Would you please share any ideas on how to fix this, or at least explain why this issue happens at all? If I understand correctly, one could simply trim the split leading to this error and stop branching further. Please correct me if I'm wrong. Thank you.
You can use a larger `min_data_per_leaf` or `min_hessian_per_leaf`; merely non-zero may not be enough.
The regression objective should be safe in most cases, so I guess you may be using sample weights? If yes, it is better to avoid them.
And which objective function did you use?
> You can use a larger `min_data_per_leaf` or `min_hessian_per_leaf`.
Sure, I tried a few values. If I increase these too much, though, the model becomes under-fitted.
I use "regression"
as the value for the "objective"
param. Metric is "l1"
.
> you may be using sample weights?
Do you mean specifying different weights for different samples? If so, I do not use this in the example we are discussing here.
Thank you very much for the rapid reply!
If there is no sample weight and the objective is `regression`, I think it may be due to another problem, not related to `min_data` and `min_hessian`.
Did it only happen on large-scale data? If yes, I think you can try `deterministic=true`.
> Did it only happen on large-scale data?
Although I didn't run the same experiment on a smaller fraction of my dataset, I tried bagging fraction before (perhaps with a different set of features) and, if I remember correctly, it did not result in the above exception. I'll try it again, thank you.
> you can try `deterministic=true`.
I appreciate the suggestion! I'm already using this param since your helpful advice in https://github.com/microsoft/LightGBM/issues/3654 :)
Interesting, I guess there may be a bug.
Did you use missing value handling? By default, it is enabled if feature values contain NaN. You can also try `use_missing=false`.
> Did you use missing value handling? By default, it is enabled if feature values contain NaN. You can also try `use_missing=false`.
Yes, I do use missing value handling. Will try with `use_missing=false` now and report back.
I tried running the same script as before (so without setting `min_data_in_leaf` or `min_sum_hessian_in_leaf`), with the only change being the addition of `"use_missing": False` to the model's params, as sketched below. The same exception still occurs.
Any other suggestions on how this can be fixed are very welcome.
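The change amounts to something like this (same placeholder setup as the earlier sketch; only the `use_missing` entry is new):

```python
params = {
    "objective": "regression",
    "metric": "l1",
    # The only change relative to the previous run: disable the
    # special handling of missing values.
    "use_missing": False,
}
```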
@ch3rn0v did you use categorical features?
@guolinke, nope, all features are numerical.
Is it possible to provide a reproducible example, by taking a subset of features (or even a subset of rows), so that we can debug with it?
I'll start an internal discussion about this, but I doubt any particular data or even a piece of it will be shared.
In the meantime, I ran a few other tests:
- Tried the same dataset, this time with `bagging_fraction` and `bagging_freq`. The exception still happens.
- Suppose I have a dataset D1 that works OK. When I add a feature F2, I get an exception. If I keep F2 but remove any single feature from D1, the exception does not happen. So the cause is not adding F2 itself, but rather some interaction between the features.
Interestingly, the error still happens even with `"max_depth"` and `"num_leaves"` both set to zero. Perhaps it occurs during some preliminary data verification?
A potential bug in histogram offset assignment may cause this error. I will create a PR for this.
@ch3rn0v Can you please try https://github.com/shiyu1994/LightGBM/tree/fix-3679 to see if the same error occurs?
Hello @shiyu1994, I appreciate your rapid response! Do I understand correctly that the only way to try this version is to follow this: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#linux ? And if so, will I be able to remove this temporary version later? Will it result in any conflict if another version is already installed? Thanks in advance.
@ch3rn0v Yes, you have to install the Python package by building from the source files, as described in the link.
If you are using the Python API, you can use `virtualenv` or `conda` to create a new Python environment, and install the Python package from the branch shiyu1994/fix-3679 in that environment.
You may also install the Python package from the branch directly. If you want to recover a standard released Python package of LightGBM, just use `pip` to remove the branch package and reinstall the latest released Python package.
I tried a few different ways to install this version in a new conda env. Alas, none of them worked.
For instance, `pip install git+git://github.com/shiyu1994/LightGBM@fix-3679` results in:

```
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-[...]/setup.py'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
```
And yes, I did `conda install git pip` before that. Searching for any similar errors didn't help much.
I also don't happen to have `cmake` and can't install it right now.
Would you please suggest any other steps I can take right now, or should I just obtain `cmake`?
Regardless, I'll post an update once I have any news.
Can you please install `cmake`? You have to build LightGBM before installing the Python package, when installing from source code.
The steps to install the Python package from source code are:
```
git clone --recursive https://github.com/microsoft/LightGBM ; cd LightGBM
mkdir build ; cd build
cmake ..
make -j4
cd ../python-package
python setup.py install
```
While I'd be able to test this locally, it only makes sense to run the experiment on a remote machine that has enough processing power, and I'm unable to install `cmake` there. While I could build it locally and `scp` the result to the server, it'd still require `python setup.py install` or similar. As far as I understand, the latter doesn't guarantee isolation within the current conda env, and that's something we can't risk. I'm afraid we'll have to wait until this fix is released in order to be able to test it.
I can still run tests locally, but I don't have a dataset tiny enough that still reproduces the bug with the current (3.1.1) release. Apologies for not returning to you with more meaningful feedback.
> The steps to install the Python package from source code are:

The last step should be `python setup.py install --precompile` if you'd like the Python package installation to pick up the already compiled dynamic library.
@shiyu1994 Could you please transfer your changes from your fork to this repository? I believe you have enough rights to do this as a collaborator. Then we can trigger Azure Pipelines to build a Python wheel file with your changes. And after that, @ch3rn0v will be able to install the patched version with a simple `pip install ...` in an isolated env without any other requirements.
Another option is simply to find the current LightGBM installation folder, rename the `lib_lightgbm.so` file to something like `lib_lightgbm_backup.so`, and download only the patched dynamic library file instead of the whole wheel, in case you cannot take the risk of a not fully isolated environment. This will work because, as far as I can see, the fix only includes changes on the cpp side and doesn't touch the Python wrapper.
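If it helps, one way to locate the installed library is a short snippet like this (a sketch; the exact file layout may differ between installs and platforms):

```python
import os
import lightgbm

# The compiled library normally lives inside the installed package
# directory; list matching files so you know what to back up and replace.
pkg_dir = os.path.dirname(lightgbm.__file__)
for name in os.listdir(pkg_dir):
    if name.startswith("lib_lightgbm"):
        print(os.path.join(pkg_dir, name))
```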
The same issue occurred in GPU LightGBM.
In my case, if I do not use both the `max_depth` and `num_leaves` params together, and use only `num_leaves` (with `max_depth` left as default), the error doesn't come out.
Hope this bug gets fixed soon.
#3694 is opened to potentially fix these errors, but it is only related to the CPU version. We need further investigation if the errors are not fully eliminated after this PR is merged.
@shiyu1994 Just for your information.
I also got a similar error, googled on the web, and found this issue.
I am using LightGBM 3.1.1 (the version that I can install via `pip3 install lightgbm`). I run it with `missing_data=True`, a regression task, least-squares error, no GPU, and with categorical features.
I got the following error at some point:

```
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 651 .
```

I saw #3694 had been merged. Therefore, I compiled the latest version from the GitHub master and it currently works.
My data is also private and cannot be shared. Sorry about that.
Hi @guolinke, I hit the same problem:

```
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 794, in fit
    categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 251, in train
    booster.update(fobj=fobj)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2505, in update
    ctypes.byref(is_finished)))
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /workspace/LightGBM/src/treelearner/serial_tree_learner.cpp, line 653 .
```
Happens when trying to use `mape` on simple random data.
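Roughly this kind of setup (a sketch with made-up shapes and values, not the exact failing script):

```python
import numpy as np
import lightgbm as lgb

# Sketch of the reported setup: random regression data trained with
# the "mape" objective via the scikit-learn API. Shapes, values, and
# parameters are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))
y = np.abs(rng.normal(size=10_000)) + 0.1  # MAPE needs nonzero targets

model = lgb.LGBMRegressor(objective="mape", n_estimators=100)
model.fit(X, y)
```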
Hi @pseudotensor, are you using the released version of LightGBM or building from source?
Building from source, like:

```
rm -rf build ; mkdir -p build ; cd build && \
cmake $(GPU_FLAG) $(CUDA_FLAG) -DCMAKE_INSTALL_PREFIX=$$PYTHONPREFIX -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT:PATHNAME=$$BOOSTPREFIX -DBoost_LIBRARY_DIRS:FILEPATH=$BOOSTPREFIX/lib -DOpenCL_LIBRARY=$$CUDA_HOME/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=$$CUDA_HOME/include/ -DBoost_USE_STATIC_LIBS=ON .. && \
make -j 8 && \
cd ../python-package && rm -rf dist && \
$(PYTHON) setup.py bdist_wheel --precompile --gpu --cuda --hdfs
```

(the `--gpu` etc. options aren't really needed since `--precompile` is there)
Note that I only started seeing this problem when upgrading from 2.2.4 to master.
I'm trying to repro the event seen in our Jenkins testing, but so far no luck.
@pseudotensor Can you please try version 3.0.0 to see if the same problem occurs? Is your training data private? If not, can you share it with us? Thanks.
I've encountered the same problem as the topic starter @ch3rn0v. The problem arises with a particular set of features and disappears when adding or removing at least one feature. Also, I tried changing `max_bin`, and in some cases it helped to solve the problem, but this way is not reliable.