Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task
How are you using LightGBM?
- Python package
Environment info
- Operating System: Ubuntu 20.04.1 LTS
- Python version: 3.8.5
- GCC 7.3.0
- LightGBM version or commit hash: 3.1.1
Steps to reproduce
- In a JupyterLab notebook, prepare train and validation datasets. (They are huge and private, so I can't share a reproducible example.)
- Train LightGBM on the data with different sets of features.
- Observe an exception that looks like this: `Check failed: (best_split_info.right_count) > (0) at [...]`. Sometimes it says `left_count` instead of `right_count`. Other times the exception doesn't occur at all, depending on the features I use.
Other details
Apparently this is the piece of code that raises the exception: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L652.
I tried setting `min_data_in_leaf` to a value greater than zero. It helps sometimes, but not reliably. The same goes for `feature_fraction`. I also tried changing `min_sum_hessian_in_leaf`, to no avail, and setting `min_data_in_leaf` and `min_sum_hessian_in_leaf` simultaneously made no difference either. A sketch of the kind of training call involved is shown below.
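For reference, here is a minimal sketch of that call; the data, feature counts, and parameter values are placeholders for illustration, not the real (private) dataset:

```python
import numpy as np
import lightgbm as lgb

# Placeholder data standing in for the private dataset.
rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(1000, 20)), rng.normal(size=1000)
X_val, y_val = rng.normal(size=(200, 20)), rng.normal(size=200)

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {
    "objective": "regression",
    "metric": "l1",
    # Constraints tried while chasing the exception (values illustrative):
    "min_data_in_leaf": 50,
    "min_sum_hessian_in_leaf": 1e-2,
    "feature_fraction": 0.8,
}
booster = lgb.train(params, train_set, valid_sets=[val_set])
```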
This (or a similar) issue is mentioned a few times here:
- https://stackoverflow.com/questions/60161691/best-split-info-check-failure-encountered-while-fitting-lightgbm-classifier
- https://github.com/microsoft/LightGBM/issues/3603
- https://github.com/microsoft/LightGBM/issues/2742
None of them suggests an approach that allowed me to avoid these exceptions. Would you please share any ideas on how to fix this, or at least explain why this issue happens at all? If I understand correctly, one could simply trim the split leading to this error and stop branching further. Please correct me if I'm wrong. Thank you.
You can use a larger `min_data_per_leaf` or `min_hessian_per_leaf`; merely non-zero may not be enough.
The regression objective should be safe in most cases, so I guess you may be using sample weights? If yes, it is better to avoid them.
And which objective function did you use?
> You can use a larger `min_data_per_leaf` or `min_hessian_per_leaf`.
Sure, I tried a few values. If I increase these too much, though, the model becomes under-fitted.
I use "regression"
as the value for the "objective"
param. Metric is "l1"
.
> you may be using sample weights?
Do you mean specifying different weights for different samples? If so, I do not use this in the example we are discussing here.
Thank you very much for the rapid reply!
If there is no sample weight and the objective is `regression`, I think it may be due to another problem, not related to `min_data` and `min_hessian`.
Did it only happen on large-scale data? If yes, I think you can try `deterministic=true`.
> Did it only happen on large-scale data?
Although I didn't run the same experiment on a smaller fraction of my dataset, I tried bagging fraction before (perhaps with a different set of features) and, if I remember correctly, it did not result in the above exception. I'll try it again, thank you.
> you can try `deterministic=true`.
I appreciate the suggestion! I'm already using this param since your helpful advice in https://github.com/microsoft/LightGBM/issues/3654 :)
Interesting, I guess there may be a bug.
Did you use missing value handling? By default, it is enabled if feature values contain NaN. You can also try `use_missing=false`.
> Did you use missing value handling? By default, it is enabled if feature values contain NaN. You can also try `use_missing=false`.
Yes, I do use missing value handling. Will try with `use_missing=false` now and report back.
I tried running the same script as before (so without setting `min_data_in_leaf` or `min_sum_hessian_in_leaf`), with the only change being the addition of `"use_missing": False` to the model's params, as sketched below. The same exception still occurs.
Any other suggestions on how this can be fixed are very welcome.
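The change amounts to something like this (same placeholder setup as the earlier sketch; only the `use_missing` entry is new):

```python
params = {
    "objective": "regression",
    "metric": "l1",
    # The only change relative to the previous run: disable the
    # special handling of missing values.
    "use_missing": False,
}
```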
@ch3rn0v did you use categorical features?
@guolinke, nope, all features are numerical.
Is it possible to provide a reproducible example, by taking a subset of features (or even a subset of rows), so that we can debug with it?
I'll start an internal discussion about this, but I doubt any particular data or even a piece of it will be shared.
In the meantime, I ran a few other tests:
- Tried the same dataset, this time with `bagging_fraction` and `bagging_freq`. The exception still happens.
- Suppose I have a dataset D1 that works OK. When I add a feature F2, I get an exception. If I keep F2 but remove any single feature from D1, the exception does not happen. So the cause is not adding F2 itself, but rather some interaction between the features.
Interestingly, the error still happens even with `"max_depth"` and `"num_leaves"` both set to zero. Perhaps it occurs during some preliminary data verification?
A potential bug in histogram offset assignment may cause this error. I will create a PR for this.
@ch3rn0v Can you please try https://github.com/shiyu1994/LightGBM/tree/fix-3679 to see if the same error occurs?
Hello @shiyu1994, I appreciate your rapid response! Do I understand correctly that the only way to try this version is to follow this: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#linux ? And if so, will I be able to remove this temporary version later? Will it result in any conflict if another version is already installed? Thanks in advance.
@ch3rn0v Yes, you have to install the Python package by building from the source files, as described in the link.
If you are using the Python API, you can use `virtualenv` or `conda` to create a new Python environment, and install the Python package from the branch shiyu1994/fix-3679 in that environment.
You may also install the Python package from the branch directly. If you want to recover a standard released Python package of LightGBM, just use `pip` to remove the branch package and reinstall the latest released Python package.
I tried a few different ways to install this version in a new conda env. Alas, none of them worked.
For instance, `pip install git+git://github.com/shiyu1994/LightGBM@fix-3679` results in:

```
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-[...]/setup.py'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
```
And yes, I did `conda install git pip` before that. Searching for any similar errors didn't help much.
I also don't happen to have `cmake` and can't install it right now.
Would you please suggest any other steps I can take right now, or should I just obtain `cmake`?
Regardless, I'll post an update once I have any news.
Can you please install `cmake`? You have to build LightGBM before installing the Python package, when installing from source code.
The steps to install the Python package from source code are:
```
git clone --recursive https://github.com/microsoft/LightGBM ; cd LightGBM
mkdir build ; cd build
cmake ..
make -j4
cd ../python-package
python setup.py install
```
While I'd be able to test this locally, it only makes sense to run the experiment on a remote machine that has enough processing power, and I'm unable to install `cmake` there. While I could build it locally and `scp` the result to the server, it'd still require `python setup.py install` or similar. As far as I understand, the latter doesn't guarantee isolation within the current conda env, and that's something we can't risk. I'm afraid we'll have to wait until this fix is released in order to be able to test it.
I can still run tests locally, but I don't have a dataset tiny enough that still reproduces the bug with the current (3.1.1) release. Apologies for not returning to you with more meaningful feedback.
> The steps to install the Python package from source code are:

The last step should be `python setup.py install --precompile` if you'd like the Python package installation to pick up the already compiled dynamic library.
@shiyu1994 Could you please transfer your changes from your fork to this repository? I believe you have enough rights to do this as a collaborator. Then we can trigger Azure Pipelines to build a Python wheel file with your changes. And after that, @ch3rn0v will be able to install the patched version with a simple `pip install ...` in an isolated env without any other requirements.
Another option is simply to find the current LightGBM installation folder, rename the `lib_lightgbm.so` file to something like `lib_lightgbm_backup.so`, and download only the patched dynamic library file instead of the whole wheel, in case you cannot take the risk of a not fully isolated environment. This will work because, as far as I can see, the fix only includes changes on the cpp side and doesn't touch the Python wrapper.
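If it helps, one way to locate the installed library is a short snippet like this (a sketch; the exact file layout may differ between installs and platforms):

```python
import os
import lightgbm

# The compiled library normally lives inside the installed package
# directory; list matching files so you know what to back up and replace.
pkg_dir = os.path.dirname(lightgbm.__file__)
for name in os.listdir(pkg_dir):
    if name.startswith("lib_lightgbm"):
        print(os.path.join(pkg_dir, name))
```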
The same issue occurred in GPU LightGBM.
In my case, if I do not use both the `max_depth` and `num_leaves` params together, and use only `num_leaves` (with `max_depth` left as default), the error doesn't come out.
Hope this bug gets fixed soon.
#3694 is opened to potentially fix these errors, but it is only related to the CPU version. We need further investigation if the errors are not fully eliminated after this PR is merged.
@shiyu1994 Just for your information.
I also got a similar error, googled on the web, and found this issue.
I am using LightGBM 3.1.1 (the version that I can install via `pip3 install lightgbm`). I run it with `missing_data=True`, a regression task, least-squares error, no GPU, and with categorical features.
I got the following error at some point:

```
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 651 .
```

I saw #3694 had been merged. Therefore, I compiled the latest version from the GitHub master and it currently works.
My data is also private and cannot be shared. Sorry about that.
Hi @guolinke, I hit the same problem:

```
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 794, in fit
    categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 251, in train
    booster.update(fobj=fobj)
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2505, in update
    ctypes.byref(is_finished)))
  File "/opt/h2oai/dai/cuda-10.0/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /workspace/LightGBM/src/treelearner/serial_tree_learner.cpp, line 653 .
```
Happens when trying to use `mape` on simple random data.
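Roughly this kind of setup (a sketch with made-up shapes and values, not the exact failing script):

```python
import numpy as np
import lightgbm as lgb

# Sketch of the reported setup: random regression data trained with
# the "mape" objective via the scikit-learn API. Shapes, values, and
# parameters are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))
y = np.abs(rng.normal(size=10_000)) + 0.1  # MAPE needs nonzero targets

model = lgb.LGBMRegressor(objective="mape", n_estimators=100)
model.fit(X, y)
```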
Hi @pseudotensor, are you using the released version of LightGBM or building from source?
Building from source, like:

```
rm -rf build ; mkdir -p build ; cd build && \
cmake $(GPU_FLAG) $(CUDA_FLAG) -DCMAKE_INSTALL_PREFIX=$$PYTHONPREFIX -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT:PATHNAME=$$BOOSTPREFIX -DBoost_LIBRARY_DIRS:FILEPATH=$BOOSTPREFIX/lib -DOpenCL_LIBRARY=$$CUDA_HOME/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=$$CUDA_HOME/include/ -DBoost_USE_STATIC_LIBS=ON .. && \
make -j 8 && \
cd ../python-package && rm -rf dist && \
$(PYTHON) setup.py bdist_wheel --precompile --gpu --cuda --hdfs
```

(the `--gpu` etc. options aren't really needed since `--precompile` is there)
Note that I only started seeing this problem when upgrading from 2.2.4 to master.
I'm trying to repro the event seen in our Jenkins testing, but so far no luck.
@pseudotensor Can you please try version 3.0.0 to see if the same problem occurs? Is your training data private? If not, can you share it with us? Thanks.
I've encountered the same problem as the topic starter @ch3rn0v. The problem arises with a particular set of features and disappears when adding or removing at least one feature. Also, I tried changing `max_bin`, and in some cases it helped to solve the problem, but this way is not reliable.