LightGBM
Failed to load Dataset.subset() back after Dataset.save_binary()
Description
I'd like to split a large dataset into several subsets and use them later, via the Dataset.subset() API. I really like the LightGBM binary dataset format because it saves memory and disk space, so I tried to use it everywhere. But I found that the following workflow doesn't work:
- Construct and save a binary dataset.
- Load a subset from the binary dataset.
- Save the subset.
- [Failed] Load the subset back.
Reproducible example
import lightgbm as lgb
import numpy as np
# Create and save the data
data = np.random.random((100,10))
ds = lgb.Dataset(data).construct()
ds.save_binary('train.bin')
# Load, create, and save a subset
ds = lgb.Dataset('train.bin')
subset = ds.subset([1,2,3,5,8]).construct()
print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
subset.save_binary('subset.bin')
# Load but failed
subset = lgb.Dataset('subset.bin').construct()
Error message:
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 5 samples from 100 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 140839269 != config 255
Traceback (most recent call last):
File "load_bin.py", line 16, in <module>
subset = lgb.Dataset('subset.bin').construct()
File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1815, in construct
self._lazy_init(self.data, label=self.label,
File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1528, in _lazy_init
_safe_call(_LIB.LGBM_DatasetCreateFromFile(
File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 140839269 != config 255
Environment info
LightGBM 3.3.2
Command(s) you used to install LightGBM
pip install lightgbm
Other environments:
- Python 3.8.8
- numpy 1.21.5
Hi @YuriWu, thank you for your interest in LightGBM. I'm not able to reproduce this error. Are you sure this example reproduces it?
There is a known error about loading a dataset from a file with non-default parameters, documented in #4904. Do you think that's what you're running into?
I'm sure the example reproduces it. To help troubleshoot further, I created a new virtualenv and only installed lightgbm==3.3.2.
Then I ran python test.py to execute the code.
Environments
Package Version
------------- -------
joblib 1.1.0
lightgbm 3.3.2
numpy 1.19.5
pip 21.3.1
scikit-learn 0.24.2
scipy 1.5.4
setuptools 59.6.0
threadpoolctl 3.1.0
wheel 0.37.1
Code
This time I use range(4*3) to create deterministic toy data, and try to take 3 of the 4 rows (indices 1, 2, 3) as a subset. Here's the new minimal code:
import lightgbm as lgb
import numpy as np
import os
print('lightgbm version: ', lgb.__version__)
print('numpy version: ', np.__version__)
def subset(data):
    # Clean up if exists
    files = ['train.bin', 'subset.bin']
    for file in files:
        if os.path.exists(file):
            os.remove(file)
    ds = lgb.Dataset(data, params={'data_random_seed': 0}).construct()
    ds.save_binary('train.bin')
    # Load, create, and save a subset
    ds = lgb.Dataset('train.bin').construct()
    subset = ds.subset([1,2,3]).construct()
    print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
    subset.save_binary('subset.bin')
    # Load but failed
    subset = lgb.Dataset('subset.bin').construct()
num_rows = 4
num_cols = 3
data = np.array(range( num_rows*num_cols )).reshape(num_rows, num_cols)
print('Data:')
print(data)
print(f'\nSubset of {data.shape} data')
subset(data)
Output
One interesting thing I found is that the error message can differ between runs, which is why I tried fixing data_random_seed.
Here are two possible outputs. They differ only in the value N in "max_bin {N} != config 255".
Possible Output 1
$ python test.py
lightgbm version: 3.3.2
numpy version: 1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 0 != config 255
Traceback (most recent call last):
File "test.py", line 31, in <module>
subset(data)
File "test.py", line 25, in subset
subset = lgb.Dataset('subset.bin').construct()
File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
ctypes.byref(self.handle)))
File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 0 != config 255
Possible Output 2
$ python test.py
lightgbm version: 3.3.2
numpy version: 1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 32709 != config 255
Traceback (most recent call last):
File "test.py", line 33, in <module>
subset(data)
File "test.py", line 25, in subset
subset = lgb.Dataset('subset.bin').construct()
File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
ctypes.byref(self.handle)))
File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 32709 != config 255
Possible root cause
My guess is that when LightGBM creates the subset and saves it to a binary file, it doesn't initialize the header fields correctly, so uninitialized values end up in the file.
Evidence
I ran the script twice, renamed the resulting subset.bin files to subset_0.bin and subset_1.bin, and then inspected their binary content. The headers differ:
xxd subset_0.bin # LightGBMError: Dataset max_bin 0 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42 ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000 ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000 ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000 ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000 ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000 ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30 ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31 ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32 ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000 ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000 ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000120: 0000 0000 0000 0000 ........
xxd subset_1.bin #LightGBMError: Dataset max_bin 32597 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42 ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000 ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000 ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000 ................
0000050: 557f 0000 0000 0000 0000 0000 0000 0000 U...............
0000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000 ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000 ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30 ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31 ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32 ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000 ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000 ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0000120: 0000 0000 0000 0000 ........
hex(32597) = 0x7f55, which matches the bytes 55 7f (little-endian) at offset 0x50 in the second dump.
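The arithmetic can be checked directly from the dumps. The sketch below takes the row at offset 0x50 from each dump (the only row where the two files differ), lists the byte offsets that disagree, and decodes the first four bytes of that row as a little-endian 32-bit integer. Treating the field as a 4-byte little-endian int is an assumption inferred from the error message, not something confirmed from the file format, and the helper names are made up for illustration.

```python
import struct

# Row at offset 0x50 from each dump above, copied without the ASCII column.
ROW_SUBSET_0 = "0000050: 0000 0000 0000 0000 0000 0000 0000 0000"
ROW_SUBSET_1 = "0000050: 557f 0000 0000 0000 0000 0000 0000 0000"


def parse_xxd_row(row):
    """Return (offset, raw bytes) for one xxd line (hypothetical helper)."""
    offset_text, hex_text = row.split(":", 1)
    return int(offset_text, 16), bytes.fromhex(hex_text.replace(" ", ""))


def differing_offsets(row_a, row_b):
    """List the absolute file offsets at which two xxd rows disagree."""
    base, bytes_a = parse_xxd_row(row_a)
    _, bytes_b = parse_xxd_row(row_b)
    return [base + i for i, (a, b) in enumerate(zip(bytes_a, bytes_b)) if a != b]


offsets = differing_offsets(ROW_SUBSET_0, ROW_SUBSET_1)

# Assumption: the value is stored as a 4-byte little-endian signed int at 0x50.
_, raw = parse_xxd_row(ROW_SUBSET_1)
(value,) = struct.unpack_from("<i", raw, 0)

print([hex(o) for o in offsets], value, hex(value))
```

The same four bytes in subset_0.bin are all zero, which lines up with the "max_bin 0" variant of the error.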
I tried to fix it by copying these lines: https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L742-L746
max_bin_ = dataset->max_bin_;
min_data_in_bin_ = dataset->min_data_in_bin_;
bin_construct_sample_cnt_ = dataset->bin_construct_sample_cnt_;
use_missing_ = dataset->use_missing_;
zero_as_missing_ = dataset->zero_as_missing_;
to L736, at the end of Dataset::CopyFeatureMapperFrom.
After recompiling the .so, this seems to fix the problem.
Thanks @YuriWu! I was able to reproduce the error and verified the fix you proposed indeed solves the issue. @shiyu1994 @guolinke can you check the proposed fix here?
The fix looks good to me! Thank you @YuriWu. BTW, should we add a test for it?
@YuriWu would you like to make a PR that includes your fix and a small test?
@jmoralez Sorry, due to the IP policy of my organization, I'm not allowed to make PRs to open source projects. Please test my proposed fix and merge it if it looks correct.