Failed to load Dataset.subset() back after Dataset.save_binary()

Open YuriWu opened this issue 2 years ago • 6 comments

Description

I'd like to split a large dataset into several subsets via the Dataset.subset() API and use them later. I really like the LightGBM binary dataset format because it saves memory and disk space, so I tried to use it everywhere. However, I found that the following workflow doesn't work:

  1. Construct and save a binary dataset.
  2. Load a subset from the binary dataset.
  3. Save the subset.
  4. [Failed] Load the subset back.

Reproducible example

import lightgbm as lgb
import numpy as np

# Create and save the data
data = np.random.random((100,10))
ds = lgb.Dataset(data).construct()
ds.save_binary('train.bin')

# Load, create, and save a subset
ds = lgb.Dataset('train.bin')
subset = ds.subset([1,2,3,5,8]).construct()
print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
subset.save_binary('subset.bin')

# Loading the subset back fails here
subset = lgb.Dataset('subset.bin').construct()

Error message:

[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 5 samples from 100 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 140839269 != config 255
Traceback (most recent call last):
  File "load_bin.py", line 16, in <module>
    subset = lgb.Dataset('subset.bin').construct()
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1815, in construct
    self._lazy_init(self.data, label=self.label,
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1528, in _lazy_init
    _safe_call(_LIB.LGBM_DatasetCreateFromFile(
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 140839269 != config 255

Environment info

LightGBM 3.3.2

Command(s) you used to install LightGBM

pip install lightgbm

Other environments:

  • Python 3.8.8
  • numpy 1.21.5

YuriWu avatar Aug 04 '22 12:08 YuriWu

Hi @YuriWu, thank you for your interest in LightGBM. I'm not able to reproduce this error. Are you sure this example reproduces it?

There is a known error about loading a dataset from a file with non-default parameters, documented in #4904. Do you think that's what you're running into?

jmoralez avatar Aug 07 '22 03:08 jmoralez

I'm sure the example reproduces it. To help troubleshoot further, I created a new virtualenv and installed only lightgbm==3.3.2. Then I ran the code with python test.py.

Environment

Package       Version
------------- -------
joblib        1.1.0
lightgbm      3.3.2
numpy         1.19.5
pip           21.3.1
scikit-learn  0.24.2
scipy         1.5.4
setuptools    59.6.0
threadpoolctl 3.1.0
wheel         0.37.1

Code

Now I use range(4*3) to create deterministic toy data and take rows 1 to 3 as a subset. Here's the new minimal code:

import lightgbm as lgb
import numpy as np
import os

print('lightgbm version: ', lgb.__version__)
print('numpy version: ', np.__version__)

def subset(data):
    # Clean up if exists
    files = ['train.bin', 'subset.bin']
    for file in files:
        if os.path.exists(file):
            os.remove(file)

    ds = lgb.Dataset(data, params={'data_random_seed': 0}).construct()
    ds.save_binary('train.bin')
    
    # Load, create, and save a subset
    ds = lgb.Dataset('train.bin').construct()
    subset = ds.subset([1,2,3]).construct()
    print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
    subset.save_binary('subset.bin')

    # Loading the subset back fails here
    subset = lgb.Dataset('subset.bin').construct()

num_rows = 4
num_cols = 3
data = np.array(range( num_rows*num_cols )).reshape(num_rows, num_cols)
print('Data:')
print(data)
print(f'\nSubset of {data.shape} data')
subset(data)

Output

Interestingly, the error message varies between runs, which is why I tried fixing data_random_seed. Here are two possible outputs; they differ only in the value N reported in max_bin N != config 255.

Possible Output 1

$ python test.py
lightgbm version:  3.3.2
numpy version:  1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 0 != config 255
Traceback (most recent call last):
  File "test.py", line 31, in <module>
    subset(data)
  File "test.py", line 25, in subset
    subset = lgb.Dataset('subset.bin').construct()
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
    ctypes.byref(self.handle)))
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 0 != config 255

Possible Output 2

$ python test.py
lightgbm version:  3.3.2
numpy version:  1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 32709 != config 255
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    subset(data)
  File "test.py", line 25, in subset
    subset = lgb.Dataset('subset.bin').construct()
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
    ctypes.byref(self.handle)))
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 32709 != config 255

Possible root cause

I suspect that when LightGBM creates a subset and saves it to binary, it does not clean and initialize the header fields correctly.

Evidence

I ran the script twice, renamed the resulting subset.bin files to subset_0.bin and subset_1.bin, and inspected the binary content. The headers differ:

xxd subset_0.bin # LightGBMError: Dataset max_bin 0 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42  ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e  inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000  ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000  ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000  ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30  ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31  ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32  ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000  ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000120: 0000 0000 0000 0000                      ........
xxd subset_1.bin # LightGBMError: Dataset max_bin 32597 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42  ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e  inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000  ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000050: 557f 0000 0000 0000 0000 0000 0000 0000  U...............
0000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000  ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000  ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30  ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31  ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32  ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000  ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000120: 0000 0000 0000 0000                      ........

hex(32597) = 0x7f55, which matches the bytes 55 7f (little-endian) at offset 0x50 in subset_1.bin.
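The comparison above can be reproduced programmatically. The following is only a sketch: the byte rows are transcribed from the two xxd dumps (offsets 0x40 to 0x5f), the helper names are made up for illustration, and it assumes the field at offset 0x50 is a 4-byte little-endian integer, which is consistent with the values in the two error messages but is not confirmed against the file format.

```python
import struct

# Rows 0x40-0x5f of each dump, transcribed from the xxd output above.
BASE_OFFSET = 0x40
SUBSET_0 = bytes.fromhex(
    "0300 0000 0000 0000 0000 0000 0000 0000"
    "0000 0000 0000 0000 0000 0000 0000 0000"
)
SUBSET_1 = bytes.fromhex(
    "0300 0000 0000 0000 0000 0000 0000 0000"
    "557f 0000 0000 0000 0000 0000 0000 0000"
)


def differing_offsets(a, b, base=0):
    """Return the file offsets at which two equally long byte strings differ."""
    return [base + i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def read_int32_le(buf, offset, base=0):
    """Decode a 4-byte little-endian signed integer at a file offset (assumed layout)."""
    return struct.unpack_from("<i", buf, offset - base)[0]


diffs = differing_offsets(SUBSET_0, SUBSET_1, BASE_OFFSET)
print([hex(o) for o in diffs])
print(read_int32_le(SUBSET_0, 0x50, BASE_OFFSET))
print(read_int32_le(SUBSET_1, 0x50, BASE_OFFSET))
```

The only differing bytes sit at 0x50 and 0x51, and the decoded values line up with the max_bin numbers in the two error messages, which supports the idea that this field is written without being initialized.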

YuriWu avatar Aug 08 '22 08:08 YuriWu

I tried to fix it by copying these lines: https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L742-L746

  max_bin_ = dataset->max_bin_;
  min_data_in_bin_ = dataset->min_data_in_bin_;
  bin_construct_sample_cnt_ = dataset->bin_construct_sample_cnt_;
  use_missing_ = dataset->use_missing_;
  zero_as_missing_ = dataset->zero_as_missing_;

to L736, the end of Dataset::CopyFeatureMapperFrom. This seems to fix the problem after recompiling the .so.

YuriWu avatar Aug 08 '22 09:08 YuriWu

Thanks @YuriWu! I was able to reproduce the error and verified the fix you proposed indeed solves the issue. @shiyu1994 @guolinke can you check the proposed fix here?

jmoralez avatar Aug 08 '22 23:08 jmoralez

The fix looks good to me! Thank you, @YuriWu. BTW, should we add a test for it?

guolinke avatar Aug 09 '22 11:08 guolinke

@YuriWu would you like to make a PR that includes your fix and a small test?

jmoralez avatar Aug 09 '22 15:08 jmoralez

@jmoralez Sorry, due to the IP policy of my organization, I'm not allowed to make PRs to open source projects. Please test my proposed fix and merge it if it looks legit.

YuriWu avatar Aug 11 '22 13:08 YuriWu

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions[bot] avatar Aug 19 '23 03:08 github-actions[bot]