
[Bug?] [CLI] Inconsistent Results When Resuming Training from a Saved Model

Open wil70 opened this issue 6 months ago • 8 comments

Description

Hello, I get the same results if I start the same training twice on my big dataset (a binary file). However, I get different results if I start a new training from a saved model.

Details

Run 1, note iteration 5:
[LightGBM] [Info] Iteration:5, training multi_error : 0.256301
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 338.806307 seconds
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -1.568826
[LightGBM] [Info] Start training from score -0.548796
[LightGBM] [Info] Start training from score -1.541474
[LightGBM] [Info] Iteration:1, training multi_error : 0.422355
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.428234
[LightGBM] [Info] 24182.741181 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training multi_error : 0.422355
[LightGBM] [Info] Iteration:2, valid_1 multi_error : 0.428234
[LightGBM] [Info] 40458.526595 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training multi_error : 0.367912
[LightGBM] [Info] Iteration:3, valid_1 multi_error : 0.427742
[LightGBM] [Info] 56931.164557 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, training multi_error : 0.299521
[LightGBM] [Info] Iteration:4, valid_1 multi_error : 0.426856
[LightGBM] [Info] 73788.227205 seconds elapsed, finished iteration 4
**[LightGBM] [Info] Iteration:5, training multi_error : 0.256301
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006**
[LightGBM] [Info] 90687.526692 seconds elapsed, finished iteration 5

Run 2: If I set input_model to the iteration-4 model as a starting point, then I get:
[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
which is very different from iteration 5 above.

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 331.288340 seconds
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
**[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233**
[LightGBM] [Info] 17028.539284 seconds elapsed, finished iteration 1

Run 3: If I run again, I get the same result as in the second run:

> [LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
> [LightGBM] [Info] Finished initializing training
> [LightGBM] [Info] Started training...
> [LightGBM] [Info] Iteration:1, training multi_error : 0.257515
> [LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
> [LightGBM] [Info] 28231.119183 seconds elapsed, finished iteration 1

Question: Shouldn't Run 2 and Run 3 (iteration 1) have the same result as Run 1 (iteration 5)?
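My guess (I have not verified LightGBM internals, so this is only a hypothesis): the saved model stores the trees but not the state of the random sampler, so a resumed run restarts sampling from scratch. With feature_fraction=0.7 and bagging_fraction<1 / bagging_freq=5 in my config, the resumed run would then see different row/feature subsets than the continuous run did at iteration 5, while still being deterministic between Run 2 and Run 3. A toy sketch of that idea with a plain stdlib RNG (not LightGBM code):

```python
import random

def sample_rows(rng, n=10):
    # Stand-in for one iteration's bagging / feature-fraction draw.
    return rng.sample(range(100), n)

# Continuous run: one RNG stream drives all 5 iterations.
rng = random.Random(1)
continuous = [sample_rows(rng) for _ in range(5)]

# Resumed run: trees are restored from the model file, but (by assumption)
# the RNG restarts from its seed, so its "iteration 1" draw repeats the
# continuous run's iteration 1, not its iteration 5.
resumed_iter1 = sample_rows(random.Random(1))

print(resumed_iter1 == continuous[0])  # fresh RNG repeats the first draw
print(resumed_iter1 == continuous[4])  # but not the fifth
```

If that hypothesis is right, it would also explain why Run 2 and Run 3 match each other exactly: both restart from the same seed.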

Reproducible example

This seems to work for small sample training files; I only see this issue with big training files. The model is 111 MB, and the training and validation bin files are 42 GB. The config file looks like this:

task = train
num_threads = 12
force_col_wise=true
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
max_bin = 255
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
num_trees = 10000
learning_rate = 0.1
output_model = "cpu\train_model_run1.txt"
# input_model = "cpu\train_model_run1.txt.snapshot_iter_4"
two_round = true
snapshot_freq = 1
feature_pre_filter=False
lambda_l1=2.183399274499516e-06
lambda_l2=1.3637583114059118e-08
num_leaves=117
feature_fraction=0.7
bagging_fraction=0.9852124460751488
bagging_freq=5
min_child_samples=20
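For reference, these are roughly the commands I run (the binary name and paths reflect my local setup, not anything canonical):

```shell
# Run 1: fresh training; snapshot_freq=1 writes snapshots such as
# train_model_run1.txt.snapshot_iter_4
lightgbm.exe config=train.conf

# Runs 2 and 3: same config, resuming from the iteration-4 snapshot
# (equivalent to uncommenting input_model in the config file)
lightgbm.exe config=train.conf input_model=cpu\train_model_run1.txt.snapshot_iter_4
```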

Environment info

Win 10 Pro + LightGBM CPU mode

LightGBM version or commit hash: SHA-1: 9a76aae1b5979a455bf6d80728d2a88f36823380 From 08/09/24

Command(s) you used to install LightGBM: I compiled it in VS 2022 and use the command line to start LightGBM

Thanks!

wil70 avatar Aug 14 '24 15:08 wil70