LightGBM
[Bug?] [CLI] Inconsistent Results When Resuming Training from a Saved Model
Description
Hello. I get the same results if I start the same training twice on my big dataset (a bin file), but I get different results if I start a new training from a saved model.
Details
Run 1. Note iteration 5:

> [LightGBM] [Info] Iteration:5, training multi_error : 0.256301
> [LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006
> [LightGBM] [Info] Finished loading parameters
> [LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
> [LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
> [LightGBM] [Info] Finished loading data in 338.806307 seconds
> [LightGBM] [Info] Total Bins 278556290
> [LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
> [LightGBM] [Info] Finished initializing training
> [LightGBM] [Info] Started training...
> [LightGBM] [Info] Start training from score -34.538776
> [LightGBM] [Info] Start training from score -34.538776
> [LightGBM] [Info] Start training from score -1.568826
> [LightGBM] [Info] Start training from score -0.548796
> [LightGBM] [Info] Start training from score -1.541474
> [LightGBM] [Info] Iteration:1, training multi_error : 0.422355
> [LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.428234
> [LightGBM] [Info] 24182.741181 seconds elapsed, finished iteration 1
> [LightGBM] [Info] Iteration:2, training multi_error : 0.422355
> [LightGBM] [Info] Iteration:2, valid_1 multi_error : 0.428234
> [LightGBM] [Info] 40458.526595 seconds elapsed, finished iteration 2
> [LightGBM] [Info] Iteration:3, training multi_error : 0.367912
> [LightGBM] [Info] Iteration:3, valid_1 multi_error : 0.427742
> [LightGBM] [Info] 56931.164557 seconds elapsed, finished iteration 3
> [LightGBM] [Info] Iteration:4, training multi_error : 0.299521
> [LightGBM] [Info] Iteration:4, valid_1 multi_error : 0.426856
> [LightGBM] [Info] 73788.227205 seconds elapsed, finished iteration 4
> **[LightGBM] [Info] Iteration:5, training multi_error : 0.256301**
> **[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006**
> [LightGBM] [Info] 90687.526692 seconds elapsed, finished iteration 5
Run 2: If I set input_model to the iteration-4 model as a starting point, then I get

> [LightGBM] [Info] Iteration:1, training multi_error : 0.257515
> [LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233

which is very different from iteration 5 above.
> [LightGBM] [Info] Finished loading parameters
> [LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
> [LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
> [LightGBM] [Info] Finished loading data in 331.288340 seconds
> [LightGBM] [Info] Total Bins 278556290
> [LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
> [LightGBM] [Info] Finished initializing training
> [LightGBM] [Info] Started training...
> **[LightGBM] [Info] Iteration:1, training multi_error : 0.257515**
> **[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233**
> [LightGBM] [Info] 17028.539284 seconds elapsed, finished iteration 1
Run 3: If I run again, I get the same result as in the second run:
> [LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
> [LightGBM] [Info] Finished initializing training
> [LightGBM] [Info] Started training...
> [LightGBM] [Info] Iteration:1, training multi_error : 0.257515
> [LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
> [LightGBM] [Info] 28231.119183 seconds elapsed, finished iteration 1
Question: Shouldn't Run 2 and Run 3 (iteration 1) have the same result as Run 1 (iteration 5)?
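One hypothesis (an assumption on my part, not confirmed against the LightGBM sources): with bagging_freq = 5 and feature_fraction = 0.7, rows and features are re-sampled deterministically from the random seed plus the current iteration index. Resuming from a saved model restarts the iteration counter at 1, so the resumed run could draw a different bag than the original run drew at iteration 5, while two resumed runs draw identical bags and therefore match each other, exactly as in Runs 2 and 3. A toy Python illustration of that mechanism (not LightGBM's actual RNG):

```python
import random

def bagging_rows(seed, iteration, n_rows, fraction):
    # Deterministic per-iteration row sampling: the RNG is seeded from
    # (seed, iteration), so the same pair always yields the same bag.
    # This only illustrates the idea; LightGBM's sampler differs.
    rng = random.Random(seed * 100003 + iteration)
    k = int(n_rows * fraction)
    return sorted(rng.sample(range(n_rows), k))

# The original run reaches iteration 5 with one particular bag...
iter5_bag = bagging_rows(seed=42, iteration=5, n_rows=100, fraction=0.5)

# ...but a resumed run restarts its counter, so its first iteration
# samples the iteration-1 bag instead, and every resumed run repeats it.
resumed_a = bagging_rows(seed=42, iteration=1, n_rows=100, fraction=0.5)
resumed_b = bagging_rows(seed=42, iteration=1, n_rows=100, fraction=0.5)

print(resumed_a == resumed_b)  # resumed runs agree with each other
print(resumed_a == iter5_bag)  # but not with the original iteration 5
```

Under this hypothesis the behavior is expected rather than a bug, but I cannot tell from the docs whether the CLI is supposed to restore the sampling state on resume.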
Reproducible example
This seems to work for small sample training files; I only see the issue with big training files. The model is 111 MB, and the training and validation bin files are 42 GB. The config file looks like this:
```
task = train
num_threads = 12
force_col_wise = true
device_type = cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
max_bin = 255
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
num_trees = 10000
learning_rate = 0.1
output_model = "cpu\train_model_run1.txt"
# input_model = "cpu\train_model_run1.txt.snapshot_iter_4"
two_round = true
snapshot_freq = 1
feature_pre_filter = False
lambda_l1 = 2.183399274499516e-06
lambda_l2 = 1.3637583114059118e-08
num_leaves = 117
feature_fraction = 0.7
bagging_fraction = 0.9852124460751488
bagging_freq = 5
min_child_samples = 20
```
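Note that this config enables per-iteration sampling (feature_fraction = 0.7, bagging_freq = 5 with bagging_fraction < 1). In case that sampling contributes to the difference between runs, pinning every random seed might make the runs easier to compare. A sketch of extra config lines I would try (the parameter names `seed`, `bagging_seed`, `feature_fraction_seed`, and `deterministic` come from the LightGBM parameter docs; the value 42 is arbitrary, and whether this closes the gap is untested):

```
seed = 42
bagging_seed = 42
feature_fraction_seed = 42
deterministic = true
```

`deterministic = true` is documented to require `force_col_wise` or `force_row_wise`, and `force_col_wise = true` is already set above.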
Environment info
Win 10 Pro + LightGBM CPU mode
LightGBM version or commit hash: 9a76aae1b5979a455bf6d80728d2a88f36823380 (from 08/09/24)
Command(s) you used to install LightGBM: I compiled it with VS 2022 and launch LightGBM from the command line.
Thanks!
- w