
CLI - Predict metrics, question, saving best model

Open wil70 opened this issue 1 year ago • 11 comments

Hello, I'm using LightGBM CLI. I'm able to train, here is my config file:

task = train
num_threads = 6
objective = multiclass
metric = multi_logloss,auc_mu,multi_error
num_class = 5
metric_freq = 10
is_training_metric = true
header = false
label_column=3489061
categorical_feature=3489061
ignore_column=3489056,3489057,3489058,3489059,3489060
max_bin = 255
data = "c:\train_small.csv"
valid_data= "c:\test_small.csv"
num_trees = 100
learning_rate = 0.1
output_model = "train_model_1.txt"
two_round = true

here are the training traces

...
[LightGBM] [Info] 6723.837641 seconds elapsed, finished iteration 79
[LightGBM] [Info] Iteration:80, training multi_logloss : 0.000167775
[LightGBM] [Info] Iteration:80, training auc_mu : -nan(ind)
[LightGBM] [Info] Iteration:80, training multi_error : 0
[LightGBM] [Info] Iteration:80, valid_1 multi_logloss : 0.593622
[LightGBM] [Info] Iteration:80, valid_1 auc_mu : -nan(ind)
[LightGBM] [Info] Iteration:80, valid_1 multi_error : 0.226773
...
[LightGBM] [Info] Iteration:100, training multi_logloss : 2.4757e-05
[LightGBM] [Info] Iteration:100, training auc_mu : -nan(ind)
[LightGBM] [Info] Iteration:100, training multi_error : 0
[LightGBM] [Info] Iteration:100, valid_1 multi_logloss : 0.653959
[LightGBM] [Info] Iteration:100, valid_1 auc_mu : -nan(ind)
[LightGBM] [Info] Iteration:100, valid_1 multi_error : 0.231768
[LightGBM] [Info] 8309.100039 seconds elapsed, finished iteration 100
[LightGBM] [Info] Finished training

and for predict, I configure it this way:

task = predict
data = "c:\test_small_100.csv"
#data = "c:\test_small_100000.csv"
input_model= "C:\train_model_1.txt.snapshot_iter_80"

3 Questions

  1. How do I automatically save the best model? Do I have to use metric_freq = 1?

  2. How do I add metrics to the predict task so I can analyze its precision? Predict currently generates LightGBM_predict_result.txt, but no metrics. I would like to see the metrics computed for data = "c:\test_small_100.csv" (or "c:\test_small_100000.csv").

  3. I'm not able to understand the content of the predicted LightGBM_predict_result.txt file. Here are the first 4 lines:

1.8517091835585531e-15	0.0097533477796164642	0.91078187435773428	0.079464777862645566	1.8517091835585531e-15
5.223358618008706e-15	0.071487969845757365	0.88056063343203761	0.04795139672219461	5.223358618008706e-15
2.5045368136069743e-16	0.7811034753466185	0.21606950953362622	0.0028270151197550058	2.5045368136069743e-16
3.4335294749745184e-16	0.00028905631527815031	0.0069488411409046776	0.99276210254381636	3.4335294749745184e-16

Thanks

--w

wil70 avatar Feb 26 '23 00:02 wil70

Any idea? TY!

wil70 avatar Mar 27 '23 03:03 wil70

Hi @wil70, thanks for using LightGBM.

  1. Do you mean the model with the best metric? If so you can use early stopping.
  2. The metrics are only printed during training, if you want to you can use a different program to compute metrics based on the predictions output file.
  3. The result has one row per observation in your data and one column for each class, where each column represents the probability of that sample being from that class.
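For point 1, early stopping in the CLI is driven by the `early_stopping_round` parameter together with a validation set. A minimal config sketch, reusing the file names from this thread (`early_stopping_round = 10` is just an illustrative value):

```
task = train
data = "c:\train_small.csv"
valid_data = "c:\test_small.csv"
objective = multiclass
num_class = 5
num_trees = 100
# stop once the validation metric has not improved for 10 consecutive iterations
early_stopping_round = 10
output_model = "train_model_1.txt"
```

With this set, training stops when the metric on valid_data stops improving, rather than always running the full num_trees iterations.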

Please let us know if you have further doubts.

jmoralez avatar Mar 28 '23 16:03 jmoralez

Thanks @jmoralez

For

  1. Right now I have metric_freq = 10. Is the last model output by LightGBM always the best model? If "yes", all is good; if "no", is there a way to save the best model, apart from those saved with metric_freq = 10, so I know which one is the best?
  2. I can write some code to go over each row, compare it to the expected result, and compute some stats. You mentioned a program that already does this, which one? TY! A "confusion matrix" would be awesome.
  3. Super TY!
  4. I set the number of threads to the number of cores I have. I know the documentation mentions "be aware a task manager or any similar CPU monitoring tool might report that cores not being fully utilized. This is normal". Right now my Task Manager reports 8% CPU, which seems to be 100% / 12 logical cores (I specified num_threads = 6). Is there any way to verify that all 6 cores are really working hard? I ask because I'm surprised my machine is so responsive while running LightGBM.

Thanks for your help --w

wil70 avatar Mar 31 '23 03:03 wil70

  1. The metric_freq is only for printing, so metric_freq = 10 prints every 10 iterations. The model that will be saved is the one having as many iterations as you specified in your parameters (100). If you want to stop training when your metric stops improving you should use early stopping, which will also save the best model (iteration at which it stopped improving).
  2. I meant writing like a python script to do that. You can read the target from your data file and the prediction from the output and just run one of scikit-learn's metrics, for example.
  3. I think there's another tab in the task manager that shows the utilization of each individual core, that may be more useful to you. If you have 6 cores setting num_threads=6 should be fine. Also, since you have many columns, you may want to set force_col_wise=True so that the multithreading happens at the column level.
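Following up on point 2, a small script along these lines can turn the prediction file into an accuracy score and a confusion matrix. This is a sketch under the assumptions of this thread: the prediction file has one row of per-class probabilities per observation (as in LightGBM_predict_result.txt with 5 classes), and the true labels are available as integers; the inline data and labels below are purely illustrative stand-ins.

```python
def argmax(row):
    """Index of the largest probability, i.e. the predicted class."""
    return max(range(len(row)), key=row.__getitem__)

def confusion_matrix(y_true, y_pred, n_classes):
    """counts[i][j] = number of samples with true class i predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts

# Illustrative stand-in for LightGBM_predict_result.txt: one row of
# per-class probabilities per observation (5 classes in this thread).
probs = [
    [0.00, 0.01, 0.91, 0.08, 0.00],
    [0.00, 0.07, 0.88, 0.05, 0.00],
    [0.00, 0.78, 0.22, 0.00, 0.00],
    [0.00, 0.00, 0.01, 0.99, 0.00],
]
y_true = [2, 2, 1, 0]  # hypothetical ground-truth labels

y_pred = [argmax(row) for row in probs]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
cm = confusion_matrix(y_true, y_pred, n_classes=5)

print("pred:", y_pred)        # [2, 2, 1, 3]
print("accuracy:", accuracy)  # 0.75
```

In practice the probabilities would be loaded with something like `[list(map(float, line.split())) for line in open("LightGBM_predict_result.txt")]`, and the labels read from the corresponding column of the test csv; scikit-learn's `sklearn.metrics` module offers ready-made versions of these functions.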

jmoralez avatar Mar 31 '23 15:03 jmoralez

Thanks a lot!

After some time I see this error. I regenerated the data file but still get the same one. It takes a few days before the error appears. I looked inside the file and cannot find "na$"; I will do another search just in case.

Any idea what I shall do next?

Thanks! w

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Using column number 3489061 as label
[LightGBM] [Info] Warning: last line of E:\CATS\files\data_analysis\output\train.csv has no end of line, still using this line
[LightGBM] [Fatal] Unknown token na$→∩┐╜∩┐╜m∩┐╜∩┐╜mo∩┐╜l∩┐╜d∩┐╜∩┐╜;∩┐╜%g∩┐╜?w∩┐╜∩┐╜┼╖↓∩┐╜▬∩┐╜∩┐╜ovh0∩┐╜∩┐╜a∩┐╜5∩┐╜∩┐╜*∩┐╜╪Æ∩┐╜∩┐╜l═¢∩┐╜s∩┐╜iy☺∩┐╜r∩┐╜o7∩┐╜∩┐╜∩┐╜∩┐╜%l]∩┐╜∩┐╜%∩┐╜∩┐╜∟∩┐╜hk in data file

wil70 avatar Apr 19 '23 14:04 wil70

That's coming from here https://github.com/microsoft/LightGBM/blob/f74875ed60e696ee7d223ddb409e66f51bddbb47/include/LightGBM/utils/common.h#L341 so it seems like your file has that somewhere.

jmoralez avatar Apr 19 '23 21:04 jmoralez

Super Thanks @jmoralez

I have a huge dataset. Is there a way to feed the data the following way? If not, any idea how I can do it?

Let's say I have a huge (TB) dataset (a csv) with 2 columns, A and B. Today I create a csv with, say, 6 columns: A (at t0), A (at t-1), A (at t-2), B (at t0), B (at t-1), B (at t-2), and feed those 6 columns to LightGBM. This works great, but the csv is huge since it has 6 columns, whereas I should be able to stream a csv with only 2 columns, memorize t-1 and t-2 for A and B, and feed the 6 columns to LightGBM even though my file only has 2 columns.

Is there a way to use this smaller csv (with only 2 columns, A and B) and ask LightGBM to use the 2 previous rows as past history? Meaning LightGBM would have A and B at t0, t-1 and t-2.

Thanks

w

wil70 avatar Apr 21 '23 17:04 wil70

> That's coming from here
>
> https://github.com/microsoft/LightGBM/blob/f74875ed60e696ee7d223ddb409e66f51bddbb47/include/LightGBM/utils/common.h#L341
>
> so it seems like your file has that somewhere.

Thanks @jmoralez. Adding the line number to that error would be super useful for those huge input csv files, e.g.:

Log::Fatal("[Line %s] Unknown token %s in data file", lineNb, tmp_str.c_str()); // or maybe [Row %s, Col %s]

Thanks w

wil70 avatar Apr 29 '23 13:04 wil70

I think you might be able to use grep -n (you can use git bash if you're on windows) on the file to find out the line that has that pattern, e.g.

grep -n python_requires python-package/setup.py
# 352:          python_requires='>=3.6',

jmoralez avatar May 02 '23 23:05 jmoralez

Thanks. For a huge dataset it takes way too much time (hours, sometimes days) to run LightGBM and then grep, so I added "line,column" to the error message myself (I'm not a C++ dev, but it seems to work).

Question: Let's say I have a huge dataset (in TB) with 3 columns: A, B, and Label. I create a csv with 7 columns out of those 3: A (at t0), A (at t-1), A (at t-2), B (at t0), B (at t-1), B (at t-2), and Label, and those columns are fed to LightGBM. (Note: A at t-1 is just A from the previous row.)

This dataset is really 3 columns; the other 4 columns are just past data of A and B, which takes a lot of file space and a lot of memory once loaded into LightGBM.

Is there a way to use only the 3 columns (A, B, and Label) and tell LightGBM to also use the 2 previous rows of A and B (past data) as input?

Thanks

w

wil70 avatar May 08 '23 14:05 wil70

I don't think there's a way to do that in LightGBM itself, however it may be possible with https://github.com/microsoft/lightgbm-transform
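As a workaround outside the CLI, the lag columns can be materialized on the fly before training, for example with pandas `shift`. A sketch under the assumptions above (2 feature columns, 2 lags; the column names and toy values are illustrative):

```python
import pandas as pd

# Hypothetical 2-column input; for a TB-scale file this would be read in
# chunks (pd.read_csv(..., chunksize=...)) and written out incrementally.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": [10.0, 20.0, 30.0, 40.0]})

# Add A/B at t-1 and t-2 as extra columns, mirroring the 6-column layout
# described in the question.
for col in ("A", "B"):
    for lag in (1, 2):
        df[f"{col}_t-{lag}"] = df[col].shift(lag)

# The first two rows have no full history yet; drop them before training.
df = df.dropna().reset_index(drop=True)
print(df.columns.tolist())
```

Processing chunk by chunk keeps memory bounded, though the widened csv on disk is still large; lightgbm-transform, as mentioned above, is the option that avoids materializing the extra columns at all.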

jmoralez avatar May 08 '23 16:05 jmoralez

I found the issue: the disk was compressed by Windows, and it seems that above 400GB things go wrong with .NET. Once uncompressed, everything worked well. TY Wil

We can close this issue - TY

wil70 avatar Sep 22 '23 22:09 wil70

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions[bot] avatar Dec 27 '23 00:12 github-actions[bot]