Upgrading to 1.0.0-rc2 results in a large drop in classification performance using LightGBMClassifier.
**Describe the bug**
Updating mmlspark from 1.0.0-rc1-51-df0244c7-SNAPSHOT to 1.0.0-rc2, while keeping all other aspects of my code the same, results in a large drop in validation Average Precision when using LightGBMClassifier: from 0.574 to 0.313.
```python
params = {
    'num_trees': 1000,
    'early_stopping_rounds': 0,
    'feature_fraction': 0.7,
    'l1_reg': 0.0,
    'l2_reg': 0.0,
    'max_depth': -1,
    'num_leaves': 31,
    'is_unbalance': True
}
lgb = LightGBMClassifier(
    featuresCol='features',
    labelCol='Label',
    slotNames=features,
    categoricalSlotNames=idx_cat_cols,
    timeout=12000.0,
    useBarrierExecutionMode=True,
    numIterations=params['num_trees'],
    isUnbalance=params['is_unbalance'],
    earlyStoppingRound=params['early_stopping_rounds'],
    featureFraction=params['feature_fraction'],
    lambdaL1=params['l1_reg'],
    lambdaL2=params['l2_reg'],
    maxDepth=params['max_depth'],
    numLeaves=params['num_leaves']
)
```
**To Reproduce**
I am seeing this result on a private dataset with 140,000,000 rows and 130 feature columns. I am a Microsoft employee, so we can talk offline if more details are needed.
**Expected behavior**
Comparable validation performance between versions.
**Info (please complete the following information):**
- MMLSpark Version: 1.0.0-rc2
- Spark Version: 2.4.5
- Spark Platform: Databricks (runtime 6.6 ML)
If the bug pertains to a specific feature, please tag the appropriate CODEOWNER for better visibility: @imatiach-msft
**Additional context**
Did any underlying default settings change?
Same problem here with the LightGBMRegressor in 1.0.0-rc2, Spark 2.4.5, and Databricks runtime 6.6 ML. Predictions are complete garbage in my use case, while rc1 ran fine.
@tyler-romero @brunocous this is very concerning. I wonder if some new parameter that was added might be causing this difference? Is there any way to get a repro of the issue?
@brunocous do you have unbalanced data too? I reached out to @tyler-romero; he is working on a repro, but his parameters seemed pretty standard other than that his data was unbalanced. I wonder if there might have been some regression for unbalanced data? Just my intuition at this point.
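For context on the unbalanced-data hypothesis: `isUnbalance=True` asks LightGBM to reweight classes internally, roughly equivalent to weighting the positive class by the negative/positive count ratio (the quantity that `scale_pos_weight` expresses). A minimal sketch of that ratio (the function name is made up for illustration):

```python
def pos_neg_ratio(labels):
    """Ratio of negative to positive labels; roughly the weight that
    LightGBM's scale_pos_weight would apply to the minority class."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    return neg / pos

# A heavily unbalanced label column: 1 positive per 99 negatives.
ratio = pos_neg_ratio([1] + [0] * 99)
```

If a regression changed how that internal weighting is applied, it would hit unbalanced datasets much harder than balanced ones.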
@imatiach-msft I'll see if I can reproduce the error on Databricks with a toy dataset.
My specific regression application is predicting a delay, where 95% of entities are on time, although the real business value lies in predicting the other 5% accurately. So yes, it is unbalanced in some sense.
My parameters are:

```python
{
    "objective": "regression",
    "learningRate": 0.5,
    "numLeaves": 150,
    "lambdaL1": 0.2,
    "lambdaL2": 0.9,
    "maxDepth": 35,
    "boostingType": "goss",
    "earlyStoppingRound": 1,
    "numIterations": 110,
}
```
Besides that, I assign a larger weight to delayed samples so that my model picks them up more easily. I tried tuning all of these parameters with hyperparameter tuning frameworks, but not a single combination could achieve the level of performance of the rc1 version.
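As a hypothetical illustration of the weighting scheme described above (the function, threshold, and weight value are assumptions, not the actual code):

```python
def make_sample_weights(delays, threshold=0.0, minority_weight=5.0):
    """Give delayed samples (delay > threshold) a larger weight so the
    model pays more attention to the rare ~5% of late entities."""
    return [minority_weight if d > threshold else 1.0 for d in delays]

# Mostly on-time entities, two delayed ones.
weights = make_sample_weights([0.0, 12.5, 0.0, 3.0])
```

In mmlspark such a weight would typically be materialized as a column and passed via the estimator's `weightCol` parameter.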
@brunocous ah, sorry, you are using the regressor and not the classifier, I should have noticed that. Sure, it's unbalanced, but not in the classification sense of a majority/minority class. Interesting that you are also seeing this issue for the regressor. We do have unit tests that validate metrics like accuracy, so I would have thought they would catch that. I wonder what went wrong with the rc2 release.
Same problem here with the LightGBMClassifier in 1.0.0-rc2, Spark 2.4.3 on GCP. An unbalanced binary classification problem with ~2M instances and ~130 features. To test, I used default hyperparams with isUnbalance=True and earlyStoppingRound=10. In rc1, logloss drops steadily to ~0.5, and early stopping kicks in at iteration ~30 when logloss grows slightly to ~0.55. In rc2, with the exact same code, at around iteration 10 logloss starts exploding, growing rapidly from ~0.55 to ~1.5 within a couple of iterations. Naturally, the results are suboptimal, with AUC on validation folds all over the place (0.70-0.83), whereas in rc1 all folds have AUC ~0.84.
EDIT: I checked a locally compiled version 1.0.0-rc1-82-82e7a8eb and it does not suffer from the issues mentioned above.
@samins "I checked a locally compiled version 1.0.0-rc1-82-82e7a8eb and it does not suffer from the issues mentioned above." Strange, that doesn't make sense to me; I'm really not sure what is happening here then.
Same problem when using LightGBMRegressor on Databricks runtime 6.6 ML. The feature importances look strange, with a lot of zeros.
```
v1.0 importance:  [0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0, 26.0, 41.0, 23.0, 0.0, 13.0, 65.0, 0.0, 0.0, 55.0, 40.0, 0.0, 9.0, 8.0, 1.0, 0.0, 22.0, 1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 21.0, 0.0, 0.0, 0.0, 5.0, 0.0, 76.0, 6.0, 0.0, 23.0, 0.0, 0.0, 0.0, 5.0, 29.0, 9.0, 0.0, 0.0]
v0.17 importance: [287.0, 2115.0, 1143.0, 1462.0, 251.0, 326.0, 447.0, 392.0, 508.0, 513.0, 662.0, 493.0, 490.0, 643.0, 648.0, 467.0, 597.0, 706.0, 669.0, 620.0, 370.0, 181.0, 420.0, 572.0, 213.0, 576.0, 242.0, 890.0, 70.0, 7.0, 447.0, 310.0, 291.0, 938.0, 793.0, 1079.0, 1420.0, 36.0, 48.0, 473.0, 163.0, 0.0, 360.0, 256.0, 742.0, 305.0, 210.0, 149.0]
```
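A quick way to quantify how degenerate the v1.0 importance vector is compared to v0.17 is the fraction of features with zero importance. A sketch, using short excerpts of the vectors above:

```python
def zero_fraction(importances):
    """Fraction of features whose split importance is exactly zero."""
    return sum(1 for v in importances if v == 0.0) / len(importances)

v10_excerpt = [0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0, 26.0, 41.0]  # first 10 of v1.0
v017_excerpt = [287.0, 2115.0, 1143.0, 1462.0, 251.0,
                326.0, 447.0, 392.0, 508.0, 513.0]  # first 10 of v0.17
```

On the full vectors, roughly half of the v1.0 importances are zero while v0.17 has almost none, which suggests the booster in rc2 stopped making useful splits.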
Same parameters used for v1.0.0 and v0.17:
```python
model = LightGBMRegressor(objective='tweedie',
                          alpha=0.2,
                          learningRate=0.3,
                          numLeaves=251,
                          lambdaL1=6.697076092443325,
                          lambdaL2=1.1755840927187936e-06,
                          featureFraction=0.4009710314281902,
                          baggingFraction=0.6888056643745915,
                          baggingFreq=4,
                          validationIndicatorCol='validation',
                          earlyStoppingRound=100).fit(train)
```
@imatiach-msft it seems to me the package was broken somewhere after this commit (82e7a8eb).
Here is the log of a simple regression task using rc3 (same issue with rc2, but rc1 and the commit I referenced above are OK). I used Spark 2.4.7 (Scala 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_265) on GCP (Debian image).
Notice how the l2 loss explodes after one iteration:
```
20/10/15 08:45:38 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task generating dense dataset with 137086 rows and 100 columns
20/10/15 08:45:44 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task generating dense dataset with 362913 rows and 100 columns
20/10/15 08:45:47 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:06 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 0 with result: 0 and is finished: false
20/10/15 08:46:06 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=3.693259343780418
20/10/15 08:46:06 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:14 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 1 with result: 0 and is finished: false
20/10/15 08:46:14 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=87.718556427486
20/10/15 08:46:14 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:21 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 2 with result: 0 and is finished: false
20/10/15 08:46:21 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=1.0238538051084094E7
20/10/15 08:46:21 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:27 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 3 with result: 0 and is finished: false
20/10/15 08:46:27 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=6.060025082346267E11
20/10/15 08:46:27 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:33 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 4 with result: 0 and is finished: false
20/10/15 08:46:33 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=3.4198023818807816E16
20/10/15 08:46:33 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:39 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 5 with result: 0 and is finished: false
20/10/15 08:46:39 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=5.177838695259177E18
20/10/15 08:46:39 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Early stopping, best iteration is 0
```
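For anyone comparing builds, this kind of blow-up is easy to flag automatically. A small sketch (assuming only the `Valid l2=` log format shown above; the helper name and factor are made up) that checks whether the validation loss grows by more than a given factor between consecutive iterations:

```python
import re

# Three consecutive "Valid l2=" readings from the driver log above.
SAMPLE_LOG = """\
... Valid l2=3.693259343780418
... Valid l2=87.718556427486
... Valid l2=1.0238538051084094E7
"""

def diverged(log_text, factor=10.0):
    """Return True if validation l2 grows by more than `factor` between
    any two consecutive logged iterations."""
    losses = [float(m.group(1))
              for m in re.finditer(r"Valid l2=([0-9.Ee+-]+)", log_text)]
    return any(later > earlier * factor
               for earlier, later in zip(losses, losses[1:]))
```

A healthy training run, where the loss drifts down or grows only slightly before early stopping, would not trip this check.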
Maybe it's not going to fix this issue, but it could be a good idea to upgrade the bundled LightGBM lib to the stable 3.1.0 release.
@ekerazha LightGBM 3.1.0 does not seem to be available in maven central.
@samins yes, I need to do another release for 3.1.0, I do the maven releases for lightgbm. I've just been extremely busy with another project release for the past several months.
Was this resolved? I'm having similar issues, and my data is highly imbalanced.
The current release v1.0.0-rc3 still has this issue. The v1.0.0-rc1 version is the latest one without it.
I've released the latest lightgbm and upgraded it. Hopefully that fixes this issue in the latest master, as I'm not able to reproduce it (the metrics in the tests didn't seem to change, and similarly in the notebook I run for performance/accuracy testing).
It would be great to get any updates on this. If anyone has a reproducible example, I would be glad to look into it.
I think this may be a duplicate of this issue: https://github.com/Azure/mmlspark/issues/986, which has been fixed with this PR in the LightGBM repository: microsoft/LightGBM#4185. The latest master build has this fix.
@imatiach-msft Can you point me to the latest jar file that fixes this issue?