SynapseML Upgrading to 1.0.0-rc2 results in a large drop in classification performance using LightGBMClassifier.

Describe the bug Updating mmlspark from 1.0.0-rc1-51-df0244c7-SNAPSHOT to 1.0.0-rc2, while keeping all other aspects of my code the same, results in a large drop in validation Average Precision when using LightGBMClassifier: from 0.574 to 0.313.

params = {
  'num_trees': 1000,
  'early_stopping_rounds': 0,
  'feature_fraction': 0.7,
  'l1_reg': 0.0,
  'l2_reg': 0.0,
  'max_depth': -1,
  'num_leaves': 31,
  'is_unbalance': True
}

lgb = LightGBMClassifier(
  featuresCol='features',
  labelCol='Label',
  slotNames=features,
  categoricalSlotNames=idx_cat_cols,
  timeout=12000.0,
  useBarrierExecutionMode=True,
  numIterations=params['num_trees'],
  isUnbalance=params['is_unbalance'],
  earlyStoppingRound=params['early_stopping_rounds'],
  featureFraction=params['feature_fraction'],
  lambdaL1=params['l1_reg'],
  lambdaL2=params['l2_reg'],
  maxDepth=params['max_depth'],
  numLeaves=params['num_leaves']
)

To Reproduce I am seeing this result on a private dataset with 140,000,000 rows and 130 feature columns. I am a Microsoft employee so we can talk offline if more details are needed.

Expected behavior Comparable validation performance between versions.

Info (please complete the following information):

MMLSpark Version: 1.0.0-rc2
Spark Version: 2.4.5
Spark Platform: Databricks (runtime 6.6 ML)

If the bug pertains to a specific feature please tag the appropriate CODEOWNER for better visibility @imatiach-msft

Additional context Did any underlying default settings change?
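
One way to check this is to dump the effective parameters of the estimator under each version and diff them. The sketch below is a minimal illustration, not a confirmed diagnosis: `diff_params` is a hypothetical helper, and the two dictionaries are made-up placeholders rather than the real rc1 and rc2 defaults. With PySpark, the real values could be collected through the standard `extractParamMap()` method, as noted in the comments.

```python
def diff_params(old, new):
    """Return {name: (old_value, new_value)} for every parameter that was
    added, removed, or changed between two parameter dictionaries."""
    missing = object()  # sentinel for "parameter not present in this version"
    changes = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name, missing)
        after = new.get(name, missing)
        if before != after:
            changes[name] = (
                "<absent>" if before is missing else before,
                "<absent>" if after is missing else after,
            )
    return changes


# In a Spark session, each dictionary could be built from the estimator
# (run once per installed version and save the result), for example:
#   {p.name: v for p, v in lgb.extractParamMap().items()}
# The values below are placeholders for illustration only.
params_old = {"numLeaves": 31, "featureFraction": 0.7, "isUnbalance": True}
params_new = {"numLeaves": 31, "featureFraction": 0.7, "isUnbalance": True,
              "someNewParam": 1.0}  # hypothetical parameter name

for name, (before, after) in diff_params(params_old, params_new).items():
    print(f"{name}: {before} -> {after}")
```

An empty result would suggest that the regression comes from a behavior change rather than from a changed or newly added parameter default.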

Sep 02 '20 19:09 tyler-romero

Same problem here with the LightGBMRegressor in 1.0.0-rc2, Spark 2.4.5 and Databricks runtime 6.6 ML. Complete garbage predictions in my use case, while rc1 was running fine.

Sep 03 '20 19:09 brunocous

@tyler-romero @brunocous this is very concerning, I wonder if some new parameter that was added might be causing this difference? Any way to get a repro of the issue?

Sep 04 '20 02:09 imatiach-msft

@brunocous do you have unbalanced data too? I reached out to @tyler-romero; he is working on a repro, but his parameters seemed pretty standard other than that his data was unbalanced. I wonder if there might have been some regression for unbalanced data? - just my intuition at this point.

Sep 04 '20 17:09 imatiach-msft

@imatiach-msft I'll see if I can reproduce the error on Databricks with a toy dataset.

My specific regression application is predicting a delay, where 95% of entities are on time, although the real business value lies in predicting the other 5% accurately. So yes, it is unbalanced in some sense?

My parameters are:

        "objective": "regression",
        "learningRate": 0.5,
        "numLeaves": 150,
        "lambdaL1": 0.2,
        "lambdaL2": 0.9,
        "maxDepth": 35,
        "boostingType": "goss",
        "earlyStoppingRound": 1,
        "numIterations": 110,

Besides that, I assign a larger weight to delayed samples so that my model picks them up more easily. I tried varying all of these parameters using hyperparameter tuning frameworks, but not a single combination could match the performance of the rc1 version.

Sep 04 '20 21:09 brunocous

@brunocous ah, sorry, you are using regressor and not classifier, I should have noticed that - sure it's unbalanced but not in the classification sense of a majority/minority class. Interesting that you are also seeing this issue for the regressor. We do have unit tests that validate metrics like accuracy, so I would have thought they would catch that. I wonder what went wrong with the rc2 release.

Sep 04 '20 21:09 imatiach-msft

Same problem here with the LightGBMClassifier in 1.0.0-rc2, Spark 2.4.3 on GCP: an unbalanced binary classification problem with ~2M instances and ~130 features. To test, I used default hyperparameters with isUnbalance=True and earlyStoppingRound=10. In rc1, logloss drops steadily to ~0.5, and early stopping kicks in at iteration ~30 when logloss grows slightly to ~0.55. In rc2, with the exact same code, logloss starts exploding at iteration ~10, growing rapidly from ~0.55 to ~1.5 in a couple of iterations. Naturally, results are suboptimal, with AUC on the validation folds all over the place (0.7-0.83), whereas in rc1 all folds have AUC ~0.84.

EDIT: I checked a locally compiled version 1.0.0-rc1-82-82e7a8eb and it does not suffer from the issues mentioned above.

Sep 10 '20 11:09 samins

@samins "I checked a locally compiled version 1.0.0-rc1-82-82e7a8eb and it does not suffer from the issues mentioned above." strange, that doesn't make sense to me, I'm really not sure what is happening here then

Sep 18 '20 04:09 imatiach-msft

Same problem when using LightGBMRegressor on Databricks runtime 6.6 ML. The feature importances look strange: many of them are zero.

v1.0 importance: [0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0, 26.0, 41.0, 23.0, 0.0, 13.0, 65.0, 0.0, 0.0, 55.0, 40.0, 0.0, 9.0, 8.0, 1.0, 0.0, 22.0, 1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 21.0, 0.0, 0.0, 0.0, 5.0, 0.0, 76.0, 6.0, 0.0, 23.0, 0.0, 0.0, 0.0, 5.0, 29.0, 9.0, 0.0, 0.0]

v0.17 importance: [287.0, 2115.0, 1143.0, 1462.0, 251.0, 326.0, 447.0, 392.0, 508.0, 513.0, 662.0, 493.0, 490.0, 643.0, 648.0, 467.0, 597.0, 706.0, 669.0, 620.0, 370.0, 181.0, 420.0, 572.0, 213.0, 576.0, 242.0, 890.0, 70.0, 7.0, 447.0, 310.0, 291.0, 938.0, 793.0, 1079.0, 1420.0, 36.0, 48.0, 473.0, 163.0, 0.0, 360.0, 256.0, 742.0, 305.0, 210.0, 149.0]

The same parameters were used for v1.0.0 and v0.17:

model = LightGBMRegressor(
  objective='tweedie',
  alpha=0.2,
  learningRate=0.3,
  numLeaves=251,
  lambdaL1=6.697076092443325,
  lambdaL2=1.1755840927187936e-06,
  featureFraction=0.4009710314281902,
  baggingFraction=0.6888056643745915,
  baggingFreq=4,
  validationIndicatorCol='validation',
  earlyStoppingRound=100
).fit(train)
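
To put a number on how different the two importance vectors are, the sketch below counts the features that never receive any importance. It is plain Python over the values quoted in this comment. `summarize` is a hypothetical helper, and reading these numbers as split-count importances is an assumption.

```python
# Feature importances as quoted in this comment (48 features each).
v1_0 = [0.0, 0.0, 0.0, 17.0, 0.0, 0.0, 0.0, 0.0, 26.0, 41.0, 23.0, 0.0,
        13.0, 65.0, 0.0, 0.0, 55.0, 40.0, 0.0, 9.0, 8.0, 1.0, 0.0, 22.0,
        1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 21.0, 0.0, 0.0, 0.0, 5.0, 0.0,
        76.0, 6.0, 0.0, 23.0, 0.0, 0.0, 0.0, 5.0, 29.0, 9.0, 0.0, 0.0]
v0_17 = [287.0, 2115.0, 1143.0, 1462.0, 251.0, 326.0, 447.0, 392.0, 508.0,
         513.0, 662.0, 493.0, 490.0, 643.0, 648.0, 467.0, 597.0, 706.0,
         669.0, 620.0, 370.0, 181.0, 420.0, 572.0, 213.0, 576.0, 242.0,
         890.0, 70.0, 7.0, 447.0, 310.0, 291.0, 938.0, 793.0, 1079.0,
         1420.0, 36.0, 48.0, 473.0, 163.0, 0.0, 360.0, 256.0, 742.0,
         305.0, 210.0, 149.0]


def summarize(importances):
    """Count features, features with zero importance, and total importance."""
    unused = sum(1 for value in importances if value == 0.0)
    return {
        "features": len(importances),
        "unused": unused,
        "total": sum(importances),
    }


for label, importances in (("v1.0", v1_0), ("v0.17", v0_17)):
    print(label, summarize(importances))
```

If the values are split counts, a much smaller total under v1.0 would be consistent with far fewer trees being built before early stopping, which fits the diverging validation loss others report in this thread.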

Sep 25 '20 06:09 Zeyu1994

@imatiach-msft it seems to me the package was broken somewhere after this commit (82e7a8eb).

Here is the log of a simple regression task using rc3 (same issue with rc2, but rc1 and the commit I referenced above are OK). I used Spark 2.4.7 (Scala 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_265) on GCP (Debian image).

Notice how the l2 loss explodes after one iteration:

20/10/15 08:45:38 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task generating dense dataset with 137086 rows and 100 columns
20/10/15 08:45:44 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task generating dense dataset with 362913 rows and 100 columns
20/10/15 08:45:47 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:06 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 0 with result: 0 and is finished: false
20/10/15 08:46:06 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=3.693259343780418
20/10/15 08:46:06 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:14 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 1 with result: 0 and is finished: false
20/10/15 08:46:14 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=87.718556427486
20/10/15 08:46:14 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:21 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 2 with result: 0 and is finished: false
20/10/15 08:46:21 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=1.0238538051084094E7
20/10/15 08:46:21 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:27 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 3 with result: 0 and is finished: false
20/10/15 08:46:27 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=6.060025082346267E11
20/10/15 08:46:27 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:33 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 4 with result: 0 and is finished: false
20/10/15 08:46:33 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=3.4198023818807816E16
20/10/15 08:46:33 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM task calling LGBM_BoosterUpdateOneIter
20/10/15 08:46:39 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: LightGBM running iteration: 5 with result: 0 and is finished: false
20/10/15 08:46:39 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=5.177838695259177E18
20/10/15 08:46:39 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Early stopping, best iteration is 0
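
The divergence in this log can be pulled out mechanically. The following is a minimal sketch, assuming the `Valid l2=<value>` line format shown here. `extract_valid_l2` is a hypothetical helper, and the embedded sample is a copy of only the `Valid l2` lines from the log.

```python
import re

# Only the "Valid l2" lines from the log, copied verbatim.
LOG = """\
20/10/15 08:46:06 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=3.693259343780418
20/10/15 08:46:14 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=87.718556427486
20/10/15 08:46:21 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=1.0238538051084094E7
20/10/15 08:46:27 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=6.060025082346267E11
20/10/15 08:46:33 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=3.4198023818807816E16
20/10/15 08:46:39 INFO com.microsoft.ml.spark.lightgbm.LightGBMRegressor: Valid l2=5.177838695259177E18
"""

# Matches plain decimals and Java-style scientific notation (e.g. 1.02E7).
VALID_L2 = re.compile(r"Valid l2=([0-9.eE+-]+)")


def extract_valid_l2(log_text):
    """Return the validation l2 values in the order they were logged."""
    return [float(match.group(1)) for match in VALID_L2.finditer(log_text)]


losses = extract_valid_l2(LOG)
best_iteration = min(range(len(losses)), key=losses.__getitem__)
print("best iteration:", best_iteration)

# Growth factor of the validation loss from one iteration to the next.
for i in range(1, len(losses)):
    factor = losses[i] / losses[i - 1]
    print(f"iteration {i}: l2 grew by a factor of {factor:.3g}")
```

Running the same extraction on an rc1 log of the same job would give a direct side-by-side comparison of the two loss curves.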

Oct 15 '20 08:10 samins

Maybe it's not going to fix this issue, but it could be a good idea to upgrade the bundled LightGBM lib to version 3.1.0 stable.

Nov 25 '20 18:11 ekerazha

@ekerazha LightGBM 3.1.0 does not seem to be available in maven central.

Dec 04 '20 13:12 samins

@samins yes, I need to do another release for 3.1.0, I do the maven releases for lightgbm. I've just been extremely busy with another project release for the past several months.

Dec 04 '20 15:12 imatiach-msft

Was this resolved? I'm having similar issues, and my data is highly imbalanced.

Mar 10 '21 16:03 shaimamajeed

The current release v1.0.0-rc3 still has this issue. The v1.0.0-rc1 version is the latest one without it.

Mar 11 '21 09:03 brunocous

I've released the latest LightGBM and upgraded the bundled version. I hope this fixes the issue in the latest master, as I'm not able to reproduce it (the metrics in the tests didn't seem to change, nor did those in the notebook I run for performance/accuracy testing).

Apr 19 '21 05:04 imatiach-msft

It would be great to get any updates on this. If anyone has a reproducible example, I would be glad to look into it.

Apr 21 '21 06:04 imatiach-msft

I think this may be a duplicate of https://github.com/Azure/mmlspark/issues/986, which was fixed by microsoft/LightGBM#4185 in the LightGBM repository. The latest master build includes this fix.

May 03 '21 04:05 imatiach-msft

@imatiach-msft Can you point me to the latest jar file that fixes this issue?

Sep 09 '21 00:09 AllardJM