SynapseML
Incorrect rawPrediction and probability from scoring test data?
Hi, I was training a binary classification model with mmlspark LightGBM and got very odd rawPrediction and probability values after scoring the test data. I ran the following code:
    from mmlspark.lightgbm import LightGBMClassificationModel
    from mmlspark.train import ComputeModelStatistics, TrainedClassifierModel

    # Load the saved native LightGBM model and score the test DataFrame.
    predictionModel = LightGBMClassificationModel.loadNativeModelFromFile("s3a://cof-risk-ccrm-mad/users/dhq076/rt_v30_rebuld_ml/ums_ndq_201607_ds")
    prediction = predictionModel.transform(test)
    prediction.limit(10).toPandas()
The resulting rawPrediction looks like "[0.8350025657401177, -0.8350025657401177]", and the probability looks like "[1.8350025657401177, -0.8350025657401177]". Although the prediction column is 1 or 0 as expected, the rawPrediction and probability values look wrong. Is this a bug, or is it the expected output? If it's not a bug, how should we interpret the rawPrediction and probability?
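As a quick sanity check on the values quoted above: a probability vector should have every entry in [0, 1] and should sum to 1. The sketch below is plain Python with no Spark dependency; `is_valid_probability` is a hypothetical helper name used only for illustration and is not part of mmlspark.

```python
import math

def is_valid_probability(vec, tol=1e-9):
    """Return True if every entry lies in [0, 1] and the entries sum to 1."""
    in_range = all(-tol <= p <= 1.0 + tol for p in vec)
    sums_to_one = math.isclose(sum(vec), 1.0, abs_tol=tol)
    return in_range and sums_to_one

# Values copied from the report above.
reported_probability = [1.8350025657401177, -0.8350025657401177]

print(is_valid_probability(reported_probability))  # entries fall outside [0, 1]
print(is_valid_probability([0.3, 0.7]))            # a well-formed example vector
```

The reported vector does sum to 1, but both entries fall outside [0, 1], so it cannot be a probability distribution. That pattern would be consistent with a raw score being used where a probability was expected, although the thread itself does not confirm the cause.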
@devilwing0723 What version of mmlspark are you using? I recall this issue was fixed recently. There were actually several related issues: https://github.com/Azure/mmlspark/pull/676 and https://github.com/Azure/mmlspark/pull/578, plus one related PR to LightGBM: https://github.com/microsoft/LightGBM/pull/2356
@imatiach-msft I used the mmlspark_2.11 0.18.1 JAR downloaded from https://jar-download.com/artifacts/com.microsoft.ml.spark/mmlspark_2.11/0.18.1/source-code. The repo documentation says the package should be loaded from Maven, so I first downloaded the same package from the Maven website, but that didn't work for me. The package from the site above did work, and it appeared to be the most recent version.
It looks like 0.18.1 does not have the fix (see https://mvnrepository.com/artifact/com.microsoft.ml.spark/mmlspark_2.11/0.18.1 ): it uses LightGBM 2.2.350, but the fix was in 2.2.400. The RC 1.0 version or the latest snapshot should have the fix. Not sure when the next release will be out; adding @mhamilton723
@imatiach-msft Thanks for the clarification. Can you point me to the most recent version of mmlspark? I'd also like to clarify what the rawPrediction is when I score data with a model trained with the binary objective; it doesn't look like anything I am familiar with. In the local version of LightGBM, we can choose to score the test data as the logit for a binary target. Is this doable in mmlspark LightGBM?
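On interpreting the two columns: Spark ML's convention for binary classifiers is that rawPrediction holds the margin as [-m, m] and probability holds [1 - p, p], where p = sigmoid(m) for a logistic (binary) objective. The sketch below assumes the fixed mmlspark versions follow that convention, which this thread does not explicitly confirm. `sigmoid` and `binary_outputs` are hypothetical helper names, and the margin is the second rawPrediction entry quoted in the original report.

```python
import math

def sigmoid(margin):
    """Logistic function mapping a raw margin (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def binary_outputs(margin):
    """Build Spark-ML-style columns for one row from a raw margin.

    Assumption: rawPrediction = [-m, m] and probability = [1 - p, p].
    """
    p = sigmoid(margin)
    raw_prediction = [-margin, margin]
    probability = [1.0 - p, p]
    prediction = 1.0 if p > 0.5 else 0.0
    return raw_prediction, probability, prediction

# Second rawPrediction entry from the original report, treated as the logit.
margin = -0.8350025657401177
raw, prob, pred = binary_outputs(margin)
print(raw, prob, pred)
```

Under this reading, the second entry of rawPrediction is the logit for the positive class, and a well-formed probability column always has entries in [0, 1].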
I am using com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3 (on Databricks) and am still having the problem. Is this expected? Thank you!

After installing the latest version of mmlspark (coordinate "com.microsoft.ml.spark:mmlspark:1.0.0-rc4", using Spark 3.1.1 and Scala 2.12), things work fine now! Thanks @imatiach-msft!