
What SHA was used to produce the 1.0.0-rc1 JAR?

Open DCameronMauch opened this issue 3 years ago • 21 comments

We are currently using "com.microsoft.ml.spark" %% "mmlspark" % "1.0.0-rc1", which we are getting from https://mmlspark.azureedge.net/maven.

I want to update this dependency to be compatible with Spark 3 while remaining compatible with our models, all of which were trained with this version.

I did that using the SHA of the commit first tagged with 1.0.0-rc1, but that does not appear to be the commit the JAR was actually built from. Our Spark jobs fail trying to find com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel$$typecreator8$1. When I look in the JAR file I produced, there is no typecreator8, only classes up to typecreator7.

Can you please tell me which commit SHA was used to generate the 1.0.0-rc1 JAR?

I also looked inside the JAR file we downloaded from that repo, but couldn't find anything in there to tell me either. The manifest file doesn't have that information.

AB#1190579

DCameronMauch avatar May 21 '21 15:05 DCameronMauch

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

welcome[bot] avatar May 21 '21 15:05 welcome[bot]

@mhamilton723 may know the exact commit since he does the releases. Regarding wanting to update to be compatible with Spark 3 since all your models were trained with this version: latest master supports Spark 3. Would it be possible to upgrade to latest master and either retrain or reuse the older models? Specifically for LightGBM, you can save the native model file and reload it into a LightGBM model using saveNativeModel/loadNativeModel.

imatiach-msft avatar May 21 '21 15:05 imatiach-msft

@DCameronMauch
PySpark API: https://github.com/Azure/mmlspark/blob/master/src/main/python/mmlspark/lightgbm/mixin.py#L10
Scala API: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMRegressor.scala#L131
Example notebook showing how to use save/load in PySpark: https://github.com/Azure/mmlspark/blob/663d9650d3884ece260a457d9b016088380c2cb9/notebooks/samples/LightGBM%20-%20Overview.ipynb

imatiach-msft avatar May 21 '21 15:05 imatiach-msft

to install mmlspark from master (copied from another thread):

Please try this walkthrough with pictures on Databricks: https://docs.microsoft.com/en-us/azure/cognitive-services/big-data/getting-started#azure-databricks. For Spark 2.4.5 you can use the rc1 to rc3 releases. For the latest Spark 3.0 you will need to use a build from master:

[image: library installation screenshot]

For example:

Maven Coordinates: com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-83-663d9650-SNAPSHOT
Maven Resolver: https://mmlspark.azureedge.net/maven
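For an sbt project, those coordinates would be consumed roughly like this (a sketch; the resolver name is arbitrary, and `%%` appends the Scala binary version, so the project must be on Scala 2.12):

```scala
// build.sbt (sketch): add the MMLSpark resolver, then pull the master
// SNAPSHOT build published to that Maven repository.
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"

libraryDependencies +=
  "com.microsoft.ml.spark" %% "mmlspark" % "1.0.0-rc3-83-663d9650-SNAPSHOT"
```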

imatiach-msft avatar May 21 '21 15:05 imatiach-msft

I was told that the data science team would require several sprints to retrain all our models, and that's time they just don't have. So I need to use a version that is compatible with the models we already trained. I tried using the latest master, but it's not compatible with our models; it generates the error "java.util.NoSuchElementException: Failed to find a default value for actualNumClasses", and I could find no way to "upgrade" our models. So I'm kind of stuck, at least for today, with the 1.0.0-rc1 version. Maybe I can try the SHA right before the 1.0.0-rc2 tag. All our code is Scala, BTW; I'm not very good with Python.

Also, how do you typically build the JARs for release? I just used sbt assembly, but the resulting JAR was huge, since all the dependencies, like Spark, were in there too.

DCameronMauch avatar May 21 '21 15:05 DCameronMauch

@DCameronMauch for this error (java.util.NoSuchElementException: Failed to find a default value for actualNumClasses) I've sent a PR here: https://github.com/Azure/mmlspark/pull/1057/files. Is there a stack trace for that error? Once the build completes I can send you the build coordinates so you can try it out.

imatiach-msft avatar May 21 '21 16:05 imatiach-msft

Also, did you use the saveNativeModel/loadNativeModel APIs or the Spark save/load APIs (https://stackoverflow.com/questions/33027767/save-ml-model-for-future-usage)? I would think that if you used saveNativeModel/loadNativeModel it wouldn't generate that error; perhaps only the Spark save/load does.

imatiach-msft avatar May 21 '21 16:05 imatiach-msft

@DCameronMauch for building I follow this guide: https://github.com/Azure/mmlspark/blob/master/docs/developer-readme.md, mainly sbt setup and sbt compile; I think sbt package builds the JAR after the first two steps.
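On the earlier question about the huge assembly JAR: a common sbt pattern (a sketch, not this project's actual build.sbt; the version numbers are illustrative) is to mark Spark itself as `Provided` so that `sbt assembly` excludes it from the fat JAR:

```scala
// build.sbt (sketch): the cluster supplies Spark at runtime, so mark it
// "provided" and sbt-assembly will leave it out of the fat JAR; only the
// application's own dependencies get bundled.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "3.0.1" % Provided,
  "org.apache.spark" %% "spark-mllib" % "3.0.1" % Provided,
  "com.microsoft.ml.spark" %% "mmlspark" % "1.0.0-rc1" // shipped in the fat JAR
)
```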

imatiach-msft avatar May 21 '21 16:05 imatiach-msft

Here is the full stack trace for the issue when using master build:

```
21/05/14 16:04:02 ERROR Instrumentation: java.util.NoSuchElementException: Failed to find a default value for actualNumClasses
	at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
	at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
	at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
	at org.apache.spark.ml.param.Params.$(params.scala:762)
	at org.apache.spark.ml.param.Params.$$(params.scala:762)
	at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
	at com.microsoft.ml.spark.lightgbm.HasActualNumClasses.getActualNumClasses(LightGBMClassifier.scala:81)
	at com.microsoft.ml.spark.lightgbm.HasActualNumClasses.getActualNumClasses$(LightGBMClassifier.scala:81)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.getActualNumClasses(LightGBMClassifier.scala:86)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.numClasses(LightGBMClassifier.scala:153)
	at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:185)
	at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
	at org.apache.spark.ml.PipelineModel.$anonfun$transformSchema$5(Pipeline.scala:316)
	at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
	at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:316)
	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:309)
	at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:148)
	at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:141)
	at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)
	at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.$anonfun$generateAssetTypePrediction$2(AssetTypePredictionJob.scala:130)
```

DCameronMauch avatar May 21 '21 16:05 DCameronMauch

@DCameronMauch the build for this PR: https://github.com/Azure/mmlspark/pull/1057 is:

Maven Coordinates: com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-85-c987ad0b-SNAPSHOT
Maven Resolver: https://mmlspark.azureedge.net/maven

Please try it out; I hope it fixes your issue.

imatiach-msft avatar May 21 '21 16:05 imatiach-msft

@DCameronMauch it looks like this happens after loading, when calling the transform method. In that case, can you set this parameter on the model after loading it, before calling transform? Actually, I think my fix might not be correct now... maybe the better fix would be to set the actualNumClasses param (if it is not already set) in the transform method using the underlying booster, similar to loadNativeModelFromFile:

```scala
val lightGBMBooster = new LightGBMBooster(text)
val actualNumClasses = lightGBMBooster.numClasses
```

imatiach-msft avatar May 21 '21 16:05 imatiach-msft

I am afraid I do not know this library very well; the data science people do all that work. I'm just trying to make their models work with Spark 3. Trying your PR release now... Nope, different error this time:

```
21/05/21 17:36:31 ERROR LightGBMClassificationModel: {"uid":"LightGBMClassifier_510af2ce3f4f","className":"class com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel","method":"transform","buildVersion":"1.0.0-rc3-85-c987ad0b-SNAPSHOT"}
java.util.NoSuchElementException: Failed to find a default value for lightGBMBooster
	at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
	at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
	at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
	at org.apache.spark.ml.param.Params.$(params.scala:762)
	at org.apache.spark.ml.param.Params.$$(params.scala:762)
	at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
	at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getLightGBMBooster(LightGBMParams.scala:251)
	at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getLightGBMBooster$(LightGBMParams.scala:251)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.getLightGBMBooster(LightGBMClassifier.scala:90)
	at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getModel(LightGBMParams.scala:258)
	at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getModel$(LightGBMParams.scala:258)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.getModel(LightGBMClassifier.scala:90)
	at com.microsoft.ml.spark.lightgbm.LightGBMModelMethods.updateBoosterParamsBeforePredict(LightGBMModelMethods.scala:112)
	at com.microsoft.ml.spark.lightgbm.LightGBMModelMethods.updateBoosterParamsBeforePredict$(LightGBMModelMethods.scala:111)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.updateBoosterParamsBeforePredict(LightGBMClassifier.scala:90)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.$anonfun$transform$1(LightGBMClassifier.scala:110)
	at com.microsoft.ml.spark.logging.BasicLogging.logTransform(BasicLogging.scala:71)
	at com.microsoft.ml.spark.logging.BasicLogging.logTransform$(BasicLogging.scala:68)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.logTransform(LightGBMClassifier.scala:90)
	at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.transform(LightGBMClassifier.scala:109)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$4(Pipeline.scala:311)
	at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:148)
	at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:141)
	at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$3(Pipeline.scala:311)
	at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:310)
	at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:148)
	at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:141)
	at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
	at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)
	at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.$anonfun$generateAssetTypePrediction$2(AssetTypePredictionJob.scala:126)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.generateAssetTypePrediction(AssetTypePredictionJob.scala:124)
	at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.execJob(AssetTypePredictionJob.scala:40)
	at com.reonomy.spark.pipeline.DataJob.main(DataJob.scala:57)
	at com.reonomy.spark.pipeline.ValidatedDerivedJob.main(ReonomySteps.scala:52)
	at com.reonomy.ml.assettype.job.AssetTypePredictionJob.main(AssetTypePredictionJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:936)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1015)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1024)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

I am guessing this is a compatibility issue between the different versions of LightGBM.

So I am back to my original request - the commit SHA used for the 1.0.0-rc1 release.

DCameronMauch avatar May 21 '21 17:05 DCameronMauch

That repository, https://mmlspark.azureedge.net/maven: is there any way for me to get a listing of the available JARs? We may switch to Spark 3.1.1 instead of Spark 3.0.1, since a new AWS EMR image with it was just released.

DCameronMauch avatar May 21 '21 17:05 DCameronMauch

@DCameronMauch no, there is no way to list them; only @mhamilton723 has access to that blob storage, so maybe he can help with that.

I would recommend trying the saveNativeModel/loadNativeModel APIs instead of the Spark save/load for the LightGBM classifier, and then just patching the stage in the Spark pipeline.

It looks like the lightGBMBooster parameter was moved out of LightGBMClassificationModel into LightGBMModelParams at this commit: https://github.com/Azure/mmlspark/commit/840781a2ae6c3e9ee0a065294c893e53df576de7, which is what is causing this error. This was a breaking change. I wonder if there is a way for me to patch this for you as well.

imatiach-msft avatar May 21 '21 18:05 imatiach-msft

So, is it possible to Spark-load the old model, then save it native? Then inside the code do the native load and patch? I am assuming the patch will make it compatible with the current version of LightGBM, yes? Can you tell me more about this patch process?

Also, how can I get ahold of @mhamilton723 to ask about the commit SHA?

BTW, I really appreciate all the support!

DCameronMauch avatar May 21 '21 18:05 DCameronMauch

"So, is it possible to Spark load the old model, then save native? Then inside the code do the load native and patch?" Yes: you can load the old model, call the saveNativeModel method, and then in the new environment call loadNativeModel to create a new LightGBMClassifier. Actually, your already-saved pipeline calls saveNativeModel for the booster, so I think you can just reuse the file in that folder and call loadNativeModel on it. Then you should be able to patch the stage in the pipeline.
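A sketch of that save-native/load-native/patch workflow in Scala (the method names are the ones discussed in this thread; the paths, the stage position, and the exact signatures are assumptions that may differ across mmlspark versions, so treat this as an outline rather than working code):

```scala
import com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel
import org.apache.spark.ml.PipelineModel

// Step 1, old (Spark 2 / rc1) environment: load the Spark-saved pipeline
// and export the booster as a native LightGBM model file.
val oldPipeline = PipelineModel.load("s3://models/asset-type") // hypothetical path
val oldModel = oldPipeline.stages.last.asInstanceOf[LightGBMClassificationModel]
oldModel.saveNativeModel("s3://models/asset-type-native", overwrite = true)

// Step 2, new (Spark 3 / master) environment: rebuild the model from the
// native file and patch it into the loaded pipeline in place. Scala arrays
// are mutable Java arrays, so element assignment works even though the
// `stages` array itself is a val.
val newModel = LightGBMClassificationModel.loadNativeModelFromFile("s3://models/asset-type-native")
val pipeline = PipelineModel.load("s3://models/asset-type")
pipeline.stages(pipeline.stages.length - 1) = newModel
```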

Wait, is there a problem here? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L296

PipelineModel is private in Spark; you can only create one from a Pipeline by calling fit, or by loading it. You also can't point stages at a newly allocated array, because stages is a val: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L298. But Scala arrays are just Java arrays, and stages looks like a public variable, so I think you can set elements on it directly: https://stackoverflow.com/questions/9384817/scala-updating-array-elements. So maybe it will work, since you are not creating a new array, just patching an element.
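The point above, that a val-bound Scala Array still allows element assignment, can be checked with a small standalone sketch (the stage types here are hypothetical stand-ins, not Spark classes):

```scala
// A `val` fixes the array reference, but Scala arrays are plain mutable
// Java arrays, so writing to an element is always allowed.
object PatchDemo {
  trait Stage { def name: String }
  final case class OldModel(name: String) extends Stage
  final case class NewModel(name: String) extends Stage

  // Replace one stage in place; no new array is allocated.
  def patch(stages: Array[Stage], index: Int, replacement: Stage): Unit =
    stages(index) = replacement
}
```

This is the same mechanism that would let you overwrite a single entry of a loaded PipelineModel's stages array.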

imatiach-msft avatar May 21 '21 19:05 imatiach-msft

Hey @DCameronMauch, thanks for your patience! You can find the SHA by going to the tagged branches:

SHA for 1.0.0-rc1: 8d31c026a252677654717768e942e1cf1adc9082
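(Editorial aside: resolving a tag to a commit from the command line is worth sketching here, because `git rev-parse` on an annotated tag returns the tag object rather than the commit, which can itself cause SHA confusion. In a clone of the repo:)

```shell
# Resolve the commit a tag points at. `git rev-list -n 1 <tag>` peels an
# annotated tag down to its underlying commit, whereas `git rev-parse <tag>`
# on an annotated tag yields the tag object's id instead.
git fetch --tags
git rev-list -n 1 1.0.0-rc1
```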

Regarding versions: 1.0.0-rc3 is the last version to support Spark 2. Since then we have switched to Spark 3, and you can grab those versions by looking at the master build GitHub badge.

Here's the latest one to save you the typing (we need to figure out a good copy-paste solution for master builds):

MMLSpark Build and Release Information

Maven Coordinates: com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-83-663d9650-SNAPSHOT

Maven Resolver: https://mmlspark.azureedge.net/maven

Documentation Pages: Scala Documentation Python Documentation

mhamilton723 avatar May 21 '21 20:05 mhamilton723

I looked at the tagged branches and used that SHA. It does not appear to be the SHA the released 1.0.0-rc1 was built with. When I try to load the model back in, it tries and fails to load the com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel$$typecreator8$1 class. Please try an sbt clean compile package, then unpack the JAR and look at the classes present. You will see typecreator classes up to 7, not 8. The SHA used to build that release had an additional internal class.

DCameronMauch avatar May 21 '21 22:05 DCameronMauch

@DCameronMauch did you try my suggestion to patch the pipeline? I'm honestly not sure what causes the typecreator difference; could it be the Java version the code was compiled with? Are there other big differences you see, specifically new parameters or code structure or something similar, that might help us understand what the difference is due to?

imatiach-msft avatar May 24 '21 14:05 imatiach-msft

I wonder if it's possible that it included this PR: https://github.com/Azure/mmlspark/pull/714 at commit 95b7ef006d5cdb77346beb826130dc31239fa1db. I don't do the mmlspark releases, though, so I'm just guessing. Looking at the history, that's the only nearby LightGBM change; otherwise there is also this PR: https://github.com/Azure/mmlspark/pull/712, but that landed in November, several weeks after the rc1 version bump.

imatiach-msft avatar May 24 '21 14:05 imatiach-msft

Gentle ping @DCameronMauch, just wondering if you are unblocked now?

imatiach-msft avatar May 25 '21 18:05 imatiach-msft