SynapseML
What SHA was used to produce the 1.0.0-rc1 JAR?
We are currently using "com.microsoft.ml.spark" %% "mmlspark" % "1.0.0-rc1", which we are getting from https://mmlspark.azureedge.net/maven.

I want to update to a Spark 3-compatible build of this version, since all our models were trained with it. I tried building from the SHA of the commit first tagged 1.0.0-rc1, but that does not appear to be the commit the JAR was actually built with. Our Spark jobs are failing trying to find com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel$$typecreator8$1. When I look in the JAR file I produced, there is no typecreator8, only up to 7.

Can you please tell me which commit SHA was used to generate the 1.0.0-rc1 JAR?
I also looked in the JAR file we downloaded from that repo. Couldn't find anything in there to tell me either. The manifest file doesn't have that information.
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
@mhamilton723 may know the exact commit since he does the releases. Regarding "I want to update this version to be compatible with Spark 3 as all our models were trained with this version": latest master supports Spark 3. Would it be possible to upgrade to latest master and retrain, or reuse the older models? Specifically for LightGBM, you can save the native model file and reload it into a LightGBM model using saveNativeModel/loadNativeModel.
@DCameronMauch
pyspark api: https://github.com/Azure/mmlspark/blob/master/src/main/python/mmlspark/lightgbm/mixin.py#L10
java api: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMRegressor.scala#L131
example notebook showing how to use save/load in pyspark: https://github.com/Azure/mmlspark/blob/663d9650d3884ece260a457d9b016088380c2cb9/notebooks/samples/LightGBM%20-%20Overview.ipynb
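For what it's worth, here is a minimal Scala sketch of that round trip. The path, the oldModel value, and the exact saveNativeModel/loadNativeModel signatures are assumptions on my part, so please check them against the linked sources:

import com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel

// In the old (Spark 2) environment: dump the trained booster to a plain-text
// LightGBM file. oldModel is assumed to be a trained LightGBMClassificationModel.
oldModel.saveNativeModel("/tmp/lgbm-model.txt", overwrite = true)

// In the new (Spark 3) environment: rebuild a model from the native file.
val restored = LightGBMClassificationModel.loadNativeModelFromFile("/tmp/lgbm-model.txt")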
To install mmlspark from master (copied from another thread): please try this walkthrough with pictures on Databricks: https://docs.microsoft.com/en-us/azure/cognitive-services/big-data/getting-started#azure-databricks For Spark 2.4.5 you can use the rc1 to rc3 releases. For the latest Spark 3.0 you will need to use a build from master, for example:

Maven Coordinates: com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-83-663d9650-SNAPSHOT
Maven Resolver: https://mmlspark.azureedge.net/maven
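In sbt terms, that would look roughly like this (a sketch based on the coordinates above; swap in whichever master build version you need):

// build.sbt: pull a master snapshot build from the mmlspark resolver
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "1.0.0-rc3-83-663d9650-SNAPSHOT"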
I was told that the data science team would require several sprints to retrain all our models, and that's time they just don't have. So I need to use a version that is compatible with the models we already trained. I tried using the latest master, but it's not compatible with our models; it generates the error: "java.util.NoSuchElementException: Failed to find a default value for actualNumClasses". And I could find no way to "upgrade" our models. So I'm kind of stuck, at least today, with the 1.0.0-rc1 version. Maybe I can try the SHA right before the 1.0.0-rc2 tag. All our code is Scala, BTW; I'm not very good with Python.
Also, how do you typically build the JARs for release? I just used sbt assembly, but the resulting JAR was huge, as all the dependencies, like Spark, were in there too.
@DCameronMauch for this error: java.util.NoSuchElementException: Failed to find a default value for actualNumClasses, I've sent a PR here: https://github.com/Azure/mmlspark/pull/1057/files Is there a stack trace for that error? Once the build completes I can send you the build coordinates so you can try it out.
Also, did you use the saveNativeModel/loadNativeModel or the Spark save/load APIs (https://stackoverflow.com/questions/33027767/save-ml-model-for-future-usage)? I would think that if you used saveNativeModel/loadNativeModel it wouldn't generate that error; perhaps only the Spark save/load does.
@DCameronMauch for building I follow this guide: https://github.com/Azure/mmlspark/blob/master/docs/developer-readme.md Mainly sbt setup and sbt compile; I think sbt package builds the JAR after the first two steps.
Here is the full stack trace for the issue when using the master build:
21/05/14 16:04:02 ERROR Instrumentation: java.util.NoSuchElementException: Failed to find a default value for actualNumClasses
  at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
  at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
  at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
  at org.apache.spark.ml.param.Params.$(params.scala:762)
  at org.apache.spark.ml.param.Params.$$(params.scala:762)
  at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
  at com.microsoft.ml.spark.lightgbm.HasActualNumClasses.getActualNumClasses(LightGBMClassifier.scala:81)
  at com.microsoft.ml.spark.lightgbm.HasActualNumClasses.getActualNumClasses$(LightGBMClassifier.scala:81)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.getActualNumClasses(LightGBMClassifier.scala:86)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.numClasses(LightGBMClassifier.scala:153)
  at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:185)
  at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
  at org.apache.spark.ml.PipelineModel.$anonfun$transformSchema$5(Pipeline.scala:316)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
  at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:316)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
  at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:309)
  at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:148)
  at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:141)
  at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
  at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)
  at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.$anonfun$generateAssetTypePrediction$2(AssetTypePredictionJob.scala:130)
@DCameronMauch the build for this PR: https://github.com/Azure/mmlspark/pull/1057 is:
Maven Coordinates: com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-85-c987ad0b-SNAPSHOT
Maven Resolver: https://mmlspark.azureedge.net/maven
Please try it out; I hope it fixes your issue.
@DCameronMauch it looks like this happens after loading, when calling the transform method. In that case, can you set this parameter on the model after loading it, before calling transform? Actually, I think my fix might not be correct now... maybe the better fix would be to set the actualNumClasses param (if it is not set already) in the transform method by using the underlying booster, similar to loadNativeModelFromFile:
val lightGBMBooster = new LightGBMBooster(text) // text is the native model string
val actualNumClasses = lightGBMBooster.numClasses
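On your side, the workaround might look roughly like this. This is only a sketch: it assumes the loaded pipeline contains a LightGBMClassificationModel stage, that a setActualNumClasses setter exists on it, and that the underlying booster is reachable via getModel; please verify all three against the build you are using:

import org.apache.spark.ml.PipelineModel
import com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel

val pipeline = PipelineModel.load("/path/to/saved/pipeline") // placeholder path
val lgbm = pipeline.stages
  .collectFirst { case m: LightGBMClassificationModel => m }
  .get
// Derive the class count from the underlying booster and set the param
// before calling transform, so numClasses no longer throws.
lgbm.setActualNumClasses(lgbm.getModel.numClasses)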
I am afraid I do not know this library very well; the data science people do all that work. I'm just trying to make their models work with Spark 3. Trying your PR release now... Nope, different error this time:
21/05/21 17:36:31 ERROR LightGBMClassificationModel: {"uid":"LightGBMClassifier_510af2ce3f4f","className":"class com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel","method":"transform","buildVersion":"1.0.0-rc3-85-c987ad0b-SNAPSHOT"}
java.util.NoSuchElementException: Failed to find a default value for lightGBMBooster
  at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
  at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
  at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
  at org.apache.spark.ml.param.Params.$(params.scala:762)
  at org.apache.spark.ml.param.Params.$$(params.scala:762)
  at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
  at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getLightGBMBooster(LightGBMParams.scala:251)
  at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getLightGBMBooster$(LightGBMParams.scala:251)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.getLightGBMBooster(LightGBMClassifier.scala:90)
  at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getModel(LightGBMParams.scala:258)
  at com.microsoft.ml.spark.lightgbm.LightGBMModelParams.getModel$(LightGBMParams.scala:258)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.getModel(LightGBMClassifier.scala:90)
  at com.microsoft.ml.spark.lightgbm.LightGBMModelMethods.updateBoosterParamsBeforePredict(LightGBMModelMethods.scala:112)
  at com.microsoft.ml.spark.lightgbm.LightGBMModelMethods.updateBoosterParamsBeforePredict$(LightGBMModelMethods.scala:111)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.updateBoosterParamsBeforePredict(LightGBMClassifier.scala:90)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.$anonfun$transform$1(LightGBMClassifier.scala:110)
  at com.microsoft.ml.spark.logging.BasicLogging.logTransform(BasicLogging.scala:71)
  at com.microsoft.ml.spark.logging.BasicLogging.logTransform$(BasicLogging.scala:68)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.logTransform(LightGBMClassifier.scala:90)
  at com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel.transform(LightGBMClassifier.scala:109)
  at org.apache.spark.ml.PipelineModel.$anonfun$transform$4(Pipeline.scala:311)
  at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:148)
  at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:141)
  at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
  at org.apache.spark.ml.PipelineModel.$anonfun$transform$3(Pipeline.scala:311)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
  at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:310)
  at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:148)
  at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:141)
  at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
  at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
  at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)
  at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.$anonfun$generateAssetTypePrediction$2(AssetTypePredictionJob.scala:126)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.generateAssetTypePrediction(AssetTypePredictionJob.scala:124)
  at com.reonomy.ml.assettype.job.AssetTypePredictionJob$.execJob(AssetTypePredictionJob.scala:40)
  at com.reonomy.spark.pipeline.DataJob.main(DataJob.scala:57)
  at com.reonomy.spark.pipeline.ValidatedDerivedJob.main(ReonomySteps.scala:52)
  at com.reonomy.ml.assettype.job.AssetTypePredictionJob.main(AssetTypePredictionJob.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:936)
  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
  at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1015)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1024)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am guessing this is a compatibility issue between the different versions of LightGBM. So I am back to my original request: the commit SHA used for the 1.0.0-rc1 release.
For that repository, https://mmlspark.azureedge.net/maven, is there any way for me to get a listing of the JARs available?
We may be switching to Spark 3.1.1 instead of Spark 3.0.1, since a new AWS/EMR image with that was just released.
@DCameronMauch no, there is no way to list it; only @mhamilton723 has access to that blob storage, so maybe he can help with that.
I would recommend trying the saveNativeModel/loadNativeModel APIs instead of the Spark save/load for the LightGBM classifier, and then just patching the stage in the Spark pipeline.
It looks like the lightGBMBooster parameter was moved out of LightGBMClassificationModel into LightGBMModelParams at this commit, which is what is causing this error: https://github.com/Azure/mmlspark/commit/840781a2ae6c3e9ee0a065294c893e53df576de7 This was a breaking change. I wonder if there is a way for me to patch this for you as well.
So, is it possible to Spark-load the old model, then save it native? Then inside the code do the native load and patch? I am assuming the patch will make it compatible with the current version of LightGBM, yes? Can you tell me more about this patch process?
Also, how can I get ahold of @mhamilton723 to ask about the commit SHA?
BTW, I really appreciate all the support!
"So, is it possible to Spark load the old model, then save native? Then inside the code do the load native and patch" Yes, so you can load the old model, call the saveNativeModel method, and then in the new environment call loadNativeModel to create a new LightGBMClassifier. Actually, in your already saved pipeline we call saveNativeModel for the booster so I think you can just reuse the file that's in that folder and call loadNativeModel on it. Then you should be able to patch the stage in the pipeline.
Wait, is there a problem here? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L296
PipelineModel's constructor is private in Spark; you can only create one from a Pipeline by calling fit, or by loading a saved one. And you can't assign a newly allocated array to stages, since stages is a val: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L298 But Scala arrays are just Java arrays, and stages is a public member, so I think you can just set an element on it: https://stackoverflow.com/questions/9384817/scala-updating-array-elements So it should work, since you are not replacing the array but just patching an element.
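Putting the two suggestions together, the patch might look roughly like this. The paths, the location of the native booster file, and calling loadNativeModelFromFile with just a filename are all illustrative assumptions:

import org.apache.spark.ml.PipelineModel
import com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel

val pipeline = PipelineModel.load("/path/to/old/pipeline")

// Rebuild the LightGBM stage from the native booster file that was written
// when the pipeline was saved.
val patchedStage = LightGBMClassificationModel.loadNativeModelFromFile("/path/to/native/model.txt")

// stages is a val, so the array reference cannot be reassigned, but the
// underlying Java array is mutable, so an element can be overwritten in place.
val idx = pipeline.stages.indexWhere(_.isInstanceOf[LightGBMClassificationModel])
pipeline.stages(idx) = patchedStage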
Hey @DCameronMauch, thanks for your patience. You can find the SHA by going to the tagged branches:
SHA for 1.0.0-rc1: 8d31c026a252677654717768e942e1cf1adc9082
Regarding versions, 1.0.0-rc3 is the last version to support Spark 2; since then we have switched to Spark 3, and you can grab those versions by looking at the master build GitHub badge.
Here's the latest one to save you the typing (we need to figure out a good copy-paste solution for master builds):
MMLSpark Build and Release Information
Maven Coordinates: com.microsoft.ml.spark:mmlspark_2.12:1.0.0-rc3-83-663d9650-SNAPSHOT
Maven Resolver: https://mmlspark.azureedge.net/maven
Documentation Pages: Scala Documentation, Python Documentation
I looked at the tagged branches and used that SHA. It does not appear to be the SHA that the released 1.0.0-rc1 was built with. When I try to load the model back in, it tries and fails to load the com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel$$typecreator8$1 class. Please try an sbt clean compile package at that SHA, and then use jar tf to list the classes present. You will see typecreator classes only up to 7, not 8. The SHA used to build that release had an additional internal class.
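For reference, a small Scala snippet can do the same inspection as the command line (the jar path is a placeholder):

import java.util.jar.JarFile
import scala.collection.JavaConverters._ // Scala 2.12 converters

// Print every typecreator class compiled into the jar.
new JarFile("path/to/mmlspark.jar").entries().asScala
  .map(_.getName)
  .filter(_.contains("LightGBMClassificationModel$$typecreator"))
  .foreach(println)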
@DCameronMauch did you try my suggestion to patch the pipeline? I'm honestly not sure where the typecreator difference comes from; could it be the Java version the code was compiled with? Are there other big differences you see, specifically new parameters or code structure or something similar, that might help us understand what the difference might be due to?
I wonder if it's possible that it included this PR: https://github.com/Azure/mmlspark/pull/714 at commit 95b7ef006d5cdb77346beb826130dc31239fa1db. I don't do the mmlspark releases though, so I'm just guessing. Looking at the history, that's the only nearby LightGBM change; there is also this PR: https://github.com/Azure/mmlspark/pull/712, but that was in November already, several weeks after the rc1 version bump.
Gentle ping @DCameronMauch, just wondering if you are unblocked now?