
java.lang.UnsatisfiedLinkError: com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle [BUG]

Open timpiperseek opened this issue 2 years ago • 13 comments

SynapseML version

0.10.0

System information

  • Language version: python 3.8.10
  • Spark Version: 3.2.1
  • Spark Platform Synapse - Databricks

I have followed the installation instructions for Databricks as described at https://microsoft.github.io/SynapseML/docs/getting_started/installation/#databricks.
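
For reference, a minimal sketch of the equivalent Spark session configuration (the coordinate and resolver below are the ones from those instructions and this thread; the app name is a placeholder, and on Databricks the library is normally attached through the cluster Libraries UI rather than in code):

import pyspark

# Hedged sketch: SynapseML 0.10.0 pulled in via spark.jars.packages.
# On Databricks the same Maven coordinate is usually added as a cluster library;
# the resolver is the SynapseML Maven feed from the installation docs.
spark = (
    pyspark.sql.SparkSession.builder.appName("synapseml-install-example")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.0")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)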

Describe the problem

Converting the predictions dataset to NumPy results in an unsatisfied link error: java.lang.UnsatisfiedLinkError: com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle

I think the final np.array(predictions.select('probability').collect()) call in the snippet below is the line that is causing the error.

I find it hard to believe that my code is particularly novel, so either I'm not the only one hitting this issue, or I'm doing something wrong.

Code to reproduce issue

from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml import Pipeline


# get column names and definitions
string_cols = [c for c, t in df.dtypes if t == 'string']
string_index = [f"{s}_index" for s in string_cols]
numeric_cols = [c for c, t in df.dtypes if t != 'string']
numeric_cols.remove('objective')


stringIndexer = StringIndexer(inputCols=string_cols, outputCols=string_index, handleInvalid="keep")
featurizer = VectorAssembler(inputCols=numeric_cols + string_index, outputCol="features", handleInvalid="keep")

data_pipeline = Pipeline(stages=[stringIndexer, featurizer])

data = data_pipeline.fit(df).transform(df).select("objective", "features")

# split into train and test  
train, test = data.randomSplit([0.90, 0.10], seed=1)

from synapse.ml.lightgbm import LightGBMClassifier
param = {
    'featuresCol': "features",
    'labelCol': "objective",
    'zeroAsMissing': False,
    'objective': 'binary',
    'metric': 'binary',
    'verbosity': 0,
    'isUnbalance': True,
    'useBarrierExecutionMode': True,  # workaround for a known issue, see https://github.com/microsoft/SynapseML/issues/1534
    'learningRate': 0.019960206745150144,
    'posBaggingFraction': 0.741400512824773,
    'negBaggingFraction': 0.9592530174926162,
    'lambdaL1': 7.222372408024596e-07,
    'lambdaL2': 8.048479891726644e-08,
    'numLeaves': 231,
    'featureFraction': 0.7013476730404191,
    'baggingFraction': 0.9473274453520037,
    'baggingFreq': 7,
    'minDataInLeaf': 30,
}

lgb_class = LightGBMClassifier(**param)
model = lgb_class.fit(train)        
predictions = model.transform(test)

import numpy as np
prediction_probability = np.array(predictions.select('probability').collect())
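
For reference, an equivalent way to pull the probability vectors into NumPy, shown as a minimal sketch assuming Spark 3.x's pyspark.ml.functions.vector_to_array (this does not avoid the error, since the same scoring UDF still runs; it is only a tidier collection path):

# Hedged sketch: flatten the probability vector column and collect via pandas.
# vector_to_array is available in Spark >= 3.0; the scoring UDF still executes,
# so this does not work around the UnsatisfiedLinkError itself.
from pyspark.ml.functions import vector_to_array
import numpy as np

prob_pdf = (
    predictions
    .select(vector_to_array("probability").alias("probability"))
    .toPandas()
)
prediction_probability = np.array(prob_pdf["probability"].tolist())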

Other info / logs


Py4JJavaError                             Traceback (most recent call last)
      1 import numpy as np
----> 2 f = np.array(predictions.select('probability').collect())

/databricks/spark/python/pyspark/sql/dataframe.py in collect(self) 713 # Default path used in OSS Spark / for non-DF-ACL clusters: 714 with SCCallSiteSync(self._sc) as css: --> 715 sock_info = self._jdf.collectToPython() 716 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer()))) 717

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in call(self, *args) 1302 1303 answer = self.gateway_client.send_command(command) -> 1304 return_value = get_return_value( 1305 answer, self.gateway_client, self.target_id, self.name) 1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 115 def deco(*a, **kw): 116 try: --> 117 return f(*a, **kw) 118 except py4j.protocol.Py4JJavaError as e: 119 converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) 325 if answer[1] == REFERENCE_TYPE: --> 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". 328 format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o11576.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 73 in stage 55.0 failed 4 times, most recent failure: Lost task 73.3 in stage 55.0 (TID 21878) (10.30.252.53 executor 32): org.apache.spark.SparkException: Failed to execute user defined function (LightGBMClassificationModel$$Lambda$7979/1876654359: (struct<type:tinyint,size:int,indices:array,values:array>) => struct<type:tinyint,size:int,indices:array,values:array>) at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:168) at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80) at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156) at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.Task.run(Task.scala:95) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:825) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1655) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:828) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.UnsatisfiedLinkError: com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle()J at com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle(Native Method) at com.microsoft.ml.lightgbm.lightgbmlib.voidpp_handle(lightgbmlib.java:628) at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler$.com$microsoft$azure$synapse$ml$lightgbm$booster$BoosterHandler$$createBoosterPtrFromModelString(LightGBMBooster.scala:42) at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler.(LightGBMBooster.scala:64) at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler$lzycompute(LightGBMBooster.scala:237) at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler(LightGBMBooster.scala:232) at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.score(LightGBMBooster.scala:396) at 
com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.predictProbability(LightGBMClassifier.scala:178) at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.$anonfun$transform$4(LightGBMClassifier.scala:138) ... 23 more

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3029) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2976) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2970) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2970) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1390) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1390) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1390) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3238) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3179) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3167) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1152) at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2638) at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:241) at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:276) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:81) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:87) at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:75) at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:62) at org.apache.spark.sql.execution.ResultCacheManager.collectResult$1(ResultCacheManager.scala:611) at org.apache.spark.sql.execution.ResultCacheManager.computeResult(ResultCacheManager.scala:618) at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:561) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:560) at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:457) at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:436) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:422) at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3739) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3951) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:240) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:388) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:187) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:142) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:338) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3949) at 
org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3737) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) at py4j.Gateway.invoke(Gateway.java:295) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:251) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.SparkException: Failed to execute user defined function (LightGBMClassificationModel$$Lambda$7979/1876654359: (struct<type:tinyint,size:int,indices:array,values:array>) => struct<type:tinyint,size:int,indices:array,values:array>) at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:168) at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80) at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156) at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.Task.run(Task.scala:95) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:825) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1655) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:828) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 
1 more Caused by: java.lang.UnsatisfiedLinkError: com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle()J at com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle(Native Method) at com.microsoft.ml.lightgbm.lightgbmlib.voidpp_handle(lightgbmlib.java:628) at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler$.com$microsoft$azure$synapse$ml$lightgbm$booster$BoosterHandler$$createBoosterPtrFromModelString(LightGBMBooster.scala:42) at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler.(LightGBMBooster.scala:64) at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler$lzycompute(LightGBMBooster.scala:237) at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler(LightGBMBooster.scala:232) at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.score(LightGBMBooster.scala:396) at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.predictProbability(LightGBMClassifier.scala:178) at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.$anonfun$transform$4(LightGBMClassifier.scala:138) ... 23 more

What component(s) does this bug affect?

  • [ ] area/cognitive: Cognitive project
  • [ ] area/core: Core project
  • [ ] area/deep-learning: DeepLearning project
  • [ ] area/lightgbm: Lightgbm project
  • [ ] area/opencv: Opencv project
  • [ ] area/vw: VW project
  • [ ] area/website: Website
  • [ ] area/build: Project build system
  • [ ] area/notebooks: Samples under notebooks folder
  • [ ] area/docker: Docker usage
  • [ ] area/models: models related issue

What language(s) does this bug affect?

  • [ ] language/scala: Scala source code
  • [X] language/python: Pyspark APIs
  • [ ] language/r: R APIs
  • [ ] language/csharp: .NET APIs
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/synapse: Azure Synapse integrations
  • [ ] integrations/azureml: Azure ML integrations
  • [X] integrations/databricks: Databricks integrations

AB#1911075

timpiperseek avatar Aug 03 '22 07:08 timpiperseek

Hey @timpiperseek 👋! Thank you so much for reporting the issue/feature request 🚨. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

github-actions[bot] avatar Aug 03 '22 07:08 github-actions[bot]

Hi @timpiperseek, thank you for reporting this issue.

To help us repro this issue, can you please share the dataset you used? (df)

Thanks!

JessicaXYWang avatar Aug 04 '22 06:08 JessicaXYWang

@timpiperseek Could you please try version: 0.10.0-6-4868e8bf-SNAPSHOT to see if the error still exists?

serena-ruan avatar Aug 04 '22 08:08 serena-ruan

Hi, I'm having the same problem with SynapseML (0.10.0) LightGBM on Databricks.

I previously thought I was getting this error whenever I used a bigger dataset, but I recently found out that it occurs randomly... I ran exactly the same notebook 6 times in a row and got the error twice, while for the other 4 runs I was able to successfully train the model and obtain predictions. However, when I later tried to use these saved models to make predictions again (on the same test dataset), I got the error 5 times in a row and gave up.

I get the error regardless of model parameters. Training and saving the model always succeeds, but I get the error during the prediction phase. Each time I evaluate my model twice right after training, and sometimes the error occurs only at the last step (I train the model, obtain predictions and a metric value once, and then get the error while evaluating the model a second time).

I'm using a very simplified version of my code, so I don't think the issue is with the code itself:

from synapse.ml.lightgbm import LightGBMClassifier
from pyspark.ml.evaluation import RegressionEvaluator

# TRAIN PHASE
data = spark.table('train_dataset')

model_params = {
    'featuresCol': 'features',
    'labelCol': TARGET_COL,
}
model = LightGBMClassifier(**model_params)
model = model.fit(data)

# TEST PHASE
test = spark.table('test_dataset')
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol=TARGET_COL, metricName="mae")
agg_evaluator = RegressionEvaluator(predictionCol='agg_prediction', labelCol=AGG_TARGET_COL, metricName="mae")

predictions = model.transform(test)

# I usually get the error here 
mae = evaluator.evaluate(predictions) 

# I'm doing some aggregations on my dataset and obtained predictions - for my use case it makes sense
agg_predictions = predictions.groupby(GROUPBY_COL).agg(...)  
# And sometimes I get the error here, while the previous mae calculation is successful
agg_mae = agg_evaluator.evaluate(agg_predictions) 

My train dataset has approx. 30 million rows and around 20 features, and the test dataset is 3 times smaller. However, I also got the error a couple of times when I was just testing my solution on smaller data samples, e.g. 10,000 rows (for both training and testing), so I also don't think it's due to dataset size.

I also have one additional observation, though I don't know whether it's relevant in any way. When I start the cluster, and the first thing I do is run the prediction code only (using the previously trained and saved model), I seem to always get the error. But then, when I run the training at least once, and I try to run the prediction notebook again after that, I'm able to obtain predictions. Weird, but happened to me 2 or 3 times already. So it looked like this (I was running notebooks in this order):

predict.ipynb --> error
predict.ipynb --> error
...
predict.ipynb --> error
train_and_predict.ipynb --> success
predict.ipynb --> success
predict.ipynb --> success

But the error also occurs randomly even if the cluster is running all the time and I'm training one model after another. I tried to inspect what was going on with my cluster when the error occurred and found that I got the error while the cluster was automatically downscaling (reducing the number of workers). Though I'm also not sure about this observation, as I examined only a couple of cases.

I would be grateful for any kind of help! I tried to install the 0.10.0-6-4868e8bf-SNAPSHOT version, but it failed during installation:

Error Code: DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: Library resolution failed. 
Cause: java.lang.RuntimeException: unresolved dependency: com.microsoft.ml.spark:mmlspark_2.12:0.10.0-6-4868e8bf-SNAPSHOT: 
not found at com.databricks.libraries.server.MavenInstaller.doDownloadMavenPackages(MavenLibraryResolver.scala:464)

martamaslankowska avatar Aug 05 '22 07:08 martamaslankowska

@martamaslankowska Hi, please add the resolver https://mmlspark.azureedge.net/maven when you try to install version 0.10.0-6-4868e8bf-SNAPSHOT. A quick sample in PySpark would look like:

import pyspark

spark = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.0-6-4868e8bf-SNAPSHOT")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)

And please refer to our website in case you need to install on other platforms.

serena-ruan avatar Aug 05 '22 08:08 serena-ruan

Hi, thank you for the quick response. I was using the resolver already as I followed the Databricks installation steps. So it looked like this in my case:

[screenshot: synapseml-snapshot-installation]

martamaslankowska avatar Aug 05 '22 08:08 martamaslankowska

The coordinate is not correct, please use com.microsoft.azure:synapseml_2.12:0.10.0-6-4868e8bf-SNAPSHOT. We have been renamed to SynapseML 😃

serena-ruan avatar Aug 05 '22 08:08 serena-ruan

Yes, my bad, thank you for clarifying this 😄 I previously used the correct com.microsoft.azure:synapseml_2.12:0.10.0 but for some reason mixed up the coordinates this time.

I was able to successfully install this snapshot and run my predict.ipynb notebook just after starting the cluster without getting the error - so it looks promising. I'll try to train my model a couple of times and let you know whether I'm still encountering this problem.

martamaslankowska avatar Aug 05 '22 08:08 martamaslankowska

Ok, so I've been testing it for the last several hours and it looks like the problem is gone! ❤️

I've run my prediction notebook over 20 times and all runs were successful. I've also run the train_and_predict notebook 20 times (10 times with the same params, and 10 times experimenting with different configurations) and didn't encounter the error anymore. Yesterday I trained 55 models, of which 14 ended with the error (25%), with a maximum of 8 non-error runs in a row. So I consider today's 20 newly trained models without any error a great success 😄

The only issue with the models now is that they aren't deterministic (I trained 10 identical models with all seeds set and the deterministic parameter set to True, and still obtained slightly different results) - but the UnsatisfiedLinkError issue is finally gone, so that's terrific 😊

So I don't know about @timpiperseek, but the com.microsoft.azure:synapseml_2.12:0.10.0-6-4868e8bf-SNAPSHOT version worked for me. Thanks a lot! 😃

martamaslankowska avatar Aug 05 '22 13:08 martamaslankowska

Yes, thank you, it also looks like it is gone for me too.

timpiperseek avatar Aug 07 '22 10:08 timpiperseek

As a side note, is it possible to pickle the model object, or is the only option to use the saveNativeModel method? Just asking because using a pickled object would fit my current workflow better.
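
For context, a minimal sketch of the persistence options discussed here (hedged: saveNativeModel is the method mentioned above, while the Spark ML write/load pair is an assumption based on the fitted model being a regular Spark ML Model; paths are placeholders and exact method availability may differ by version):

from synapse.ml.lightgbm import LightGBMClassificationModel

# Option 1 (assumed): standard Spark ML persistence of the fitted model and its params.
model.write().overwrite().save("dbfs:/models/lgb_classifier")
reloaded = LightGBMClassificationModel.load("dbfs:/models/lgb_classifier")

# Option 2: the native LightGBM text format mentioned above, which plain
# (non-Spark) LightGBM tooling can also read.
model.saveNativeModel("dbfs:/models/lgb_classifier_native")

Plain pickling of the PySpark wrapper tends not to round-trip cleanly, because the trained booster lives on the JVM side rather than in the Python object, which is presumably why the question comes up.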

timpiperseek avatar Aug 07 '22 10:08 timpiperseek

Hi @serena-ruan, I have just one more question regarding this 4868e8bf fix. Are you maybe able to release this version somehow (like a patch version)? Because I'd really like to use LightGBM in my production code but I have concerns about using the snapshot version.

martamaslankowska avatar Aug 09 '22 05:08 martamaslankowska

So the above error is gone, but I suspect that something is still wrong when training on large data sets of 100 million plus rows, though I cannot put a finger on what is going wrong. It may just be related to https://github.com/microsoft/SynapseML/issues/1534. I keep getting either Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(26, 17) finished unsuccessfully. or, if it gets past the fit stage, the evaluation metrics come back terrible.

timpiperseek avatar Aug 09 '22 11:08 timpiperseek

Hi @serena-ruan, I have just one more question regarding this 4868e8bf fix. Are you maybe able to release this version somehow (like a patch version)? Because I'd really like to use LightGBM in my production code but I have concerns about using the snapshot version.

Normally we cut versions at some point; for example, the next version would be v0.10.1, and that should include this fix. Looping in @mhamilton723 (Mark) to see if we want to support a special version by cutting the SNAPSHOT suffix.

serena-ruan avatar Aug 12 '22 03:08 serena-ruan

So the above error is gone, but I suspect that something is still wrong when training on large data sets of 100 million plus rows, though I cannot put a finger on what is going wrong. It may just be related to #1534. I keep getting either Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(26, 17) finished unsuccessfully. or, if it gets past the fit stage, the evaluation metrics come back terrible.

Sorry to hear this; looping in @imatiach-msft and @svotaw, who own the LightGBM area. Could you help take a look? If the issue is similar to #1534 we can move it to that issue. Thanks guys!

serena-ruan avatar Aug 12 '22 03:08 serena-ruan

okay so I think I have found the issue on my end. It is all good now

timpiperseek avatar Aug 12 '22 03:08 timpiperseek

okay so I think I have found the issue on my end. It is all good now

Sounds good!

serena-ruan avatar Aug 12 '22 03:08 serena-ruan

Hey @timpiperseek @martamaslankowska, we cut a new version, 0.10.1. Please let us know if this solves your issues. Thanks for your patience!

mhamilton723 avatar Aug 25 '22 23:08 mhamilton723

we cut a new version 0.10.1, please let us know if this solves your issues

Hi, thanks a lot for the new version! I just checked it out and it seems to be working - so that indeed solves my problem 😊

martamaslankowska avatar Aug 29 '22 07:08 martamaslankowska

awesome, closing this issue for now then. :)

svotaw avatar Sep 12 '22 18:09 svotaw