SynapseML version
1.0.5
System information
- Language version: Python 3.11, Scala 2.12
- Spark Version: 3.5.0
- Spark Platform: Databricks
Describe the problem
I am training the LightGBM regressor and hitting an error when calling fit with non-default hyperparameters (the fit runs fine with the default hyperparameters), in particular when using the rf boosting type:
Py4JJavaError: An error occurred while calling o10868.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10572.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10572.0 (TID 256860) (172.23.49.171 executor 42): java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:613)
at java.net.Socket.connect(Socket.java:561)
at java.net.Socket.<init>(Socket.java:457)
at java.net.Socket.<init>(Socket.java:234)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:226)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:936)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:936)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:412)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:409)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:376)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:412)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:409)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:376)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:211)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:190)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:155)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:149)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1013)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:106)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1016)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:903)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.$anonfun$failJobAndIndependentStages$1(DAGScheduler.scala:3991)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3989)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3903)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3890)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3890)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1756)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1739)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1739)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:4249)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4152)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4138)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:55)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1403)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1391)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:3124)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$runSparkJobs$1(Collector.scala:303)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:299)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$collect$1(Collector.scala:384)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:381)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:122)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:131)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:94)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:90)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:78)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$1(ResultCacheManager.scala:555)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:546)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$2(ResultCacheManager.scala:561)
at org.apache.spark.sql.execution.adaptive.ResultQueryStageExec.$anonfun$doMaterialize$1(QueryStageExec.scala:663)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1184)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$8(SQLExecution.scala:874)
at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$7(SQLExecution.scala:874)
at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$6(SQLExecution.scala:874)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$5(SQLExecution.scala:873)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$4(SQLExecution.scala:872)
at com.databricks.sql.transaction.tahoe.ConcurrencyHelpers$.withOptimisticTransaction(ConcurrencyHelpers.scala:57)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$3(SQLExecution.scala:871)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:187)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:870)
at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:855)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:134)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.IdentityClaim$.withClaim(IdentityClaim.scala:48)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$4(SparkThreadLocalForwardingThreadPoolExecutor.scala:91)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:90)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:67)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:131)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:134)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:613)
at java.net.Socket.connect(Socket.java:561)
at java.net.Socket.<init>(Socket.java:457)
at java.net.Socket.<init>(Socket.java:234)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:133)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:120)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:115)
at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:111)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:197)
at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:132)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:615)
at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:226)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:936)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:936)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:412)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:409)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:376)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:412)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:409)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:376)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:211)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:190)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:155)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:149)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1013)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:106)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1016)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:903)
... 3 more
File , line 4
1 model_spark = LightGBMRegressor(**model_params)
3 # Fit the LightGBM model to the Spark DataFrame containing features and labels
----> 4 model_fit = model_spark.fit(
5 spark_df_train.select('features', 'label').limit(spark_df_train.count()))
File /databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py:30, in _create_patch_function.<locals>.patched_method(self, *args, **kwargs)
28 call_succeeded = False
29 try:
---> 30 result = original_method(self, *args, **kwargs)
31 call_succeeded = True
32 return result
File /databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py:578, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
576 patch_function.call(call_original, *args, **kwargs)
577 else:
--> 578 patch_function(call_original, *args, **kwargs)
580 session.state = "succeeded"
582 try_log_autologging_event(
583 AutologgingEventLogger.get_logger().log_patch_function_success,
584 session,
(...)
588 kwargs,
589 )
File /databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py:251, in with_managed_run.<locals>.patch_with_managed_run(original, *args, **kwargs)
248 managed_run = create_managed_run()
250 try:
--> 251 result = patch_function(original, *args, **kwargs)
252 except (Exception, KeyboardInterrupt):
253 # In addition to standard Python exceptions, handle keyboard interrupts to ensure
254 # that runs are terminated if a user prematurely interrupts training execution
255 # (e.g. via sigint / ctrl-c)
256 if managed_run:
File /databricks/python/lib/python3.11/site-packages/mlflow/pyspark/ml/__init__.py:1140, in autolog.<locals>.patched_fit(original, self, *args, **kwargs)
1138 if t.should_log():
1139 with _AUTOLOGGING_METRICS_MANAGER.disable_log_post_training_metrics():
-> 1140 fit_result = fit_mlflow(original, self, *args, **kwargs)
1141 # In some cases the fit_result may be an iterator of spark models.
1142 if should_log_post_training_metrics and isinstance(fit_result, Model):
File /databricks/python/lib/python3.11/site-packages/mlflow/pyspark/ml/__init__.py:1126, in autolog.<locals>.fit_mlflow(original, self, *args, **kwargs)
1124 input_training_df = args[0].persist(StorageLevel.MEMORY_AND_DISK)
1125 _log_pretraining_metadata(estimator, params, input_training_df)
-> 1126 spark_model = original(self, *args, **kwargs)
1127 _log_posttraining_metadata(estimator, spark_model, params, input_training_df)
1128 input_training_df.unpersist()
File /databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py:559, in safe_patch.<locals>.safe_patch_function.<locals>.call_original(*og_args, **og_kwargs)
556 original_result = original(*_og_args, **_og_kwargs)
557 return original_result
--> 559 return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
File /databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py:494, in safe_patch.<locals>.safe_patch_function.<locals>.call_original.<locals>.call_original_fn_with_event_logging(original_fn, og_args, og_kwargs)
485 try:
486 try_log_autologging_event(
487 AutologgingEventLogger.get_logger().log_original_function_start,
488 session,
(...)
492 og_kwargs,
493 )
--> 494 original_fn_result = original_fn(*og_args, **og_kwargs)
496 try_log_autologging_event(
497 AutologgingEventLogger.get_logger().log_original_function_success,
498 session,
(...)
502 og_kwargs,
503 )
504 return original_fn_result
File /databricks/python/lib/python3.11/site-packages/mlflow/utils/autologging_utils/safety.py:556, in safe_patch.<locals>.safe_patch_function.<locals>.call_original.<locals>._original_fn(*_og_args, **_og_kwargs)
548 # Show all non-MLflow warnings as normal (i.e. not as event logs)
549 # during original function execution, even if silent mode is enabled
550 # (silent=True), since these warnings originate from the ML framework
551 # or one of its dependencies and are likely relevant to the caller
552 with set_non_mlflow_warnings_behavior_for_current_thread(
553 disable_warnings=False,
554 reroute_warnings=False,
555 ):
--> 556 original_result = original(*_og_args, **_og_kwargs)
557 return original_result
File /databricks/spark/python/pyspark/ml/base.py:203, in Estimator.fit(self, dataset, params)
201 return self.copy(params)._fit(dataset)
202 else:
--> 203 return self._fit(dataset)
204 else:
205 raise TypeError(
206 "Params must be either a param map or a list/tuple of param maps, "
207 "but got %s." % type(params)
208 )
File /local_disk0/spark-856bf487-c525-4555-a333-eafb6a65bc1c/userFiles-44420d13-a4c3-453a-888a-4cffcc721795/com_microsoft_azure_synapseml_lightgbm_2_12_1_0_5.jar/synapse/ml/lightgbm/LightGBMRegressor.py:2105, in LightGBMRegressor._fit(self, dataset)
2104 def _fit(self, dataset):
-> 2105 java_model = self._fit_java(dataset)
2106 return self._create_model(java_model)
File /databricks/spark/python/pyspark/ml/wrapper.py:392, in JavaEstimator._fit_java(self, dataset)
389 assert self._java_obj is not None
391 self._transfer_params_to_java()
--> 392 return self._java_obj.fit(dataset._jdf)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1355, in JavaMember.__call__(self, *args)
1349 command = proto.CALL_COMMAND_NAME +
1350 self.command_header +
1351 args_command +
1352 proto.END_COMMAND_PART
1354 answer = self.gateway_client.send_command(command)
-> 1355 return_value = get_return_value(
1356 answer, self.gateway_client, self.target_id, self.name)
1358 for temp_arg in temp_args:
1359 if hasattr(temp_arg, "_detach"):
File /databricks/spark/python/pyspark/errors/exceptions/captured.py:248, in capture_sql_exception.<locals>.deco(*a, **kw)
245 from py4j.protocol import Py4JJavaError
247 try:
--> 248 return f(*a, **kw)
249 except Py4JJavaError as e:
250 converted = convert_exception(e.java_exception)
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Code to reproduce issue
from synapse.ml.lightgbm import LightGBMRegressor

model_params = {
    'boostingType': 'rf'
}
model_spark = LightGBMRegressor(**model_params)
model_fit = model_spark.fit(
    spark_df_train.select('features', 'label'))
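Note that LightGBM's rf boosting mode expects bagging to be enabled (a baggingFraction below 1.0 and a baggingFreq above 0), and the repro above sets neither, which may be related. For comparison, a minimal sketch of an rf configuration with bagging enabled (the specific values are illustrative assumptions, not a verified fix):

from synapse.ml.lightgbm import LightGBMRegressor

# rf mode requires bagging to be enabled; these values are illustrative.
rf_params = {
    'boostingType': 'rf',
    'baggingFraction': 0.8,
    'baggingFreq': 1,
}
model_spark = LightGBMRegressor(**rf_params)
model_fit = model_spark.fit(
    spark_df_train.select('features', 'label'))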
Other info / logs
No response
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [ ] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [x] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [x] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/synapse: Azure Synapse integrations
- [ ] integrations/azureml: Azure ML integrations
- [x] integrations/databricks: Databricks integrations
Any updates on this issue? I had a similar problem when using hyper-parameter tuning: after some number of trials, it shows
"""
py4j.protocol.Py4JJavaError: An error occurred while calling o29621.fit.
java.net.ConnectException: Connection refused
"""
Same issue on my side. I am combining CrossValidator with synapse.ml.lightgbm; after a long time fitting, it ends with a connection error or the job is aborted due to a stage failure.
I also pinned the Databricks autoscaler to a fixed cluster size, set spark.dynamicAllocation.enabled to false, and split training into batches, but the problem persists. Could an assignee responsible for this area take a look? Thank you so much!
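A condensed sketch of this kind of setup (the grid values, evaluator, and column names are illustrative assumptions; spark.dynamicAllocation.enabled is fixed in the cluster's Spark config rather than at runtime):

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from synapse.ml.lightgbm import LightGBMRegressor

lgbm = LightGBMRegressor(featuresCol='features', labelCol='label')

# Illustrative grid; each combination triggers a separate distributed fit.
grid = (ParamGridBuilder()
        .addGrid(lgbm.numLeaves, [31, 63])
        .addGrid(lgbm.learningRate, [0.05, 0.1])
        .build())

cv = CrossValidator(
    estimator=lgbm,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol='label'),
    numFolds=3)

# Fails intermittently with java.net.ConnectException after many inner fits.
cv_model = cv.fit(spark_df_train.select('features', 'label'))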
@mhamilton723 @memoryz @dylanw-oss @svotaw Hi everyone, could someone take assignment of this ticket and look into it?