
[LightGBM] java.util.NoSuchElementException with ranker.fit()

Open · jovis-gnn opened this issue 2 years ago · 5 comments

SynapseML version

0.10.2

System information

  • Language version (e.g. python 3.8, scala 2.12):
  • Spark Version (e.g. 3.2.3): 3.3.1
  • Spark Platform (e.g. Synapse, Databricks): EMR

Describe the problem

I'm testing LightGBM on an EMR cluster. I created a sample dataset and tried to fit it with the LightGBMRanker model. I got some errors, and the job seems to have a problem collecting the dataset. Please give me some feedback if you have any ideas...

Thank you.

Code to reproduce issue

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = (
    SparkSession.builder.appName("jovis")
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .enableHiveSupport()
    .getOrCreate()
)

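# Sample ranking data: three features, a group id, a label, and a
# validation indicator. Note the 'validation' column is all False here;
# per the discussion below, that is what triggers the exception.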
train_df = spark.createDataFrame(
    [
        [1.0, 1.0, 1.0, 0, 0.0, False],
        [1.0, 2.0, 2.0, 0, 1.0, False],
        [1.0, 6.0, 13.0, 0, 0.0, False],
        [1.0, 3.0, 14.0, 0, 1.0, False],
        [1.0, 3.0, 12.0, 0, 0.0, False],
        [1.0, 8.0, 6.0, 0, 1.0, False],
        [1.0, 4.0, 4.0, 1, 1.0, False],
        [1.0, 3.0, 8.0, 1, 0.0, False],
        [1.0, 4.0, 3.0, 1, 0.0, False],
        [1.0, 7.0, 2.0, 1, 0.0, False],
        [1.0, 2.0, 1.0, 1, 0.0, False]
    ], 
    ['feat_1', 'feat_2', 'feat_3', 'group_id', 'label', 'validation']
)
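# Assemble the feature columns into the single vector column LightGBM expects.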
featurizer = VectorAssembler(inputCols=['feat_1', 'feat_2', 'feat_3'], outputCol="features")
train_df = featurizer.transform(train_df)

from synapse.ml.lightgbm import LightGBMRanker
ranker = LightGBMRanker(
    labelCol="label",
    featuresCol="features",
    groupCol="group_id",
    validationIndicatorCol="validation",
    objective="lambdarank",
    numLeaves=31,
    numIterations=200,
    metric="map",
    boostingType="gbdt",
    evalAt=[1, 5, 10],
    earlyStoppingRound=10
)

ranker.fit(train_df)

Other info / logs

23/06/02 06:45:41 WARN TaskSetManager: Lost task 0.0 in stage 30.0 (TID 183) (ip-172-31-143-99.ap-northeast-2.compute.internal executor 3): java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at com.microsoft.azure.synapse.ml.lightgbm.dataset.PeekingIterator.peek(DatasetAggregator.scala:113)
	at com.microsoft.azure.synapse.ml.lightgbm.dataset.BaseChunkedColumns.<init>(DatasetAggregator.scala:130)
	at com.microsoft.azure.synapse.ml.lightgbm.dataset.DenseChunkedColumns.<init>(DatasetAggregator.scala:217)
	at com.microsoft.azure.synapse.ml.lightgbm.BulkPartitionTask.getChunkedColumns(BulkPartitionTask.scala:76)
	at com.microsoft.azure.synapse.ml.lightgbm.BulkPartitionTask.$anonfun$preparePartitionDataInternal$1(BulkPartitionTask.scala:47)
	at scala.Option.map(Option.scala:230)
	at com.microsoft.azure.synapse.ml.lightgbm.BulkPartitionTask.preparePartitionDataInternal(BulkPartitionTask.scala:46)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.preparePartitionData(BasePartitionTask.scala:210)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:121)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
	at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:138)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

23/06/02 06:45:42 WARN TaskSetManager: Lost task 0.1 in stage 30.0 (TID 184) (ip-172-31-143-99.ap-northeast-2.compute.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at java.net.Socket.connect(Socket.java:556)
	at java.net.Socket.<init>(Socket.java:452)
	at java.net.Socket.<init>(Socket.java:229)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:129)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:116)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:111)
	at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:107)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:179)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:114)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
	at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:138)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

23/06/02 06:45:44 ERROR TaskSetManager: Task 0 in stage 30.0 failed 4 times; aborting job
23/06/02 06:45:44 ERROR LightGBMRanker: {"buildVersion":"0.10.2","className":"class com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker","method":"train","uid":"LightGBMRanker_dfa33d4f60b4"}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 4 times, most recent failure: Lost task 0.3 in stage 30.0 (TID 186) (ip-172-31-143-99.ap-northeast-2.compute.internal executor 3): java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at java.net.Socket.connect(Socket.java:556)
	at java.net.Socket.<init>(Socket.java:452)
	at java.net.Socket.<init>(Socket.java:229)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:129)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:116)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:111)
	at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28)
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:107)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:179)
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:114)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
	at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:138)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.15.jar:?]
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[scala-library-2.12.15.jar:?]
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.15.jar:?]
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2229) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2250) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2269) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2294) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:441) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:483) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:522) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:483) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3932) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3161) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3922) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:554) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3920) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3920) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:3161) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:597) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks$(LightGBMBase.scala:583) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.executePartitionTasks(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:573) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining$(LightGBMBase.scala:545) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.executeTraining(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:435) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch$(LightGBMBase.scala:392) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.trainOneDataBatch(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:61) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb(BasicLogging.scala:62) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb$(BasicLogging.scala:59) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.logVerb(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain(BasicLogging.scala:48) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain$(BasicLogging.scala:47) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.logTrain(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:42) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:35) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMRanker.train(LightGBMRanker.scala:26) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151) ~[spark-mllib_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_372]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_372]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_372]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_372]
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) ~[py4j-0.10.9.5.jar:?]
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) ~[py4j-0.10.9.5.jar:?]
	at py4j.Gateway.invoke(Gateway.java:282) ~[py4j-0.10.9.5.jar:?]
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) ~[py4j-0.10.9.5.jar:?]
	at py4j.commands.CallCommand.execute(CallCommand.java:79) ~[py4j-0.10.9.5.jar:?]
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) ~[py4j-0.10.9.5.jar:?]
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106) ~[py4j-0.10.9.5.jar:?]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_372]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_372]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_372]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_372]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_372]
	at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_372]
	at java.net.Socket.connect(Socket.java:556) ~[?:1.8.0_372]
	at java.net.Socket.<init>(Socket.java:452) ~[?:1.8.0_372]
	at java.net.Socket.<init>(Socket.java:229) ~[?:1.8.0_372]
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getNetworkTopologyInfoFromDriver(NetworkManager.scala:129) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$2(NetworkManager.scala:116) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:24) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.core.utils.FaultToleranceUtils$.retryWithTimeout(FaultToleranceUtils.scala:29) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.$anonfun$getGlobalNetworkInfo$1(NetworkManager.scala:111) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.core.env.StreamUtilities$.using(StreamUtilities.scala:28) ~[com.microsoft.azure_synapseml-core_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.getGlobalNetworkInfo(NetworkManager.scala:107) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.initialize(BasePartitionTask.scala:179) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:114) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589) ~[com.microsoft.azure_synapseml-lightgbm_2.12-0.10.2.jar:0.10.2]
	at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201) ~[spark-sql_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.1-amzn-0.jar:3.3.1-amzn-0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_372]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_372]
	... 1 more

What component(s) does this bug affect?

  • [ ] area/cognitive: Cognitive project
  • [ ] area/core: Core project
  • [ ] area/deep-learning: DeepLearning project
  • [ ] area/lightgbm: Lightgbm project
  • [ ] area/opencv: Opencv project
  • [ ] area/vw: VW project
  • [ ] area/website: Website
  • [ ] area/build: Project build system
  • [ ] area/notebooks: Samples under notebooks folder
  • [ ] area/docker: Docker usage
  • [ ] area/models: models related issue

What language(s) does this bug affect?

  • [ ] language/scala: Scala source code
  • [ ] language/python: Pyspark APIs
  • [ ] language/r: R APIs
  • [ ] language/csharp: .NET APIs
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/synapse: Azure Synapse integrations
  • [ ] integrations/azureml: Azure ML integrations
  • [ ] integrations/databricks: Databricks integrations

jovis-gnn · Jun 02 '23 07:06

Hey @jovis-gnn 👋! Thank you so much for reporting the issue/feature request 🚨. Someone from the SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

github-actions[bot] · Jun 02 '23 07:06

Thanks, jovis-gnn, for reporting this.

The NoSuchElementException appears to be the result of failing to compute an RDD partition (or read it from a checkpoint) because the attempt to connect to the Spark driver fails with java.net.ConnectException. Can you please investigate along those lines on the EMR cluster? In the meantime, we will look into this further, since it also appears possible to handle the second half of this scenario on our end.

@svotaw, can you also please take a look? It looks like we need to handle NoSuchElementException in the DataAggregator.

saileshbaidya · Jun 05 '23 06:06

@saileshbaidya Thanks for the reply. I found the cause of this exception: the validation column of my sample data contained no True values, yet I had specified validationIndicatorCol, so the LightGBM module threw NoSuchElementException. After changing some of those values to True, the error was resolved.
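For reference, a minimal sketch of the two ways to avoid this against the repro above (which rows are flipped to True is arbitrary here; any nonempty validation subset would do):

from synapse.ml.lightgbm import LightGBMRanker

# Option 1: mark at least one row as validation data so the indicator
# column is not uniformly False (the row choice below is illustrative).
rows = [
    [1.0, 1.0, 1.0, 0, 0.0, False],
    [1.0, 2.0, 2.0, 0, 1.0, True],   # now a validation row
    [1.0, 4.0, 4.0, 1, 1.0, False],
    [1.0, 3.0, 8.0, 1, 0.0, True],   # now a validation row
]
train_df = spark.createDataFrame(
    rows, ['feat_1', 'feat_2', 'feat_3', 'group_id', 'label', 'validation']
)

# Option 2: if no validation set is wanted, omit validationIndicatorCol
# (and earlyStoppingRound, which needs a validation set to act on).
ranker = LightGBMRanker(
    labelCol="label",
    featuresCol="features",
    groupCol="group_id",
    objective="lambdarank",
)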

By the way, after fixing that, the data-collecting stage hung for a long time (it never finished), but after repartitioning the training set to 1 partition, it worked. Could you help me understand why (is there a minimum or maximum number of partitions for training, or something similar)?

jovis-gnn · Jun 06 '23 15:06

The DataAggregator is being deprecated, so we won't change that. The newer "streaming" mode is available in our latest releases; please ask for a copy if you want to try it (there is no official version with the latest fixes yet).
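A hedged sketch of opting into streaming mode on a release that supports it; the executionMode parameter name here is an assumption (it has varied across releases), so check the API of your version:

from synapse.ml.lightgbm import LightGBMRanker

ranker = LightGBMRanker(
    labelCol="label",
    featuresCol="features",
    groupCol="group_id",
    objective="lambdarank",
    executionMode="streaming",  # assumed parameter name; default mode is "bulk"
)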

The LightGBM algorithm does not work with auto-scaled clusters, so please turn off any autoscaling. It also helps to set "spark.dynamicAllocation.enabled": "false".
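For example, a minimal sketch of setting that flag when building the session (otherwise mirroring the builder in the repro above):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("jovis")
    .config("spark.dynamicAllocation.enabled", "false")  # fixed executor count
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.2")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)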

You are likely hitting scaling problems that affect networking (and look like hangs). By repartitioning to 1, you remove networking entirely (only one node is used). With the version you have, you can also try smaller partition counts to reduce hangs.
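A sketch of that workaround applied to the repro; the partition count is a tuning knob, with 1 eliminating inter-node networking entirely:

# Fewer partitions mean fewer LightGBM workers to network together;
# repartition(1) trains on a single worker and sidesteps networking.
ranker.fit(train_df.repartition(1))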

svotaw · Jul 10 '23 18:07

We have released 0.11.2, which includes the final streaming features.

svotaw · Jul 17 '23 19:07