[K8S] Executors restart when running the HiBench KMeans workload on AWS EKS with the HiBench small-scale dataset
We are integrating OAP MLlib into cloud environments. When running the HiBench KMeans workload on AWS EKS with the HiBench small-scale dataset, we hit an executor restart issue, although the job still produced a result.
Spark confs and command:
sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
--conf spark.executor.memoryOverhead=512m \
--conf spark.executor.cores=1 \
--conf spark.default.parallelism=2 \
--conf spark.sql.shuffle.partitions=2 \
--conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0/oap_jars/oap-mllib-1.2.0.jar \
--conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0/oap_jars/oap-mllib-1.2.0.jar \
--class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples
Data scale conf:
hibench.kmeans.small.num_of_clusters 5
hibench.kmeans.small.dimensions 20
hibench.kmeans.small.num_of_samples 3000000
hibench.kmeans.small.samples_per_inputfile 600000
hibench.kmeans.small.max_iteration 5
hibench.kmeans.small.k 10
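For reference, this conf implies roughly 3,000,000 samples × 20 dimensions × 8 bytes per double ≈ 458 MiB of dense feature data in memory (assuming double-precision values), consistent with the "less than 1 GB" estimate given later in this thread.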
Driver log:
21/07/30 13:01:25 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on 192.168.81.251:46745 (size: 16.8 KiB, free: 107.0 MiB)
21/07/30 13:02:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1.
21/07/30 13:02:25 INFO DAGScheduler: Executor lost: 1 (epoch 1)
21/07/30 13:02:25 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 107) in 59757 ms on 192.168.32.242 (executor 2) (1/2)
21/07/30 13:02:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_6 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_37_8 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_9 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_15 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_3 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_37_14 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_11 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_37_16 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_14 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_1 !
...
21/07/30 06:37:18 INFO DAGScheduler: Job 10 finished: collect at ClusteringSummary.scala:49, took 124.838207 s
21/07/30 06:37:18 INFO CodeGenerator: Code generated in 17.830358 ms
21/07/30 06:37:18 INFO Instrumentation: [d644dcde] {"clusterSizes":"[210265,121453,427029,123869,684244,207123,389735,600544,118074,117664]"}
21/07/30 06:37:18 INFO Instrumentation: [d644dcde] training finished
Training time (ms) = 744550
Total cost = 1.3657241593861768E11.
21/07/30 06:37:18 INFO SparkUI: Stopped Spark web UI at http://spark-client-headless-service.default.svc.cluster.local:4040
But when vanilla Spark runs this workload with the same confs, everything is fine.
Vanilla Spark command:
sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
--conf spark.executor.memoryOverhead=512m \
--conf spark.executor.cores=1 \
--conf spark.default.parallelism=2 \
--conf spark.sql.shuffle.partitions=2 \
--class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples
Vanilla Spark running log:
21/08/02 07:18:45 INFO TaskSetManager: Finished task 16.0 in stage 14.0 (TID 187) in 190 ms on 192.168.57.153 (executor 1) (17/20)
21/08/02 07:18:45 INFO TaskSetManager: Finished task 18.0 in stage 14.0 (TID 188) in 138 ms on 192.168.85.148 (executor 2) (18/20)
21/08/02 07:18:45 INFO TaskSetManager: Starting task 19.0 in stage 14.0 (TID 190) (192.168.57.153, executor 1, partition 19, PROCESS_LOCAL, 4572 bytes) taskResourceAssignments Map()
21/08/02 07:18:45 INFO TaskSetManager: Finished task 17.0 in stage 14.0 (TID 189) in 211 ms on 192.168.57.153 (executor 1) (19/20)
21/08/02 07:18:45 INFO TaskSetManager: Finished task 19.0 in stage 14.0 (TID 190) in 105 ms on 192.168.57.153 (executor 1) (20/20)
21/08/02 07:18:45 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
21/08/02 07:18:45 INFO DAGScheduler: ShuffleMapStage 14 (collect at ClusteringSummary.scala:49) finished in 2.552 s
21/08/02 07:18:45 INFO DAGScheduler: looking for newly runnable stages
21/08/02 07:18:45 INFO DAGScheduler: running: Set()
21/08/02 07:18:45 INFO DAGScheduler: waiting: Set(ResultStage 15)
21/08/02 07:18:45 INFO DAGScheduler: failed: Set()
21/08/02 07:18:45 INFO DAGScheduler: Submitting ResultStage 15 (MapPartitionsRDD[50] at collect at ClusteringSummary.scala:49), which has no missing parents
21/08/02 07:18:45 INFO MemoryStore: Block broadcast_26 stored as values in memory (estimated size 50.5 KiB, free 413.4 MiB)
21/08/02 07:18:45 INFO MemoryStore: Block broadcast_26_piece0 stored as bytes in memory (estimated size 22.3 KiB, free 413.4 MiB)
21/08/02 07:18:45 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on spark-client-headless-service.default.svc.cluster.local:33081 (size: 22.3 KiB, free: 413.8 MiB)
21/08/02 07:18:45 INFO SparkContext: Created broadcast 26 from broadcast at DAGScheduler.scala:1383
21/08/02 07:18:45 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 15 (MapPartitionsRDD[50] at collect at ClusteringSummary.scala:49) (first 15 tasks are for partitions Vector(0, 1))
21/08/02 07:18:45 INFO TaskSchedulerImpl: Adding task set 15.0 with 2 tasks resource profile 0
21/08/02 07:18:45 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 191) (192.168.57.153, executor 1, partition 0, NODE_LOCAL, 4465 bytes) taskResourceAssignments Map()
21/08/02 07:18:45 INFO TaskSetManager: Starting task 1.0 in stage 15.0 (TID 192) (192.168.85.148, executor 2, partition 1, NODE_LOCAL, 4465 bytes) taskResourceAssignments Map()
21/08/02 07:18:45 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on 192.168.85.148:36707 (size: 22.3 KiB, free: 91.2 MiB)
21/08/02 07:18:45 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on 192.168.57.153:45589 (size: 22.3 KiB, free: 107.0 MiB)
21/08/02 07:18:45 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to 192.168.85.148:51316
21/08/02 07:18:45 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to 192.168.57.153:55576
21/08/02 07:18:46 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 191) in 76 ms on 192.168.57.153 (executor 1) (1/2)
21/08/02 07:18:46 INFO TaskSetManager: Finished task 1.0 in stage 15.0 (TID 192) in 87 ms on 192.168.85.148 (executor 2) (2/2)
21/08/02 07:18:46 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
21/08/02 07:18:46 INFO DAGScheduler: ResultStage 15 (collect at ClusteringSummary.scala:49) finished in 0.140 s
21/08/02 07:18:46 INFO DAGScheduler: Job 8 is finished. Cancelling potential speculative or zombie tasks for this job
21/08/02 07:18:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 15: Stage finished
21/08/02 07:18:46 INFO DAGScheduler: Job 8 finished: collect at ClusteringSummary.scala:49, took 2.736976 s
21/08/02 07:18:46 INFO CodeGenerator: Code generated in 12.199455 ms
21/08/02 07:18:46 INFO Instrumentation: [825afca3] {"clusterSizes":"[286782,121453,119044,123869,600000,313218,600000,600000,118074,117560]"}
21/08/02 07:18:46 INFO Instrumentation: [825afca3] training finished
Training time (ms) = 22674
Total cost = 1.9109701621924158E11.
21/08/02 07:18:46 INFO SparkUI: Stopped Spark web UI at http://spark-client-headless-service.default.svc.cluster.local:4040
@xwu99 @carsonwang, please take a look when you are available.
Thanks
What version are you testing? The master branch has some GPU library dependencies that I am working to fix in #111. Are you using K8S? The log line "KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1." suggests so. We didn't test on K8S, and I am not sure whether the classpath needs to be set differently there. Could you also send the stdout/stderr info?
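One thing worth trying, as a sketch I have not verified on K8S: let spark-submit distribute the jar itself via --jars instead of pointing extraClassPath at a path baked into the image, e.g.:
sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
--conf spark.executor.memoryOverhead=512m \
--conf spark.executor.cores=1 \
--conf spark.default.parallelism=2 \
--conf spark.sql.shuffle.partitions=2 \
--jars /opt/home/conda/envs/oap-1.2.0/oap_jars/oap-mllib-1.2.0.jar \
--class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples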
@zhixingheyi-tian Please try the latest master branch to see if everything is OK. Also @haojinIntel, note that you need to use ./build.sh -p spark-3.1.1 to build; there is an additional -p compared to the old version. Please check the updated README for the dependency versions. The CPU-only version is built by default.
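For example, assuming build.sh sits next to the pom under mllib-dal (adjust the path to your checkout):
cd oap-mllib/mllib-dal
./build.sh -p spark-3.1.1   # the CPU-only version is built by default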
@xwu99
Our team's cloud integration testing (covering EMR, Dataproc, EKS, etc.) is against branch-1.2; cloud integration is also a target of the 1.2 release. @yao531441's compile script already includes -Pspark-3.1.1:
oap-mllib )
cd $OAP_HOME/oap-mllib/mllib-dal
source /opt/intel/oneapi/setvars.sh
source /tmp/oneCCL/build/_install/env/setvars.sh
mvn clean package -Dmaven.test.skip=true -Pspark-3.1.1
;;
"EKS" just is Amazon Elastic Kubernetes Service. We used K8S scheduler to launch Spark Executor. From the "KMeansDAL" log, the oap-mllib jars have enabled in classpath.
The driver log is shown above.
The executor log before the executor was killed is below:
21/08/02 07:37:27 INFO CoarseGrainedExecutorBackend: Got assigned task 106
21/08/02 07:37:27 INFO Executor: Running task 1.0 in stage 9.0 (TID 106)
21/08/02 07:37:27 INFO TorrentBroadcast: Started reading broadcast variable 10 with 1 pieces (estimated total size 4.0 MiB)
21/08/02 07:37:27 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 16.8 KiB, free 35.9 MiB)
21/08/02 07:37:27 INFO TorrentBroadcast: Reading broadcast variable 10 took 20 ms
21/08/02 07:37:27 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 40.8 KiB, free 35.8 MiB)
21/08/02 07:37:27 INFO BlockManager: Found block rdd_39_1 locally
21/08/02 07:37:27 INFO OneCCL: Initializing with IP_PORT: 192.168.82.1_3000
OneCCL (native): init
2021:08:02-07:37:27:(43) |WARN| hostname is quite long, len: 63, name: sinput3000000samples105memoryonlyrandom-9ae02e7b05c8a699-exec-1
KVS get error: timeout limit (60 > 60), prefix: CCL_POD_ADDR1, key atl-ofi-hostname-0
I suspect there may be some limitations on the Spark configurations, for example "executor.memory". The test dataset is the HiBench small scale, less than 1 GB. Does OAP MLlib need specific conf items for small datasets?
@zhixingheyi-tian From the log, it's not related to memory. The oneCCL ranks are not able to connect to rank 0, which is the executor listening on 192.168.82.1 port 3000, so there is some network issue: maybe the port is blocked or the IP is not reachable. There is also a warning: |WARN| hostname is quite long, len: 63, name: sinput3000000samples105memoryonlyrandom-9ae02e7b05c8a699-exec-1. I am not sure whether a bad hostname resolution is the cause. Could you check the network-related configs (DNS, hostname, firewall, routing, etc.)?
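For example, using the IP/port and hostname from the log above (pod name is a placeholder; getent and nc must be available in the executor image):
kubectl exec -it <executor-pod> -- getent hosts sinput3000000samples105memoryonlyrandom-9ae02e7b05c8a699-exec-1   # hostname resolution
kubectl exec -it <executor-pod> -- nc -zv 192.168.82.1 3000   # reachability of the rank 0 port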
BTW, it's encouraged to use build.sh instead of invoking mvn directly; some additional environment variables need to be set before calling mvn.
@xwu99 As a baseline, we also ran vanilla Spark with the same configs and the same workload; the run was smooth and quick (see the earlier description). So the network issue can be excluded.
Add "spark.shuffle.reduceLocality.enabled=false" conf item, We verify two cases: two executors, and one executors, still exist issues.
Two executors case, encounter hang issue, on both driver and executor sides.
sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
--conf spark.executor.memoryOverhead=512m \
--conf spark.executor.cores=1 \
--conf spark.default.parallelism=2 \
--conf spark.sql.shuffle.partitions=2 \
--conf spark.shuffle.reduceLocality.enabled=false \
--conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
--conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
--class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/tools/HiBench/sparkbench/assembly/target/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples
Driver hang log:
21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_9_piece0 on spark-client-headless-service.default.svc.cluster.local:44815 in memory (size: 15.0 KiB, free: 413.9 MiB)
21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 192.168.47.215:42623 in memory (size: 1278.0 B, free: 111.0 MiB)
21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_8_piece0 on spark-client-headless-service.default.svc.cluster.local:44815 in memory (size: 1278.0 B, free: 413.9 MiB)
21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 192.168.11.77:41423 in memory (size: 1278.0 B, free: 87.3 MiB)
21/08/10 08:32:18 INFO BlockManagerInfo: Removed broadcast_1_piece0 on spark-client-headless-service.default.svc.cluster.local:44815 in memory (size: 11.4 KiB, free: 413.9 MiB)
21/08/10 08:32:18 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.47.215:42623 on disk (size: 11.4 KiB)
21/08/10 08:32:18 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.11.77:41423 on disk (size: 11.4 KiB)
Executor hang log:
21/08/10 08:05:50 INFO Executor: Running task 1.0 in stage 9.0 (TID 107)
21/08/10 08:05:50 INFO TorrentBroadcast: Started reading broadcast variable 10 with 1 pieces (estimated total size 4.0 MiB)
21/08/10 08:05:50 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 16.8 KiB, free 110.5 MiB)
21/08/10 08:05:50 INFO TorrentBroadcast: Reading broadcast variable 10 took 13 ms
21/08/10 08:05:50 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 40.9 KiB, free 110.5 MiB)
21/08/10 08:05:50 INFO BlockManager: Found block rdd_39_1 locally
21/08/10 08:05:50 INFO OneCCL: Initializing with IP_PORT: 192.168.47.215_3000
OneCCL (native): init
In the one-executor case, the executor restarted without any error log; it restarted only once, and the job completed with the second executor.
sh ./spark-submit-client.sh --conf spark.executor.instances=1 \
--conf spark.executor.memoryOverhead=512m \
--conf spark.executor.cores=1 \
--conf spark.default.parallelism=1 \
--conf spark.sql.shuffle.partitions=1 \
--conf spark.shuffle.reduceLocality.enabled=false \
--conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
--conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
--class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/tools/HiBench/sparkbench/assembly/target/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples
Driver log:
21/08/10 07:31:38 INFO TaskSetManager: Finished task 11.0 in stage 6.0 (TID 100) in 5095 ms on 192.168.5.51 (executor 2) (18/20)
21/08/10 07:31:43 INFO BlockManagerInfo: Added rdd_6_15 in memory on 192.168.5.51:43073 (size: 19.7 MiB, free: 142.4 MiB)
21/08/10 07:31:44 INFO BlockManagerInfo: Added rdd_37_15 in memory on 192.168.5.51:43073 (size: 24.0 B, free: 142.4 MiB)
21/08/10 07:31:44 INFO TaskSetManager: Finished task 15.0 in stage 6.0 (TID 101) in 5313 ms on 192.168.5.51 (executor 2) (19/20)
21/08/10 07:32:44 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 137201 ms exceeds timeout 120000 ms
21/08/10 07:32:44 INFO KubernetesClusterSchedulerBackend: Requesting to kill executor(s) 1
21/08/10 07:32:44 INFO KubernetesClusterSchedulerBackend: Actual list of executor(s) to be killed is 1
21/08/10 07:32:44 INFO TaskSchedulerImpl: Executor 1 on 192.168.11.77 killed by driver.
21/08/10 07:32:44 INFO TaskSetManager: task 10.0 in stage 6.0 (TID 92) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
21/08/10 07:32:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
21/08/10 07:32:44 INFO BlockManagerMaster: Removal of executor 1 requested
21/08/10 07:32:44 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 1
21/08/10 07:32:44 INFO TaskSetManager: Starting task 10.1 in stage 6.0 (TID 102) (192.168.5.51, executor 2, partition 10, ANY, 4583 bytes) taskResourceAssignments Map()
The executor exited without any error; its log is below, and a couple of Kubernetes-side checks that could surface the restart reason follow it:
21/08/10 07:30:32 INFO MemoryStore: Block rdd_37_8 stored as values in memory (estimated size 24.0 B, free 16.1 MiB)
21/08/10 07:30:32 INFO Executor: Finished task 8.0 in stage 6.0 (TID 90). 1552 bytes result sent to driver
21/08/10 07:30:32 INFO CoarseGrainedExecutorBackend: Got assigned task 91
21/08/10 07:30:32 INFO Executor: Running task 9.0 in stage 6.0 (TID 91)
21/08/10 07:30:32 INFO BlockManager: Found block rdd_6_9 locally
21/08/10 07:30:32 INFO OneDAL: Partition index: 9, numCols: 20, numRows: 168405
21/08/10 07:30:32 INFO OneDAL: Loading native libraries
21/08/10 07:30:32 INFO MemoryStore: Block rdd_37_9 stored as values in memory (estimated size 24.0 B, free 16.1 MiB)
21/08/10 07:30:32 INFO Executor: Finished task 9.0 in stage 6.0 (TID 91). 1552 bytes result sent to driver
21/08/10 07:30:32 INFO CoarseGrainedExecutorBackend: Got assigned task 92
21/08/10 07:30:32 INFO Executor: Running task 10.0 in stage 6.0 (TID 92)
21/08/10 07:30:32 INFO BlockManager: Found block rdd_6_10 locally
21/08/10 07:30:32 INFO OneDAL: Partition index: 10, numCols: 20, numRows: 168920
21/08/10 07:30:32 INFO OneDAL: Loading native libraries
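Since the executor's own log shows no error, the restart reason may only be recorded on the Kubernetes side; two checks worth capturing (pod name is a placeholder):
kubectl logs <executor-pod> --previous    # previous container's output, if the container restarted in place
kubectl describe pod <executor-pod>       # Last State / exit code, e.g. OOMKilled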
cc @xwu99 @carsonwang
We added another test to shorten the executor hostname (flagged as len: 63 in the earlier warning) by setting "spark.kubernetes.executor.podNamePrefix=oapmllib":
sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
--name oap-mllib \
--conf spark.kubernetes.executor.podNamePrefix=oapmllib \
--conf spark.executor.memoryOverhead=512m \
--conf spark.executor.cores=1 \
--conf spark.default.parallelism=2 \
--conf spark.sql.shuffle.partitions=2 \
--conf spark.shuffle.reduceLocality.enabled=false \
--conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
--conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
--class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/tools/HiBench/sparkbench/assembly/target/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples
The executor still restarted:
21/08/11 06:45:28 INFO TorrentBroadcast: Reading broadcast variable 10 took 26 ms
21/08/11 06:45:28 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 40.8 KiB, free 161.5 MiB)
21/08/11 06:45:28 INFO BlockManager: Found block rdd_39_0 locally
21/08/11 06:45:28 INFO OneCCL: Initializing with IP_PORT: 192.168.60.79_3000
OneCCL (native): init
"client: barrier read data": read: can not process all data, size 4, shift 0