
Unable to end spark application if executor goes to error state

Open GajaHebbar opened this issue 2 years ago • 2 comments

Create a SparkApplication with a driver and executors. There may be a case where executor memory usage increases, or a config gets deleted from executor storage.

The executor then goes to the Error state and a new executor tries to come up, but this process continues indefinitely. Normal termination/completion of the application is also not possible.

================

sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-36   0/1   Error     0   50s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-37   0/1   Error     0   43s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-38   0/1   Error     0   36s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-39   0/1   Error     0   29s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-4    0/1   Error     0   4m43s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-40   0/1   Error     0   22s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-41   0/1   Error     0   15s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-42   0/1   Error     0   9s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-5    0/1   Error     0   4m35s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-6    0/1   Error     0   4m28s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-7    0/1   Error     0   4m21s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-8    0/1   Error     0   4m14s
sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-9    0/1   Error     0   4m7s
sxfspf01d23449d2849b697a2d891ae2a8752public-driver                            1/1   Running   0   170m

================

kubectl get sparkapplication -n dep-ju334ba5kppx27naah2qeh4hu3pjda
NAME                                          STATUS    ATTEMPTS   START                  FINISH   AGE
sxfspf01d23449d2849b697a2d891ae2a8752public   RUNNING   1          2023-07-25T06:57:09Z            4h33m

==============================

kubectl describe pod sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-6 -n dep-ju334ba5kppx27naah2qeh4hu3pjda
Name:         sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public-2a955f898bd67a78-exec-6
Namespace:    dep-ju334ba5kppx27naah2qeh4hu3pjda
Priority:     0
Node:         10.208.45.215/10.208.45.215
Start Time:   Tue, 25 Jul 2023 09:42:47 +0000
Labels:       spark-app-name=sx-fsp-f01d2344-9d28-49b6-97a2-d891ae2a8752-public
              spark-app-selector=spark-01e15d600ac34711ac73c8b8c0865e2e
              spark-exec-id=6
              spark-exec-inactive=true
              spark-exec-resourceprofile-id=0
              spark-role=executor
              spark-version=3.4.0
              sparkoperator.k8s.io/app-name=sxfspf01d23449d2849b697a2d891ae2a8752public
              sparkoperator.k8s.io/launched-by-spark-operator=true
              sparkoperator.k8s.io/submission-id=f2df417d-abad-49ea-b59b-d52b35d89fcb
              version=3.4.0
Annotations:  cni.projectcalico.org/containerID: b08774e1170a885b9bc94199e918702473575a61990b8d3da6c8f2b2bcd8e482
              cni.projectcalico.org/podIP:
              cni.projectcalico.org/podIPs:
              deploymentID: ocid1.goldengatedeployment.oc1.phx.amaaaaaarhwbfeqa6b3uvitni4iet2ju334ba5kppx27naah2qeh4hu3pjda
Status:       Failed
IP:           10.244.4.68
IPs:
  IP:           10.244.4.68
Controlled By:  Pod/sxfspf01d23449d2849b697a2d891ae2a8752public-driver
Containers:
  spark-kubernetes-executor:
    Container ID:  cri-o://f5b6e8c5c0a498229a10e853dffd5a696b359729941141a988415a647a0f046b
    Image:         us-phoenix-1.ocir.io/axoxdievda5j/ggsa-runtime:21.7.0.0.0_230719.0000-yathishtest1
    Image ID:      us-phoenix-1.ocir.io/axoxdievda5j/ggsa-runtime@sha256:9c0debb3f89ee7d06bc0cc61bc3c466080f59b2508124117761f65c030b4d9a4
    Port:          7079/TCP
    Host Port:     0/TCP
    Args:
      executor
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 25 Jul 2023 09:42:48 +0000
      Finished:     Tue, 25 Jul 2023 09:42:53 +0000
    Ready:          False
    Restart Count:  0

Is there any way to configure the SparkApplication so that, if an executor pod goes to the Error state, the SparkApplication is terminated (just like restartPolicy)?

GajaHebbar avatar Jul 25 '23 11:07 GajaHebbar

@GajaHebbar Spark driver and executor behavior comes from Apache Spark, and the Spark operator does not have any control over it. The driver is the one that spins up executors, and that behavior is controlled by core Apache Spark code. The Spark operator's restartPolicy has nothing to do with it.

Coming to your problem: I have faced something similar. As you might already know, GC is usually caused by either code issues or GC settings. That needs to be figured out.
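Since the operator itself does not react to failed executors, one possible workaround is an external check that watches the pod listing and decides when to give up. The following is only a minimal sketch in Python, not something provided by Spark or the Spark operator: the function names, the `max_failed_executors` threshold, and the shortened pod names in `SAMPLE` are all hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE) with executor pod names containing `-exec-`, as in the listing in this issue.

```python
from collections import Counter


def count_executor_states(listing: str) -> Counter:
    """Count executor pods by STATUS from `kubectl get pods` text output.

    Assumes the default columns: NAME READY STATUS RESTARTS AGE.
    Only pods whose name contains "-exec-" are counted, so the driver
    pod and any header line are ignored.
    """
    states = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5 or "-exec-" not in fields[0]:
            continue
        states[fields[2]] += 1
    return states


def should_stop_application(listing: str, max_failed_executors: int) -> bool:
    """Return True once more executors are in Error than the (hypothetical) threshold allows."""
    return count_executor_states(listing)["Error"] > max_failed_executors


# Shortened, made-up pod names in the same shape as the listing above.
SAMPLE = """\
app-exec-4   0/1  Error    0  4m43s
app-exec-5   0/1  Error    0  4m35s
app-exec-6   0/1  Error    0  4m28s
app-driver   1/1  Running  0  170m
"""

if __name__ == "__main__":
    print(count_executor_states(SAMPLE))
    print(should_stop_application(SAMPLE, max_failed_executors=2))
```

If the check returns True, the application could then be removed manually or by a script, for example with `kubectl delete sparkapplication <name> -n <namespace>`. This only stops the loop; the root cause of the executor failures still has to be found in the executor logs.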

puneetloya avatar Aug 11 '23 04:08 puneetloya

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 14 '24 04:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Sep 03 '24 08:09 github-actions[bot]