airflow icon indicating copy to clipboard operation
airflow copied to clipboard

SparkSubmitOperator not mark task as success after spark job complete job

Open dannyeuu opened this issue 1 year ago • 2 comments
trafficstars

Apache Airflow Provider(s)

apache-spark

Versions of Apache Airflow Providers

apache-airflow-providers-apache-hive==6.1.2 apache-airflow-providers-apache-spark==4.1.1 apache-airflow-providers-cncf-kubernetes==7.3.0 apache-airflow-providers-common-sql==1.6.0

Apache Airflow version

2.6.3

Operating System

Debian GNU/Linux 11 (bullseye)

Deployment

Official Apache Airflow Helm Chart

Deployment details

Dockerfile FROM image apache/airflow:2.6.3-python3.10 airflow helm Chart

What happened

We run lots of jobs every day, but every day one or two tasks the Spark complete the job, but the task still keeps in running state with no new logs, until manually change the state t success.

Task with wrong behavior, the logs keep like this:

[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:13 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - phase: Running
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:14 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - phase: Succeeded
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Succeeded)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Container final statuses:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application reverent-mcclintock with submission ID spark:reverent-mcclintock-2cf0078f5caa813e-driver finished

What you think should happen instead

Expected behavior: a normal task will have the taskinstance.py part marking the task as success :

[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:24:04Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:07Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:08 INFO LoggingPodStatusWatcherImpl: Application elegant-rubin with submission ID spark:elegant-rubin-91cab58f5caaaab2-driver finished
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Shutdown hook called
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-175bae5e-77d3-45b1-aa3c-cc88bdd8a462
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-70c6860c-1925-4858-be1f-23e9c6d5723a
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-236d1121-2075-4b6e-bb54-2e3e66decd6a
[2024-05-09, 06:25:09 -03] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=cluster_ira_ciss_daily_3irmaos, task_id=load.load_competitor_store_3irmaos, execution_date=20240508T070000, start_date=20240509T092347, end_date=20240509T092509
[2024-05-09, 06:25:09 -03] {local_task_job_runner.py:225} INFO - Task exited with return code 0
[2024-05-09, 06:25:09 -03] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check

How to reproduce

It's totally random, happens in different dags with different tasks

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

dannyeuu avatar May 09 '24 14:05 dannyeuu

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

boring-cyborg[bot] avatar May 09 '24 14:05 boring-cyborg[bot]

Can you see if this is occurring on the latest version of the spark provider package (4.8.0)?

RNHTTR avatar May 09 '24 15:05 RNHTTR

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions[bot] avatar May 24 '24 00:05 github-actions[bot]

This issue has been closed because it has not received response from the issue author.

github-actions[bot] avatar Jun 01 '24 00:06 github-actions[bot]