SparkSubmitOperator does not mark task as success after the Spark job completes
Apache Airflow Provider(s)
apache-spark
Versions of Apache Airflow Providers
apache-airflow-providers-apache-hive==6.1.2
apache-airflow-providers-apache-spark==4.1.1
apache-airflow-providers-cncf-kubernetes==7.3.0
apache-airflow-providers-common-sql==1.6.0
Apache Airflow version
2.6.3
Operating System
Debian GNU/Linux 11 (bullseye)
Deployment
Official Apache Airflow Helm Chart
Deployment details
Dockerfile FROM image apache/airflow:2.6.3-python3.10, deployed with the official Airflow Helm chart.
What happened
We run lots of jobs every day, and every day for one or two tasks Spark completes the job, but the Airflow task stays in the running state with no new logs until we manually change its state to success.
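For context, here is a simplified sketch (not the actual provider code) of how a SparkSubmitOperator-style operator behaves: it blocks on the spark-submit subprocess, tails its stdout line by line into the task log, and only finishes once the process exits. If spark-submit itself never exits, for example because its Kubernetes pod watcher keeps waiting after the driver pod has already succeeded, the loop below never returns and the Airflow task stays "running" with no new log lines. The stand-in command here is hypothetical; the real operator launches spark-submit.

```python
import subprocess
import sys

def run_and_tail(cmd):
    """Spawn a command and stream its stdout, similar in spirit to how
    SparkSubmitHook tails spark-submit output. Returns only after the
    process exits, so a spark-submit that hangs keeps the task running."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:   # blocks until the process closes stdout
        print(line.rstrip())   # these become the INFO lines in the task log
    return proc.wait()         # the exit code decides task success/failure

# Stand-in for spark-submit: prints a couple of status lines, then exits 0.
rc = run_and_tail(
    [sys.executable, "-c", "print('phase: Running'); print('phase: Succeeded')"]
)
print("exit code:", rc)
```

This matches what the logs above show: the driver container is already "terminated / Completed", yet new watcher lines stop arriving and the task never reaches the "Marking task as SUCCESS" step, which is consistent with the child process (or its output stream) never finishing.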
For a task with this wrong behavior, the log ends like this:
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:13 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - phase: Running
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:14 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - phase: Succeeded
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Succeeded)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Container final statuses:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO -
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application reverent-mcclintock with submission ID spark:reverent-mcclintock-2cf0078f5caa813e-driver finished
What you think should happen instead
Expected behavior: in a normal task, the log ends with taskinstance.py marking the task as SUCCESS:
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:24:04Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:07Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:08 INFO LoggingPodStatusWatcherImpl: Application elegant-rubin with submission ID spark:elegant-rubin-91cab58f5caaaab2-driver finished
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Shutdown hook called
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-175bae5e-77d3-45b1-aa3c-cc88bdd8a462
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-70c6860c-1925-4858-be1f-23e9c6d5723a
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-236d1121-2075-4b6e-bb54-2e3e66decd6a
[2024-05-09, 06:25:09 -03] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=cluster_ira_ciss_daily_3irmaos, task_id=load.load_competitor_store_3irmaos, execution_date=20240508T070000, start_date=20240509T092347, end_date=20240509T092509
[2024-05-09, 06:25:09 -03] {local_task_job_runner.py:225} INFO - Task exited with return code 0
[2024-05-09, 06:25:09 -03] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check
How to reproduce
It's totally random; it happens in different DAGs with different tasks.
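Since the hang is random and rare, a common mitigation (not a fix) is to set `execution_timeout` plus `retries` on the task so a silent hang becomes a failed attempt that Airflow retries. At the process level the idea is a hard timeout on the child process, sketched here with a stand-in command instead of spark-submit:

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run a command but kill it if it exceeds timeout_s seconds,
    mirroring what Airflow's execution_timeout does to a stuck task."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)  # normal completion path
    except subprocess.TimeoutExpired:
        proc.kill()                          # hung: kill so the attempt can be retried
        proc.wait()
        return None                          # caller treats None as "timed out"

# A command that hangs, like the stuck spark-submit watcher:
rc = run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], timeout_s=1)
print("result:", rc)  # None -> timed out and was killed
```

In a DAG this would be something like `SparkSubmitOperator(..., execution_timeout=timedelta(hours=1), retries=1)`; the timeout value is an assumption you would tune to your longest expected job.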
Anything else
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for opening your first issue here! Be sure to follow the issue template. If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
Can you see if this is occurring on the latest version of the spark provider package (4.8.0)?
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.