spark-operator
Subsequent Delete and Add operations causes a dangling App
Hello,
Using image gcr.io/spark-operator/spark-operator:v1beta2-1.2.1-3.0.0
I encountered a weird problem when using Airflow's SparkKubernetesOperator and SparkKubernetesSensor.
TL;DR: the SparkKubernetesOperator makes 2 subsequent API calls to the spark-operator: `Delete`, and immediately after, `Add`.
Then, the sensor monitors the app state until it's `FAILED`, `UNKNOWN`, or `COMPLETED`.
But because the subsequent `Delete` and `Add` calls happen while the same app is already running, the app is terminated without a new instance of it starting, and as a result the state field is left blank.
What happens on the spark-on-k8s-operator side:
- The `Delete` kills the driver pod, which starts terminating. It also deletes the `SparkApplication` object.
- The new `SparkApplication` object is added, and the controller calls `spark-submit`.
- If the driver pod hasn't finished its termination process, the spark-submit fails with the following message: `object is being deleted: pods "some-app-driver" already exists`
- When the controller fails to submit because there is an existing driver pod, the app state isn't updated and is left blank.
- The driver pod terminates and tries to update the app state. But because it belongs to a different `submission-id`, when it calls `enqueueSparkAppForUpdate`, the check here fails.
- The app state is never updated, and because the app isn't running, it never will be.
Correct me if I'm wrong, but this is unwanted behavior (for any identical subsequent API calls, not only Airflow's operator-specific implementation, of course).
I'll be happy to work on the solution if you think one is required. I thought of the following:
- Treat the `podAlreadyExistsError` the same way other submission errors are treated, and update the state to `SparkSubmissionFailed`.
- Do the above, but also check that the driver pod's `submission-id` label differs from the current submission attempt (I'm not sure why the `podAlreadyExists` case is treated differently; maybe you can elaborate).
- Introduce a `Deleting` state. Instead of deleting the `SparkApplication` object upon the `Delete` API call, do it only after the driver pod has finished and `OnPodDelete` has been called, which in turn will enqueue the app deletion so the object is deleted in the sync function's switch-case clause.
Thank you in advance.