Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator
This PR adds retry logic to KubernetesCreateResourceOperator and KubernetesJobOperator.
The logic is needed to work around the 'No agent available' error, which appears intermittently when users try to create a Resource or Job. The issue originates inside Kubernetes and currently has no fix. As a temporary workaround, we decided to retry the Job or Resource creation request each time this error appears.
The same issue reported for cert-manager: https://github.com/cert-manager/cert-manager/issues/6457
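As a rough illustration of the idea described above, the sketch below retries a creation request only when the error message contains 'No agent available' and re-raises anything else immediately. It uses only the standard library and is not the code from this PR: the names `ApiError`, `retry_on_no_agent`, `create_job`, and the `attempts` and `delay` parameters are hypothetical.

```python
import functools
import time


class ApiError(Exception):
    """Hypothetical stand-in for the error raised by a Kubernetes client."""


def retry_on_no_agent(attempts=3, delay=0.0):
    """Retry the wrapped call while it fails with 'No agent available'."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ApiError as err:
                    # Re-raise unrelated errors immediately, and give up
                    # once the last attempt has failed.
                    if "No agent available" not in str(err) or attempt == attempts:
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator


calls = {"count": 0}


@retry_on_no_agent(attempts=3)
def create_job():
    """Hypothetical creation request that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ApiError("No agent available")
    return "created"


result = create_job()
```

Matching on the error message keeps the retry narrow, so genuine failures such as permission errors still surface on the first attempt.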
Why don't you use Airflow's built-in retry parameter?
I use the same approach that we use to retry Pod creation: https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L347C1-L356C1
Also, I didn't find any other approach to retries in the cncf package, so I decided to use the same one. Could you please share an example of Airflow code that uses this built-in retry parameter?
I was thinking about the BaseOperator argument retries:

    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )

Related: https://github.com/apache/airflow/pull/15137
This is an option for users. If a user wants to retry a specific task, they can use this parameter. Here, if I understand correctly, @MaksYermak wants to retry without the user being aware or needing to do something.
Could you add tests to cover these retries?
Sure, I have added unit tests.
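For readers wondering what such tests might look like, here is a minimal sketch using `unittest.mock`. It is not the test code from this PR: `create_with_retry` is a hypothetical helper defined only for this example, and `RuntimeError` stands in for whatever exception the Kubernetes client actually raises.

```python
import unittest
from unittest import mock


def create_with_retry(create, attempts=3):
    """Hypothetical helper: call `create` again on 'No agent available'."""
    for attempt in range(1, attempts + 1):
        try:
            return create()
        except RuntimeError as err:
            # Unrelated errors and the final failed attempt are re-raised.
            if "No agent available" not in str(err) or attempt == attempts:
                raise


class TestCreateWithRetry(unittest.TestCase):
    def test_retries_until_success(self):
        # First call fails with the transient error, second call succeeds.
        create = mock.Mock(side_effect=[RuntimeError("No agent available"), "created"])
        self.assertEqual(create_with_retry(create), "created")
        self.assertEqual(create.call_count, 2)

    def test_gives_up_after_last_attempt(self):
        create = mock.Mock(side_effect=RuntimeError("No agent available"))
        with self.assertRaises(RuntimeError):
            create_with_retry(create, attempts=3)
        self.assertEqual(create.call_count, 3)

    def test_other_errors_are_not_retried(self):
        create = mock.Mock(side_effect=RuntimeError("Forbidden"))
        with self.assertRaises(RuntimeError):
            create_with_retry(create)
        self.assertEqual(create.call_count, 1)
```

The three cases cover the behaviours a reviewer would likely want pinned down: recovery after a transient failure, a bounded number of attempts, and no retry for unrelated errors.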
Hi @raphaelauv @dirrao @vincbeck ! Can you please check the changes here again? Thanks!