Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator

Open MaksYermak opened this issue 10 months ago • 5 comments

In this PR I have added retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator.

This logic is needed to work around the 'No agent available' error, which appears from time to time when users try to create a Resource or a Job. The issue is inside Kubernetes and currently has no solution. As a temporary workaround, we decided to retry the Job or Resource creation request each time this error appears.

Link for the same issue for cert-manager service: https://github.com/cert-manager/cert-manager/issues/6457
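
For readers who want to see the shape of such a retry, here is a minimal sketch using only the Python standard library. It is not the code from this PR: `ApiError`, `retry_on_no_agent`, and `create_job` are hypothetical names, and matching on the error message text is an assumption made for illustration.

```python
import functools
import time


class ApiError(Exception):
    """Hypothetical stand-in for the exception a Kubernetes client raises."""


def retry_on_no_agent(max_attempts=3, delay_seconds=0.0):
    """Retry the wrapped call only when the error mentions 'No agent available'."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ApiError as err:
                    transient = "No agent available" in str(err)
                    # Re-raise unrelated errors, or the last transient one.
                    if not transient or attempt == max_attempts:
                        raise
                    time.sleep(delay_seconds)

        return wrapper

    return decorator


calls = {"count": 0}


@retry_on_no_agent(max_attempts=3)
def create_job():
    """Hypothetical creation request that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ApiError("No agent available")
    return "job created"


result = create_job()
```

Any other error is re-raised immediately, so only the transient failure is retried and genuine creation errors still surface on the first attempt.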



MaksYermak avatar Apr 23 '24 10:04 MaksYermak

why don't you use the internal retry parameter of airflow ?

raphaelauv avatar Apr 23 '24 11:04 raphaelauv

why don't you use the internal retry parameter of airflow ?

I used the same approach that we use to retry Pod creation: https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L347C1-L356C1 I also didn't see any other approach to retries in the cncf package, so I decided to use the same one here. Could you please share an example of Airflow code that uses this internal retry parameter?

MaksYermak avatar Apr 23 '24 12:04 MaksYermak

I was thinking about the BaseOperator argument retries

    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )

raphaelauv avatar Apr 23 '24 12:04 raphaelauv

related : https://github.com/apache/airflow/pull/15137

raphaelauv avatar Apr 23 '24 12:04 raphaelauv

I was thinking about the BaseOperator argument retries

    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )

This is an option for users. If a user wants to retry a specific task, they can use this parameter. Here, if I understand correctly, @MaksYermak wants to retry without the user being aware or needing to do something.
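
To make the difference concrete, here is a small simulation in plain Python. It does not use Airflow: `run_task` only mimics the effect of a task-level `retries` setting, and `execute_with_internal_retry`, `make_flaky_create`, and `TransientApiError` are hypothetical names introduced for this sketch.

```python
class TransientApiError(Exception):
    """Hypothetical stand-in for the transient 'No agent available' failure."""


def make_flaky_create(failures):
    """Return a create() that fails `failures` times before succeeding."""
    state = {"calls": 0}

    def create():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientApiError("No agent available")
        return "created"

    return create


def run_task(execute, retries=0):
    """Mimic task-level retries: rerun the whole task up to `retries` extra times."""
    for task_attempt in range(1, retries + 2):
        try:
            return execute(), task_attempt
        except TransientApiError:
            if task_attempt == retries + 1:
                raise


def execute_with_internal_retry(create, max_attempts=3):
    """Mimic an operator that retries the creation request inside execute()."""
    for attempt in range(1, max_attempts + 1):
        try:
            return create()
        except TransientApiError:
            if attempt == max_attempts:
                raise


# Option 1: the user opts in with retries=3; the whole task runs three times.
user_level = run_task(make_flaky_create(failures=2), retries=3)

# Option 2: the user sets nothing; the task runs once and retries internally.
flaky = make_flaky_create(failures=2)
internal = run_task(lambda: execute_with_internal_retry(flaky), retries=0)
```

In the first case the task succeeds on its third attempt, and only because the user configured retries. In the second case it succeeds on its first attempt with no configuration, which is the behaviour the PR is after.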

vincbeck avatar Apr 23 '24 14:04 vincbeck

Could you add tests to cover these retries?

Sure, I have added unit tests.

MaksYermak avatar May 07 '24 10:05 MaksYermak

Hi @raphaelauv @dirrao @vincbeck ! Can you please check the changes here again? Thanks!

VladaZakharova avatar May 10 '24 09:05 VladaZakharova