
Optionally retry db connection errors related to temporary DNS failures

Open xBis7 opened this issue 5 months ago • 3 comments

We are using Airflow with Kubernetes and occasionally experience failures due to DNS resolution blips. These failures are temporary and usually go away within seconds.

The failure occurs when a new db connection is required and Airflow performs a DNS lookup on the db hostname.

This can be resolved by retrying to establish the DB connection but there are better approaches.

This patch adds a config flag that enables a db discovery check right before creating a new session object. If the option is turned on, we run socket.getaddrinfo(...) on the db hostname to see if it can be looked up. If the failure is a temporary DNS error, the check is retried a few times to give the lookup enough time to start succeeding again.

This approach was chosen for the following reasons:

  • socket.getaddrinfo(...) is less expensive and much faster than retrying to create a new session
  • not all kinds of DNS errors are temporary, so not all of them should be retried.
    • e.g. a wrong value on the database config option sql_alchemy_conn is also a DNS error but it shouldn't be retried
    • in case of an exception, the error that we get from socket.getaddrinfo(...) carries an error code which can be used for distinguishing between the different DNS errors
    • create_session() will give us an exception with the same error message but no other info
    • the error code is more reliable than the error message, which may change when the component that generates it is updated
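
As a rough illustration of the idea (not the code in this patch), the check could look something like the sketch below. The function name `db_host_resolves`, its parameters, and the defaults are hypothetical. It assumes the temporary case is the one reported as `socket.EAI_AGAIN`, and it takes the resolver and sleep function as arguments only so the behavior can be exercised without real DNS.

```python
import socket
import time


def db_host_resolves(hostname, port=5432, retries=3, delay=0.5,
                     resolver=socket.getaddrinfo, sleep=time.sleep):
    """Hypothetical helper: return True once `hostname` can be looked up.

    Only EAI_AGAIN (temporary failure in name resolution) is retried.
    Any other DNS error, e.g. a mistyped hostname, is re-raised at once.
    Returns False only if `retries` is less than 1 (no lookup attempted).
    """
    for attempt in range(1, retries + 1):
        try:
            resolver(hostname, port)
            return True
        except socket.gaierror as exc:
            # Decide based on the error code, not the error message.
            temporary = exc.errno == socket.EAI_AGAIN
            if not temporary or attempt == retries:
                raise
            sleep(delay)
    return False
```

With this shape, a wrong hostname fails on the first attempt, while a short DNS blip costs a few cheap lookups rather than repeated attempts to create a session.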

I've added a unit test and an integration test. The integration test hasn't been added under the integration package because it doesn't need any integration other than a db backend. The code is db agnostic but for the test I used specifically a postgres backend because it's easier to just hardcode a postgres address.

The check is turned off by default.



xBis7 avatar Jun 10 '25 16:06 xBis7

@jason810496 Thank you for the review, I'll look into your comments!

xBis7 avatar Jun 11 '25 14:06 xBis7

@jason810496 I've addressed all of your comments. Can you take another look?

xBis7 avatar Jun 11 '25 18:06 xBis7

@jason810496 Thank you, I addressed your new comments.

xBis7 avatar Jun 12 '25 15:06 xBis7

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 02 '25 00:08 github-actions[bot]