
Fix DAG processor crash on MySQL connection failure during import error recording

Open AmosG opened this issue 3 weeks ago • 5 comments


Fixes #59166

The DAG processor was crashing when MySQL connection failures occurred while recording DAG import errors to the database. The root cause was missing session.rollback() calls after caught exceptions, leaving the SQLAlchemy session in an invalid state. When session.flush() was subsequently called, it would raise a new exception that wasn't caught, causing the DAG processor to crash and enter restart loops.

This issue was observed in production environments where the DAG processor would restart 1,259 times in 4 days (~13 restarts/hour), leading to:

  • Connection pool exhaustion
  • Cascading failures across Airflow components
  • Import errors not being recorded in the UI
  • System instability

Changes

  • Add session.rollback() after caught exceptions in _update_import_errors()
  • Add session.rollback() after caught exceptions in _update_dag_warnings()
  • Wrap session.flush() in try-except with session.rollback() on failure (see the sketch after this list)
  • Add comprehensive unit tests for all failure scenarios
  • Update comments to clarify error handling behavior
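
For illustration, a minimal sketch of the rollback pattern described above. The names record_import_errors and persist_one are hypothetical stand-ins, not the actual Airflow helpers (_update_import_errors and friends take different arguments):

```python
import logging
from typing import Callable

from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy.orm import Session

log = logging.getLogger(__name__)


def record_import_errors(
    session: Session,
    import_errors: dict[str, str],
    persist_one: Callable[[Session, str, str], None],
) -> None:
    """Persist DAG import errors without letting a DB failure escape.

    Sketch only: persist_one stands in for whatever writes a single
    import-error row; the real Airflow code paths differ.
    """
    try:
        for filename, stacktrace in import_errors.items():
            persist_one(session, filename, stacktrace)
    except SQLAlchemyError:
        log.exception("Failed to record DAG import errors; rolling back")
        # Without this rollback the session stays in a failed state and the
        # next flush() raises an uncaught exception, which is what crashed
        # the DAG processor.
        session.rollback()
        return

    try:
        session.flush()
    except SQLAlchemyError:
        log.exception("Flush failed while recording DAG import errors")
        # Per the SQLAlchemy docs, an explicit rollback is required after a
        # failed flush before the same session can be used again.
        session.rollback()
```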

Testing

Added 5 new unit tests in the TestDagProcessorCrashFix class:

  • test_update_dag_parsing_results_handles_db_failure_gracefully
  • test_update_dag_parsing_results_handles_dag_warnings_db_failure_gracefully
  • test_update_dag_parsing_results_handles_session_flush_failure_gracefully
  • test_session_rollback_called_on_import_errors_failure
  • test_session_rollback_called_on_dag_warnings_failure

All tests pass and verify that:

  1. Database failures don't crash the DAG processor
  2. session.rollback() is called correctly on failures (see the test sketch below)
  3. The processor continues gracefully after errors
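
As one illustration of the rollback assertions, a hedged pytest-style sketch using unittest.mock. It exercises the hypothetical record_import_errors helper sketched in the description above, not the actual Airflow test code:

```python
from unittest import mock

from sqlalchemy.exc import OperationalError

# record_import_errors is the sketch from the PR description above;
# import or paste it here before running the test.


def test_rollback_called_when_recording_import_errors_fails():
    session = mock.MagicMock()

    def failing_persist(session, filename, stacktrace):
        # Simulate a lost MySQL connection mid-write.
        raise OperationalError("INSERT ...", {}, Exception("server has gone away"))

    record_import_errors(
        session, {"dags/broken.py": "ImportError: boom"}, failing_persist
    )

    # The failure must be swallowed, the session rolled back, and the
    # flush skipped so the processor can keep going.
    session.rollback.assert_called_once()
    session.flush.assert_not_called()
```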

Impact

The fix ensures the DAG processor gracefully handles database connection failures and continues processing other DAGs instead of crashing, preventing production outages from restart loops.

AmosG · Dec 07 '25 19:12

Thanks. Nice one.

potiuk · Dec 08 '25 22:12

From https://docs.sqlalchemy.org/en/20/orm/session_basics.html#flushing

When a failure occurs within a flush, in order to continue using that same Session, an explicit call to Session.rollback() is required after a flush fails, even though the underlying transaction will have been rolled back already (even if the database driver is technically in driver-level autocommit mode). This is so that the overall nesting pattern of so-called “subtransactions” is consistently maintained. The FAQ section “This Session’s transaction has been rolled back due to a previous exception during flush.” (or similar) contains a more detailed description of this behavior.
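
For illustration, a self-contained sketch of that behavior against an in-memory SQLite database (table and names are made up): once a flush fails and the required rollback is skipped, the next use of the session raises PendingRollbackError, which is exactly the failure mode that crashed the DAG processor:

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.exc import IntegrityError, PendingRollbackError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# Seed a row so the next insert violates the primary key.
with Session(engine) as seed:
    seed.add(Item(id=1))
    seed.commit()

with Session(engine) as session:
    session.add(Item(id=1))  # duplicate primary key
    try:
        session.flush()  # fails with IntegrityError
    except IntegrityError:
        pass  # the required session.rollback() is deliberately skipped

    session.add(Item(id=2))
    try:
        session.flush()  # raises PendingRollbackError: session is unusable
    except PendingRollbackError as exc:
        print(type(exc).__name__, exc)
```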

potiuk · Dec 08 '25 22:12

Maybe the root cause of the MySQL connection failure is https://github.com/apache/airflow/issues/56879. Aside from import error recording, have you ever encountered any connection failures?

wjddn279 · Dec 10 '25 02:12

> Maybe the root cause of the MySQL connection failure is #56879. Aside from import error recording, have you ever encountered any connection failures?

Totally agree @wjddn279. Even more, the fix I suggested here kind of silences the issue you started dealing with, since the in-flight transactions are rolled back; as it stands, the connection close is handled in Airflow in a way that crashes the service.

AmosG · Dec 10 '25 07:12

> Maybe the root cause of the MySQL connection failure is #56879. Aside from import error recording, have you ever encountered any connection failures?

> Totally agree @wjddn279. Even more, the fix I suggested here kind of silences the issue you started dealing with, since the in-flight transactions are rolled back; as it stands, the connection close is handled in Airflow in a way that crashes the service.

Yeah. Worth fixing it with gc freezing I think.
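
For reference, "gc freezing" here presumably means Python's gc.freeze(); a hedged sketch of the usual pre-fork pattern (not an actual Airflow patch):

```python
import gc
import os

# gc.freeze() moves all objects currently tracked by the cyclic garbage
# collector into a "permanent generation" that future collections ignore.
# Called in the parent right before forking, it keeps collections in the
# children from touching objects inherited from the parent (the documented
# use case is preserving copy-on-write pages).
gc.freeze()

pid = os.fork()
if pid == 0:
    # Child process: only objects created after the fork are collected.
    os._exit(0)

os.waitpid(pid, 0)
gc.unfreeze()  # parent may resume normal collection of those objects
```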

potiuk · Dec 10 '25 18:12