Fix DAG processor crash on MySQL connection failure during import error recording
Fixes #59166
The DAG processor was crashing when MySQL connection failures occurred while recording DAG import errors to the database. The root cause was missing `session.rollback()` calls after caught exceptions, leaving the SQLAlchemy session in an invalid state. When `session.flush()` was subsequently called, it would raise a new exception that wasn't caught, causing the DAG processor to crash and enter restart loops.
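To make the mechanism concrete, here is a minimal, self-contained SQLAlchemy sketch of the failure mode (not Airflow code): the `DagImportError` model, the SQLite engine, and the forced unique-constraint violation are stand-ins for the real import-error table and the MySQL connection drop.

```python
# Minimal reproduction of the failure mode, using SQLite and a unique-constraint
# violation as a stand-in for a MySQL connection drop during the import-error write.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError, PendingRollbackError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class DagImportError(Base):  # hypothetical stand-in for Airflow's import-error table
    __tablename__ = "import_error"
    id = Column(Integer, primary_key=True)
    filename = Column(String, unique=True, nullable=False)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(DagImportError(filename="dags/a.py"))
    session.flush()

    try:
        session.add(DagImportError(filename="dags/a.py"))  # duplicate -> flush fails
        session.flush()
    except IntegrityError:
        pass  # BUG: swallowing the error without session.rollback()

    try:
        session.add(DagImportError(filename="dags/b.py"))
        session.flush()  # raises PendingRollbackError -> DAG processor crash
    except PendingRollbackError:
        session.rollback()  # the missing call: restores the session to a usable state
```

The key point is the second `session.flush()`: once an earlier flush error has been swallowed without a rollback, SQLAlchemy refuses further work on that session until `session.rollback()` is called.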
This issue was observed in production environments where the DAG processor would restart 1,259 times in 4 days (~13 restarts/hour), leading to:
- Connection pool exhaustion
- Cascading failures across Airflow components
- Import errors not being recorded in the UI
- System instability
Changes
- Add `session.rollback()` after caught exceptions in `_update_import_errors()`
- Add `session.rollback()` after caught exceptions in `_update_dag_warnings()`
- Wrap `session.flush()` in try-except with `session.rollback()` on failure (see the sketch after this list)
- Add comprehensive unit tests for all failure scenarios
- Update comments to clarify error handling behavior
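As a rough illustration of that shape (a sketch under assumed structure, not the actual Airflow diff: `update_dag_parsing_results`, `write_import_errors`, and `write_dag_warnings` are hypothetical placeholders for the real entry point and the ORM writes inside `_update_import_errors()` / `_update_dag_warnings()`):

```python
# Sketch of the error-handling shape described above (assumed structure, not the
# literal Airflow code). Each DB step rolls back on failure so the session stays
# usable and the processor keeps going.
import logging
from typing import Callable

from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy.orm import Session

log = logging.getLogger(__name__)


def update_dag_parsing_results(
    session: Session,
    write_import_errors: Callable[[Session], None],
    write_dag_warnings: Callable[[Session], None],
) -> None:
    try:
        write_import_errors(session)
    except SQLAlchemyError:
        log.exception("Recording import errors failed; rolling back")
        session.rollback()  # added: keep the session usable for the next step

    try:
        write_dag_warnings(session)
    except SQLAlchemyError:
        log.exception("Recording DAG warnings failed; rolling back")
        session.rollback()  # added: same pattern for DAG warnings

    try:
        session.flush()
    except SQLAlchemyError:
        log.exception("Flushing parsing results failed; rolling back")
        session.rollback()  # added: a failed flush no longer propagates and crashes
```

The design point is that each database step is isolated: a failure in one rollback-protected block no longer poisons the session for the steps that follow it.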
Testing
Added 5 new unit tests in the `TestDagProcessorCrashFix` class:

- `test_update_dag_parsing_results_handles_db_failure_gracefully`
- `test_update_dag_parsing_results_handles_dag_warnings_db_failure_gracefully`
- `test_update_dag_parsing_results_handles_session_flush_failure_gracefully`
- `test_session_rollback_called_on_import_errors_failure`
- `test_session_rollback_called_on_dag_warnings_failure`
All tests pass and verify that:
- Database failures don't crash the DAG processor
- `session.rollback()` is called correctly on failures
- The processor continues gracefully after errors (a test sketch follows below)
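For reference, a sketch of what one of these tests can look like, written against the hypothetical `update_dag_parsing_results()` sketch above rather than the real Airflow internals:

```python
# Sketch of one such test: a mocked session plus a writer that fails with a
# driver-level error, asserting the caller survives and rolls back.
from unittest import mock

from sqlalchemy.exc import OperationalError

# Assumes the update_dag_parsing_results() sketch from the Changes section is importable.


def test_session_rollback_called_on_import_errors_failure():
    session = mock.MagicMock()
    failing_writer = mock.Mock(
        side_effect=OperationalError("INSERT", {}, Exception("MySQL server has gone away"))
    )
    noop_writer = mock.Mock()

    # Must not raise: a simulated DB failure should not crash the caller.
    update_dag_parsing_results(session, failing_writer, noop_writer)

    session.rollback.assert_called()
    noop_writer.assert_called_once_with(session)  # later steps still run
```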
Impact
The fix ensures the DAG processor gracefully handles database connection failures and continues processing other DAGs instead of crashing, preventing production outages from restart loops.
Thanks. Nice one.
From https://docs.sqlalchemy.org/en/20/orm/session_basics.html#flushing
When a failure occurs within a flush, in order to continue using that same Session, an explicit call to Session.rollback() is required after a flush fails, even though the underlying transaction will have been rolled back already (even if the database driver is technically in driver-level autocommit mode). This is so that the overall nesting pattern of so-called “subtransactions” is consistently maintained. The FAQ section “This Session’s transaction has been rolled back due to a previous exception during flush.” (or similar) contains a more detailed description of this behavior.
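In code, that rule boils down to something like this (a generic SQLAlchemy sketch, not Airflow-specific; `flush_or_rollback` is a made-up helper name):

```python
# The pattern the quoted docs require: an explicit rollback after a failed flush,
# even though the underlying DB transaction has already been rolled back.
from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy.orm import Session


def flush_or_rollback(session: Session) -> bool:
    """Hypothetical helper: flush, and restore the session to a usable state on failure."""
    try:
        session.flush()
        return True
    except SQLAlchemyError:
        session.rollback()  # required before the same Session can be reused
        return False
```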
Maybe the root cause of the MySQL connection failure is https://github.com/apache/airflow/issues/56879. Aside from import error recording, have you ever encountered any connection failures?
Totally agree @wjddn279. Even more, the fix I suggested here kind of silences the issue you started dealing with, since it rolls back the in-flight transactions; right now the connection close is handled in Airflow in a way that crashes the service.
Yeah. Worth fixing it with gc freezing, I think.