UN-2798 [FIX] Fix WorkflowFileExecution stuck in EXECUTING status due to database connection errors
What
- Implemented centralized database retry mechanism with connection pool refresh for Django ORM operations
- Added automatic retry logic for Celery database backend operations
- Applied retry decorators to critical workflow chain components to prevent stuck executions
Why
- WorkflowFileExecution records were getting stuck in "EXECUTING" status when database connections were dropped
- Database connection errors in workflow chains were causing permanent execution failures without recovery
- Lack of unified retry mechanism across Django ORM and Celery SQLAlchemy operations
How
- Created `db_constants.py` with centralized error patterns, configuration, and logging templates
- Implemented `db_retry.py` with a decorator and context manager for Django ORM retry logic with connection pool refresh
- Added `celery_db_retry.py` for Celery SQLAlchemy backend retry with engine disposal
- Applied the `@db_retry()` decorator to critical workflow chain methods in workflow execution modules (a minimal sketch of the decorator shape follows this list)
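For reference, here is a minimal sketch of the decorator shape implied above, assuming exponential backoff and Django's `close_old_connections()` as the pool-refresh step. Names, defaults, and error types here are illustrative and not the actual `db_retry.py` implementation.

```python
# Hypothetical sketch of a @db_retry() decorator; defaults mirror the env vars
# documented below, but this is not the PR's actual code.
import functools
import logging
import time

from django.db import close_old_connections
from django.db.utils import InterfaceError, OperationalError

logger = logging.getLogger(__name__)


def db_retry(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 30.0):
    """Retry a Django ORM operation on transient connection errors."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except (OperationalError, InterfaceError) as exc:
                    if attempt == max_retries:
                        raise
                    # Drop stale connections so the next attempt gets a fresh one.
                    close_old_connections()
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    logger.warning(
                        "DB error in %s (attempt %d/%d), retrying in %.1fs: %s",
                        func.__name__, attempt + 1, max_retries, delay, exc,
                    )
                    time.sleep(delay)

        return wrapper

    return decorator
```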
Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- No - The retry mechanism is defensive and only activates on connection errors
- Existing functionality continues to work normally when database is healthy
- Retry delays are configurable via environment variables with sensible defaults
Database Migrations
- None required
Env Config
- Optional: `DB_RETRY_MAX_RETRIES` (default: 3)
- Optional: `DB_RETRY_BASE_DELAY` (default: 1.0 seconds)
- Optional: `DB_RETRY_MAX_DELAY` (default: 30.0 seconds)
- Optional: `CELERY_USE_BUILTIN_RETRY` (default: False)
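As an illustration of how these variables could be consumed in Django settings, a sketch is shown below; the actual wiring is in this PR's settings changes and may differ.

```python
# Hypothetical settings wiring; variable names match the env vars above and
# defaults match the documented ones.
import os

DB_RETRY_MAX_RETRIES = int(os.environ.get("DB_RETRY_MAX_RETRIES", "3"))
DB_RETRY_BASE_DELAY = float(os.environ.get("DB_RETRY_BASE_DELAY", "1.0"))
DB_RETRY_MAX_DELAY = float(os.environ.get("DB_RETRY_MAX_DELAY", "30.0"))
CELERY_USE_BUILTIN_RETRY = (
    os.environ.get("CELERY_USE_BUILTIN_RETRY", "False").lower() == "true"
)
```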
Relevant Docs
- Added comprehensive docstrings in all retry modules
- Centralized configuration documentation in `db_constants.py`
Related Issues or PRs
- Jira Ticket: UN-2798
Dependencies Versions
- No new dependencies added
Notes on Testing
- Test with database connection interruptions during workflow execution
- Verify WorkflowFileExecution records complete successfully after temporary database outages
- Monitor retry logs to ensure proper backoff behavior
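To complement manual interruption testing, a hedged unit-test sketch for the retry path might look like the following; the import path and decorator signature are assumptions based on this PR's description.

```python
# Hypothetical pytest-style sketch: force one connection error, then succeed,
# and assert the decorated call recovers.
from django.db.utils import OperationalError

from utils.db_retry import db_retry  # assumed import path


def test_db_retry_recovers_after_one_connection_error():
    calls = {"count": 0}

    @db_retry(max_retries=3, base_delay=0.0)
    def flaky_query():
        calls["count"] += 1
        if calls["count"] == 1:
            raise OperationalError("server closed the connection unexpectedly")
        return "ok"

    assert flaky_query() == "ok"
    assert calls["count"] == 2
```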
Screenshots
- N/A
Checklist
I have read and understood the Contribution Guidelines.
Summary by CodeRabbit
- New Features
  - Automatic retry for database operations across organization, usage, and workflow actions to reduce transient failures.
  - Celery result backend gains hybrid, configurable retry with exponential backoff; background workers patched for greater resilience and stability.
- Documentation
  - Added sample environment variables documenting DB/Celery retry and connection pool options.
- Chores
  - Introduced environment-driven settings for retry counts/delays, connection timeouts, and SQL echo.
  - Enhanced logging of configured retry behavior for visibility during runtime.
Walkthrough
Adds centralized DB error classification and retry utilities, applies a @db_retry() decorator across multiple Django DB entrypoints, introduces a hybrid Celery SQLAlchemy retry/patching module, updates Celery config and worker startup to apply the patch, and adds environment-driven retry and PgBouncer settings plus sample.env documentation.
Changes
| Cohort / File(s) | Summary |
|---|---|
| Core retry utilities: `backend/utils/db_constants.py`, `backend/utils/db_retry.py` | New modules providing error classification enums/patterns, RetryConfiguration and LogMessages, and reusable retry utilities (decorator, context manager, helpers) for Django ORM and Celery usage. |
| Celery retry infrastructure: `backend/backend/celery_db_retry.py` | New hybrid retry implementation supporting Celery built-in retry or custom exponential backoff; provides patching for Celery DatabaseBackend methods, engine options for PgBouncer, and helpers to dispose/refresh SQLAlchemy engines. |
| Celery integration points: `backend/backend/celery_config.py`, `backend/backend/celery_service.py`, `backend/backend/workers/file_processing/file_processing.py`, `backend/backend/workers/file_processing_callback/file_processing_callback.py` | Add result_backend_transport_options to CeleryConfig, conditionally configure built-in retry attributes, and call patch_celery_database_backend() after Celery app creation to enable DB retry behavior. |
| Django retry application (decorators): `backend/account_v2/organization.py`, `backend/usage_v2/helper.py`, `backend/utils/models/organization_mixin.py`, `backend/utils/user_context.py`, `backend/workflow_manager/file_execution/models.py`, `backend/workflow_manager/workflow_v2/execution.py`, `backend/workflow_manager/workflow_v2/models/execution.py` | Applied @db_retry() to numerous model/manager/static methods to add automatic retry semantics for transient DB errors; signatures unchanged (see the usage sketch after this table). |
| Settings and documentation: `backend/backend/settings/base.py`, `backend/sample.env` | Added environment-driven settings for DB/Celery retry policies, connect timeout, and SQL echo; updated sample.env with commented examples documenting the new variables. |
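As an illustration of the decorator application described in the Django retry row above, a hypothetical call site might look like this; the function name, status value, and import paths are assumptions, not the PR's actual code.

```python
# Hypothetical call-site usage; an illustrative stand-in for the decorated
# methods in the workflow modules listed above.
from utils.db_retry import db_retry  # assumed import path

from workflow_manager.file_execution.models import WorkflowFileExecution  # assumed import path


@db_retry()
def mark_execution_completed(file_execution_id: str) -> None:
    """Update a WorkflowFileExecution row, retrying on transient connection errors."""
    WorkflowFileExecution.objects.filter(id=file_execution_id).update(status="COMPLETED")
```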
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    actor Env as Environment
    participant App as Celery App
    participant Patch as celery_db_retry.patch_celery_database_backend
    participant Cfg as CeleryConfig
    participant DBB as Celery DatabaseBackend
    participant DB as PostgreSQL/PgBouncer
    Env->>Cfg: Read CELERY_* env vars
    Note right of Cfg: Set transport options and<br/>conditional built-in retry attrs
    App->>Patch: Invoke after Celery app creation
    Patch->>DBB: Patch _store_result/_get_task_meta_for/get_result...
    Note over DBB: Methods wrapped with<br/>custom backoff or builtin retry
    App->>DBB: Store/get task result
    DBB->>DB: Execute SQL
    alt Connection/transient error
        DBB-->>DBB: Classify error (utils.db_constants)
        DBB-->>DBB: Backoff + optional pool dispose
        DBB->>DB: Retry until success/exhausted
    else Success
        DBB-->>App: Return result/meta
    end
```
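A minimal sketch of what the `patch_celery_database_backend()` step in this diagram could look like is shown below. `DatabaseBackend` and its `_store_result`/`_get_task_meta_for` methods are real Celery internals, but the wrapper is an assumption; the real module also handles PgBouncer engine options and engine disposal, which are omitted here.

```python
# Hypothetical patching sketch for the Celery SQLAlchemy result backend;
# not this PR's actual implementation.
import functools
import logging
import time

from celery.backends.database import DatabaseBackend

logger = logging.getLogger(__name__)


def _with_retry(method, max_retries=3, base_delay=1.0, max_delay=30.0):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return method(self, *args, **kwargs)
            except Exception as exc:  # the real module would classify connection errors only
                if attempt == max_retries:
                    raise
                delay = min(base_delay * (2 ** attempt), max_delay)
                logger.warning("Celery backend DB error, retrying in %.1fs: %s", delay, exc)
                time.sleep(delay)

    return wrapper


def patch_celery_database_backend():
    """Wrap result-storage methods so transient DB errors are retried with backoff."""
    for name in ("_store_result", "_get_task_meta_for"):
        setattr(DatabaseBackend, name, _with_retry(getattr(DatabaseBackend, name)))
```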
```mermaid
sequenceDiagram
    autonumber
    participant Caller as App code
    participant Wrap as utils.db_retry.db_retry
    participant ORM as Django ORM
    participant DB as PostgreSQL/PgBouncer
    Caller->>Wrap: Call decorated function
    Wrap->>ORM: Execute query/operation
    ORM->>DB: SQL
    alt Error occurs
        Wrap-->>Wrap: Classify error & decide retry
        opt Requires pool refresh
            Wrap-->>ORM: close_old_connections()
        end
        Wrap->>DB: Retry with exponential backoff
        alt Exhausted
            Wrap-->>Caller: Raise last error
        else Success
            Wrap-->>Caller: Return result
        end
    else Success
        Wrap-->>Caller: Return result
    end
```
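The "Classify error" step above depends on the patterns centralized in `db_constants.py`. A hedged sketch of such a classifier follows; the pattern list is illustrative, not the actual one.

```python
# Hypothetical error-classification helper; the pattern list is illustrative only.
CONNECTION_ERROR_PATTERNS = (
    "server closed the connection unexpectedly",
    "connection already closed",
    "could not connect to server",
    "connection timed out",
)


def is_retryable_db_error(exc: Exception) -> bool:
    """Return True if the exception message matches a known transient-connection pattern."""
    message = str(exc).lower()
    return any(pattern in message for pattern in CONNECTION_ERROR_PATTERNS)
```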
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks (2 passed, 1 warning)
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 76.47% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title concisely and accurately reflects the main goal of the changeset by specifying the fix of WorkflowFileExecution stuck in EXECUTING due to database connection errors, aligning with the PR’s primary objective without unnecessary detail. |
| Description Check | ✅ Passed | The description uses all required template sections—What, Why, How, risk assessment, migrations, env config, docs, related issues, dependencies, testing notes, screenshots, and checklist—and each section is populated with relevant details, satisfying the repository’s PR description guidelines. |
| filepath | function | passed | SUBTOTAL |
|---|---|---|---|
| runner/src/unstract/runner/clients/test_docker.py | test_logs | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup_skip | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_client_init | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_exists | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config_without_mount | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_run_container | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_for_sidecar | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_sidecar_container | 1 | 1 |
| TOTAL | | 11 | 11 |
Quality Gate passed
Issues
- 0 New issues
- 0 Accepted issues
Measures
- 0 Security Hotspots
- 0.0% Coverage on New Code
- 0.0% Duplication on New Code
@muhammad-ali-e
- I was expecting the connection pool mechanism to retry connecting to the DB automatically after failures. Ideally we should only have had to tweak env settings for retry interval, retry count, etc. Instead we seem to be interfering directly with the connection pool lifecycle, which might overcomplicate things.
- Also, if we leave the retrying to the connection pool itself, we would not need to sprinkle decorators across every call site; the logic could live in a single central place instead.

Let's try to keep it simple and maybe revisit the approach?