
UN-2798 [FIX] Fix WorkflowFileExecution stuck in EXECUTING status due to database connection errors

Open · muhammad-ali-e opened this pull request 3 months ago · 4 comments

What

  • Implemented centralized database retry mechanism with connection pool refresh for Django ORM operations
  • Added automatic retry logic for Celery database backend operations
  • Applied retry decorators to critical workflow chain components to prevent stuck executions

Why

  • WorkflowFileExecution records were getting stuck in "EXECUTING" status when database connections were dropped
  • Database connection errors in workflow chains were causing permanent execution failures without recovery
  • Lack of unified retry mechanism across Django ORM and Celery SQLAlchemy operations

How

  • Created db_constants.py with centralized error patterns, configuration, and logging templates
  • Implemented db_retry.py with a decorator and a context manager for Django ORM retry logic with connection pool refresh (see the sketch after this list)
  • Added celery_db_retry.py for Celery SQLAlchemy backend retry with engine disposal
  • Applied @db_retry() decorator to critical workflow chain methods in workflow execution modules
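A minimal sketch of what the db_retry() decorator referenced above might look like. This is an assumption based on the description, not the actual implementation; parameter names, the backoff formula, and the error classification may differ in the real module:

```python
import functools
import logging
import time

from django.db import close_old_connections
from django.db.utils import InterfaceError, OperationalError

logger = logging.getLogger(__name__)


def db_retry(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 30.0):
    """Retry a Django ORM operation on transient connection errors.

    Hypothetical sketch: the real module reportedly also classifies errors via
    db_constants and offers a context-manager form.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except (OperationalError, InterfaceError) as exc:
                    if attempt == max_retries:
                        raise
                    # Refresh the connection pool so the next attempt gets a
                    # fresh connection instead of reusing the broken one.
                    close_old_connections()
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    logger.warning(
                        "DB error in %s (attempt %d/%d), retrying in %.1fs: %s",
                        func.__qualname__, attempt + 1, max_retries, delay, exc,
                    )
                    time.sleep(delay)

        return wrapper

    return decorator
```

In the workflow modules this would then be applied exactly as the bullets describe, e.g. @db_retry() on the methods that update WorkflowFileExecution status.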

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled.)

  • No - The retry mechanism is defensive and only activates on connection errors
  • Existing functionality continues to work normally when database is healthy
  • Retry delays are configurable via environment variables with sensible defaults

Database Migrations

  • None required

Env Config

  • Optional: DB_RETRY_MAX_RETRIES (default: 3)
  • Optional: DB_RETRY_BASE_DELAY (default: 1.0 seconds)
  • Optional: DB_RETRY_MAX_DELAY (default: 30.0 seconds)
  • Optional: CELERY_USE_BUILTIN_RETRY (default: False)
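A hedged sketch of how these variables might be wired into Django settings; the settings names and parsing shown here are assumptions for illustration:

```python
import os

# Retry policy for Django ORM operations (all optional; defaults as listed above).
DB_RETRY_MAX_RETRIES = int(os.environ.get("DB_RETRY_MAX_RETRIES", "3"))
DB_RETRY_BASE_DELAY = float(os.environ.get("DB_RETRY_BASE_DELAY", "1.0"))
DB_RETRY_MAX_DELAY = float(os.environ.get("DB_RETRY_MAX_DELAY", "30.0"))

# Whether to rely on Celery's built-in result-backend retry instead of the
# custom exponential backoff.
CELERY_USE_BUILTIN_RETRY = os.environ.get(
    "CELERY_USE_BUILTIN_RETRY", "False"
).lower() in ("true", "1", "yes")
```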

Relevant Docs

  • Added comprehensive docstrings in all retry modules
  • Centralized configuration documentation in db_constants.py
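To illustrate what the centralized configuration in db_constants.py might contain, here is a rough sketch; the patterns and class shapes are assumptions, only the names RetryConfiguration and LogMessages come from the change summary below:

```python
from dataclasses import dataclass

# Substrings that typically indicate a dropped/poisoned connection rather than
# a query error (illustrative patterns; the real list lives in db_constants.py).
CONNECTION_ERROR_PATTERNS = (
    "server closed the connection unexpectedly",
    "connection already closed",
    "could not connect to server",
    "ssl connection has been closed unexpectedly",
)


@dataclass(frozen=True)
class RetryConfiguration:
    """Environment-driven retry policy."""
    max_retries: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0


class LogMessages:
    RETRYING = "DB operation failed (attempt %d/%d); retrying in %.1fs: %s"
    EXHAUSTED = "DB operation failed after %d attempts; giving up"


def is_connection_error(exc: Exception) -> bool:
    """Heuristic classification based on the error message."""
    message = str(exc).lower()
    return any(pattern in message for pattern in CONNECTION_ERROR_PATTERNS)
```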

Related Issues or PRs

  • Jira Ticket: UN-2798

Dependencies Versions

  • No new dependencies added

Notes on Testing

  • Test with database connection interruptions during workflow execution
  • Verify WorkflowFileExecution records complete successfully after temporary database outages
  • Monitor retry logs to ensure proper backoff behavior
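Part of this could be automated with a unit test along these lines. This is only a sketch: the import path follows the PR description, the decorator's keyword arguments are assumed, and it is expected to run under the project's Django test settings:

```python
from django.db.utils import OperationalError

from utils.db_retry import db_retry  # import path taken from the PR description


def test_db_retry_recovers_after_transient_error():
    calls = {"count": 0}

    @db_retry(max_retries=2, base_delay=0.0)
    def flaky_update():
        calls["count"] += 1
        if calls["count"] < 2:
            # Simulate a dropped connection on the first attempt.
            raise OperationalError("server closed the connection unexpectedly")
        return "ok"

    assert flaky_update() == "ok"
    assert calls["count"] == 2
```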

Screenshots

  • N/A

Checklist

I have read and understood the Contribution Guidelines.

muhammad-ali-e · Sep 09 '25 12:09

Summary by CodeRabbit

  • New Features

    • Automatic retry for database operations across organization, usage, and workflow actions to reduce transient failures.
    • Celery result backend gains hybrid, configurable retry with exponential backoff; background workers patched for greater resilience and stability.
  • Documentation

    • Added sample environment variables documenting DB/Celery retry and connection pool options.
  • Chores

    • Introduced environment-driven settings for retry counts/delays, connection timeouts, and SQL echo.
    • Enhanced logging of configured retry behavior for visibility during runtime.

Walkthrough

Adds centralized DB error classification and retry utilities, and applies a @db_retry() decorator across multiple Django DB entry points. It also introduces a hybrid Celery SQLAlchemy retry/patching module, updates the Celery config and worker startup to apply the patch, and adds environment-driven retry and PgBouncer settings plus sample.env documentation.

Changes

  • Core retry utilities (backend/utils/db_constants.py, backend/utils/db_retry.py): New modules providing error classification enums/patterns, RetryConfiguration and LogMessages, and reusable retry utilities (decorator, context manager, helpers) for Django ORM and Celery usage.
  • Celery retry infrastructure (backend/backend/celery_db_retry.py): New hybrid retry implementation supporting Celery built-in retry or custom exponential backoff; provides patching for Celery DatabaseBackend methods, engine options for PgBouncer, and helpers to dispose/refresh SQLAlchemy engines.
  • Celery integration points (backend/backend/celery_config.py, backend/backend/celery_service.py, backend/backend/workers/file_processing/file_processing.py, backend/backend/workers/file_processing_callback/file_processing_callback.py): Add result_backend_transport_options to CeleryConfig, conditionally configure built-in retry attributes, and call patch_celery_database_backend() after Celery app creation to enable DB retry behavior.
  • Django retry application, decorators (backend/account_v2/organization.py, backend/usage_v2/helper.py, backend/utils/models/organization_mixin.py, backend/utils/user_context.py, backend/workflow_manager/file_execution/models.py, backend/workflow_manager/workflow_v2/execution.py, backend/workflow_manager/workflow_v2/models/execution.py): Applied @db_retry() to numerous model/manager/static methods to add automatic retry semantics for transient DB errors; signatures unchanged.
  • Settings and documentation (backend/backend/settings/base.py, backend/sample.env): Added environment-driven settings for DB/Celery retry policies, connect timeout, and SQL echo; updated sample.env with commented examples documenting the new variables.
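For the celery_db_retry.py cohort above, the patching approach might look roughly like this. It is a sketch under assumptions: the real module's method list, error classification, and engine-refresh helpers may differ.

```python
import functools
import logging
import time

from celery.backends.database import DatabaseBackend

logger = logging.getLogger(__name__)

_METHODS_TO_PATCH = ("_store_result", "_get_task_meta_for")


def _with_retry(method, max_retries=3, base_delay=1.0, max_delay=30.0):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return method(self, *args, **kwargs)
            except Exception as exc:  # the real code would retry connection errors only
                if attempt == max_retries:
                    raise
                # Dispose the SQLAlchemy engine so stale pooled connections
                # are not reused on the next attempt.
                try:
                    self.ResultSession().get_bind().dispose()
                except Exception:
                    logger.debug("Engine dispose failed; continuing with retry")
                delay = min(base_delay * (2 ** attempt), max_delay)
                logger.warning("Celery backend DB error (%s); retrying in %.1fs", exc, delay)
                time.sleep(delay)

    return wrapper


def patch_celery_database_backend():
    """Wrap Celery DatabaseBackend methods with retry logic (hypothetical sketch)."""
    for name in _METHODS_TO_PATCH:
        original = getattr(DatabaseBackend, name)
        setattr(DatabaseBackend, name, _with_retry(original))
```

Per the integration-points cohort, patch_celery_database_backend() would then be invoked once, right after the Celery app is created in celery_service.py and the worker entrypoints.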

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Env as Environment
  participant App as Celery App
  participant Patch as celery_db_retry.patch_celery_database_backend
  participant Cfg as CeleryConfig
  participant DBB as Celery DatabaseBackend
  participant DB as PostgreSQL/PgBouncer

  Env->>Cfg: Read CELERY_* env vars
  Note right of Cfg: Set transport options and\nconditional built-in retry attrs
  App->>Patch: Invoke after Celery app creation
  Patch->>DBB: Patch _store_result/_get_task_meta_for/get_result...
  Note over DBB: Methods wrapped with\ncustom backoff or builtin retry

  App->>DBB: Store/get task result
  DBB->>DB: Execute SQL
  alt Connection/transient error
    DBB-->>DBB: Classify error (utils.db_constants)
    DBB-->>DBB: Backoff + optional pool dispose
    DBB->>DB: Retry until success/exhausted
  else Success
    DBB-->>App: Return result/meta
  end
```

```mermaid
sequenceDiagram
  autonumber
  participant Caller as App code
  participant Wrap as utils.db_retry.db_retry
  participant ORM as Django ORM
  participant DB as PostgreSQL/PgBouncer

  Caller->>Wrap: Call decorated function
  Wrap->>ORM: Execute query/operation
  ORM->>DB: SQL
  alt Error occurs
    Wrap-->>Wrap: Classify error & decide retry
    opt Requires pool refresh
      Wrap-->>ORM: close_old_connections()
    end
    Wrap->>DB: Retry with exponential backoff
    alt Exhausted
      Wrap-->>Caller: Raise last error
    else Success
      Wrap-->>Caller: Return result
    end
  else Success
    Wrap-->>Caller: Return result
  end
```
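The ORM flow in the second diagram corresponds to both the decorator and the context-manager forms of db_retry. A hypothetical usage of the context-manager form follows; the name db_retry_context and the status value are illustrative, not taken from the code:

```python
from utils.db_retry import db_retry_context  # hypothetical name for the context-manager form
from workflow_manager.file_execution.models import WorkflowFileExecution


def mark_file_execution_completed(file_execution_id: str) -> None:
    # Connection errors raised inside the block would be retried with
    # exponential backoff after the connection pool is refreshed.
    with db_retry_context():
        WorkflowFileExecution.objects.filter(id=file_execution_id).update(
            status="COMPLETED"  # illustrative status value
        )
```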

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 76.47%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title Check ✅ Passed: The title concisely and accurately reflects the main goal of the changeset by specifying the fix of WorkflowFileExecution stuck in EXECUTING due to database connection errors, aligning with the PR's primary objective without unnecessary detail.
  • Description Check ✅ Passed: The description uses all required template sections (What, Why, How, risk assessment, migrations, env config, docs, related issues, dependencies, testing notes, screenshots, and checklist), and each section is populated with relevant details, satisfying the repository's PR description guidelines.

coderabbitai[bot] · Sep 09 '25 12:09

| filepath | function | passed | SUBTOTAL |
| --- | --- | --- | --- |
| runner/src/unstract/runner/clients/test_docker.py | test_logs | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup_skip | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_client_init | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_exists | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config_without_mount | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_run_container | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_for_sidecar | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_sidecar_container | 1 | 1 |
| TOTAL | | 11 | 11 |

github-actions[bot] · Sep 10 '25 02:09

@muhammad-ali-e

  1. I was thinking the connection pool mechanism would automatically and internally try to reconnect to the DB after failures. Ideally we should only have had to tweak env settings for the retry interval, retry count, etc. Instead, it seems we are interfering directly with the connection pool lifecycle, which might overcomplicate things.

  2. Also, if we leave the retrying to the connection pool itself, then we need not sprinkle decorators at each and every call site and can instead handle it in a single central place.

Let's try to keep it simple and maybe revisit the approach?
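For reference, the pool-level alternative suggested here could be mostly configuration. A hedged sketch, assuming Django 4.1+ connection health checks on the ORM side and SQLAlchemy pre-ping options for the Celery database result backend:

```python
# Django side (settings): let the framework validate pooled connections before reuse.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # ... host/name/credentials ...
        "CONN_MAX_AGE": 300,         # keep connections open, but recycle periodically
        "CONN_HEALTH_CHECKS": True,  # Django 4.1+: ping a connection before reusing it
    }
}

# Celery side: engine options forwarded to SQLAlchemy's create_engine()
# for the database result backend.
database_engine_options = {
    "pool_pre_ping": True,  # test connections before handing them out
    "pool_recycle": 300,    # drop connections older than typical PgBouncer/server timeouts
}
```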

hari-kuriakose · Sep 10 '25 08:09