
UN-2798 [FIX] Fix WorkflowFileExecution stuck in EXECUTING status due to database connection errors

Open · muhammad-ali-e opened this pull request 3 months ago · 4 comments

What

  • Implemented centralized database retry mechanism with connection pool refresh for Django ORM operations
  • Added automatic retry logic for Celery database backend operations
  • Applied retry decorators to critical workflow chain components to prevent stuck executions

Why

  • WorkflowFileExecution records were getting stuck in "EXECUTING" status when database connections were dropped
  • Database connection errors in workflow chains were causing permanent execution failures without recovery
  • Lack of unified retry mechanism across Django ORM and Celery SQLAlchemy operations

How

  • Created db_constants.py with centralized error patterns, configuration, and logging templates
  • Implemented db_retry.py with a decorator and a context manager for Django ORM retry logic with connection pool refresh (see the sketch after this list)
  • Added celery_db_retry.py for Celery SQLAlchemy backend retry with engine disposal
  • Applied @db_retry() decorator to critical workflow chain methods in workflow execution modules
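A minimal sketch of what the db_retry() decorator referenced above might look like. This is an assumption based on the description, not the actual implementation; parameter names, the backoff formula, and the error classification may differ in the real module:

```python
import functools
import logging
import time

from django.db import close_old_connections
from django.db.utils import InterfaceError, OperationalError

logger = logging.getLogger(__name__)


def db_retry(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 30.0):
    """Retry a Django ORM operation on transient connection errors.

    Hypothetical sketch: the real module reportedly also classifies errors via
    db_constants and offers a context-manager form.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except (OperationalError, InterfaceError) as exc:
                    if attempt == max_retries:
                        raise
                    # Refresh the connection pool so the next attempt gets a
                    # fresh connection instead of reusing the broken one.
                    close_old_connections()
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    logger.warning(
                        "DB error in %s (attempt %d/%d), retrying in %.1fs: %s",
                        func.__qualname__, attempt + 1, max_retries, delay, exc,
                    )
                    time.sleep(delay)

        return wrapper

    return decorator
```

In the workflow modules this would then be applied exactly as the bullets describe, e.g. @db_retry() on the methods that update WorkflowFileExecution status.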

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled.)

  • No - The retry mechanism is defensive and only activates on connection errors
  • Existing functionality continues to work normally when database is healthy
  • Retry delays are configurable via environment variables with sensible defaults

Database Migrations

  • None required

Env Config

  • Optional: DB_RETRY_MAX_RETRIES (default: 3)
  • Optional: DB_RETRY_BASE_DELAY (default: 1.0 seconds)
  • Optional: DB_RETRY_MAX_DELAY (default: 30.0 seconds)
  • Optional: CELERY_USE_BUILTIN_RETRY (default: False)
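A hedged sketch of how these variables might be wired into Django settings; the settings names and parsing shown here are assumptions for illustration:

```python
import os

# Retry policy for Django ORM operations (all optional; defaults as listed above).
DB_RETRY_MAX_RETRIES = int(os.environ.get("DB_RETRY_MAX_RETRIES", "3"))
DB_RETRY_BASE_DELAY = float(os.environ.get("DB_RETRY_BASE_DELAY", "1.0"))
DB_RETRY_MAX_DELAY = float(os.environ.get("DB_RETRY_MAX_DELAY", "30.0"))

# Whether to rely on Celery's built-in result-backend retry instead of the
# custom exponential backoff.
CELERY_USE_BUILTIN_RETRY = os.environ.get(
    "CELERY_USE_BUILTIN_RETRY", "False"
).lower() in ("true", "1", "yes")
```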

Relevant Docs

  • Added comprehensive docstrings in all retry modules
  • Centralized configuration documentation in db_constants.py
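To illustrate what the centralized configuration in db_constants.py might contain, here is a rough sketch; the patterns and class shapes are assumptions, only the names RetryConfiguration and LogMessages come from the change summary below:

```python
from dataclasses import dataclass

# Substrings that typically indicate a dropped/poisoned connection rather than
# a query error (illustrative patterns; the real list lives in db_constants.py).
CONNECTION_ERROR_PATTERNS = (
    "server closed the connection unexpectedly",
    "connection already closed",
    "could not connect to server",
    "ssl connection has been closed unexpectedly",
)


@dataclass(frozen=True)
class RetryConfiguration:
    """Environment-driven retry policy."""
    max_retries: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0


class LogMessages:
    RETRYING = "DB operation failed (attempt %d/%d); retrying in %.1fs: %s"
    EXHAUSTED = "DB operation failed after %d attempts; giving up"


def is_connection_error(exc: Exception) -> bool:
    """Heuristic classification based on the error message."""
    message = str(exc).lower()
    return any(pattern in message for pattern in CONNECTION_ERROR_PATTERNS)
```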

Related Issues or PRs

  • Jira Ticket: UN-2798

Dependencies Versions

  • No new dependencies added

Notes on Testing

  • Test with database connection interruptions during workflow execution
  • Verify WorkflowFileExecution records complete successfully after temporary database outages
  • Monitor retry logs to ensure proper backoff behavior
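Part of this could be automated with a unit test along these lines. This is only a sketch: the import path follows the PR description, the decorator's keyword arguments are assumed, and it is expected to run under the project's Django test settings:

```python
from django.db.utils import OperationalError

from utils.db_retry import db_retry  # import path taken from the PR description


def test_db_retry_recovers_after_transient_error():
    calls = {"count": 0}

    @db_retry(max_retries=2, base_delay=0.0)
    def flaky_update():
        calls["count"] += 1
        if calls["count"] < 2:
            # Simulate a dropped connection on the first attempt.
            raise OperationalError("server closed the connection unexpectedly")
        return "ok"

    assert flaky_update() == "ok"
    assert calls["count"] == 2
```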

Screenshots

  • N/A

Checklist

I have read and understood the Contribution Guidelines.

muhammad-ali-e · Sep 09 '25 12:09

Summary by CodeRabbit

  • New Features

    • Automatic retry for database operations across organization, usage, and workflow actions to reduce transient failures.
    • Celery result backend gains hybrid, configurable retry with exponential backoff; background workers patched for greater resilience and stability.
  • Documentation

    • Added sample environment variables documenting DB/Celery retry and connection pool options.
  • Chores

    • Introduced environment-driven settings for retry counts/delays, connection timeouts, and SQL echo.
    • Enhanced logging of configured retry behavior for visibility during runtime.

Walkthrough

Adds centralized DB error classification and retry utilities, and applies a @db_retry() decorator across multiple Django DB entry points. It also introduces a hybrid Celery SQLAlchemy retry/patching module, updates the Celery config and worker startup to apply the patch, and adds environment-driven retry and PgBouncer settings plus sample.env documentation.

Changes

  • Core retry utilities (backend/utils/db_constants.py, backend/utils/db_retry.py): New modules providing error classification enums/patterns, RetryConfiguration and LogMessages, and reusable retry utilities (decorator, context manager, helpers) for Django ORM and Celery usage.
  • Celery retry infrastructure (backend/backend/celery_db_retry.py): New hybrid retry implementation supporting Celery built-in retry or custom exponential backoff; provides patching for Celery DatabaseBackend methods, engine options for PgBouncer, and helpers to dispose/refresh SQLAlchemy engines.
  • Celery integration points (backend/backend/celery_config.py, backend/backend/celery_service.py, backend/backend/workers/file_processing/file_processing.py, backend/backend/workers/file_processing_callback/file_processing_callback.py): Add result_backend_transport_options to CeleryConfig, conditionally configure built-in retry attributes, and call patch_celery_database_backend() after Celery app creation to enable DB retry behavior.
  • Django retry application, decorators (backend/account_v2/organization.py, backend/usage_v2/helper.py, backend/utils/models/organization_mixin.py, backend/utils/user_context.py, backend/workflow_manager/file_execution/models.py, backend/workflow_manager/workflow_v2/execution.py, backend/workflow_manager/workflow_v2/models/execution.py): Applied @db_retry() to numerous model/manager/static methods to add automatic retry semantics for transient DB errors; signatures unchanged.
  • Settings and documentation (backend/backend/settings/base.py, backend/sample.env): Added environment-driven settings for DB/Celery retry policies, connect timeout, and SQL echo; updated sample.env with commented examples documenting the new variables.
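For the celery_db_retry.py cohort above, the patching approach might look roughly like this. It is a sketch under assumptions: the real module's method list, error classification, and engine-refresh helpers may differ.

```python
import functools
import logging
import time

from celery.backends.database import DatabaseBackend

logger = logging.getLogger(__name__)

_METHODS_TO_PATCH = ("_store_result", "_get_task_meta_for")


def _with_retry(method, max_retries=3, base_delay=1.0, max_delay=30.0):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return method(self, *args, **kwargs)
            except Exception as exc:  # the real code would retry connection errors only
                if attempt == max_retries:
                    raise
                # Dispose the SQLAlchemy engine so stale pooled connections
                # are not reused on the next attempt.
                try:
                    self.ResultSession().get_bind().dispose()
                except Exception:
                    logger.debug("Engine dispose failed; continuing with retry")
                delay = min(base_delay * (2 ** attempt), max_delay)
                logger.warning("Celery backend DB error (%s); retrying in %.1fs", exc, delay)
                time.sleep(delay)

    return wrapper


def patch_celery_database_backend():
    """Wrap Celery DatabaseBackend methods with retry logic (hypothetical sketch)."""
    for name in _METHODS_TO_PATCH:
        original = getattr(DatabaseBackend, name)
        setattr(DatabaseBackend, name, _with_retry(original))
```

Per the integration-points cohort, patch_celery_database_backend() would then be invoked once, right after the Celery app is created in celery_service.py and the worker entrypoints.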

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Env as Environment
  participant App as Celery App
  participant Patch as celery_db_retry.patch_celery_database_backend
  participant Cfg as CeleryConfig
  participant DBB as Celery DatabaseBackend
  participant DB as PostgreSQL/PgBouncer

  Env->>Cfg: Read CELERY_* env vars
  Note right of Cfg: Set transport options and\nconditional built-in retry attrs
  App->>Patch: Invoke after Celery app creation
  Patch->>DBB: Patch _store_result/_get_task_meta_for/get_result...
  Note over DBB: Methods wrapped with\ncustom backoff or builtin retry

  App->>DBB: Store/get task result
  DBB->>DB: Execute SQL
  alt Connection/transient error
    DBB-->>DBB: Classify error (utils.db_constants)
    DBB-->>DBB: Backoff + optional pool dispose
    DBB->>DB: Retry until success/exhausted
  else Success
    DBB-->>App: Return result/meta
  end
```

```mermaid
sequenceDiagram
  autonumber
  participant Caller as App code
  participant Wrap as utils.db_retry.db_retry
  participant ORM as Django ORM
  participant DB as PostgreSQL/PgBouncer

  Caller->>Wrap: Call decorated function
  Wrap->>ORM: Execute query/operation
  ORM->>DB: SQL
  alt Error occurs
    Wrap-->>Wrap: Classify error & decide retry
    opt Requires pool refresh
      Wrap-->>ORM: close_old_connections()
    end
    Wrap->>DB: Retry with exponential backoff
    alt Exhausted
      Wrap-->>Caller: Raise last error
    else Success
      Wrap-->>Caller: Return result
    end
  else Success
    Wrap-->>Caller: Return result
  end
```
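The ORM flow in the second diagram corresponds to both the decorator and the context-manager forms of db_retry. A hypothetical usage of the context-manager form follows; the name db_retry_context and the status value are illustrative, not taken from the code:

```python
from utils.db_retry import db_retry_context  # hypothetical name for the context-manager form
from workflow_manager.file_execution.models import WorkflowFileExecution


def mark_file_execution_completed(file_execution_id: str) -> None:
    # Connection errors raised inside the block would be retried with
    # exponential backoff after the connection pool is refreshed.
    with db_retry_context():
        WorkflowFileExecution.objects.filter(id=file_execution_id).update(
            status="COMPLETED"  # illustrative status value
        )
```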

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 76.47%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title Check ✅ Passed: The title concisely and accurately reflects the main goal of the changeset by specifying the fix of WorkflowFileExecution stuck in EXECUTING due to database connection errors, aligning with the PR's primary objective without unnecessary detail.
  • Description Check ✅ Passed: The description uses all required template sections (What, Why, How, risk assessment, migrations, env config, docs, related issues, dependencies, testing notes, screenshots, and checklist), and each section is populated with relevant details, satisfying the repository's PR description guidelines.

coderabbitai[bot] · Sep 09 '25 12:09

| filepath | function | passed | SUBTOTAL |
| --- | --- | --- | --- |
| runner/src/unstract/runner/clients/test_docker.py | test_logs | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_cleanup_skip | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_client_init | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_exists | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_container_run_config_without_mount | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_run_container | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_get_image_for_sidecar | 1 | 1 |
| runner/src/unstract/runner/clients/test_docker.py | test_sidecar_container | 1 | 1 |
| TOTAL | | 11 | 11 |

github-actions[bot] · Sep 10 '25 02:09

@muhammad-ali-e

  1. I was thinking the connection pool mechanism would automatically and internally try to reconnect to the DB after failures. Ideally we should only have had to tweak env settings for the retry interval, retry count, etc. Instead, it seems we are interfering directly with the connection pool lifecycle, which might overcomplicate things.

  2. Also, if we leave the retrying to the connection pool itself, then we need not sprinkle decorators at each and every call site and can instead handle it in a single central place.

Let's try to keep it simple and maybe revisit the approach?
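For reference, the pool-level alternative suggested here could be mostly configuration. A hedged sketch, assuming Django 4.1+ connection health checks on the ORM side and SQLAlchemy pre-ping options for the Celery database result backend:

```python
# Django side (settings): let the framework validate pooled connections before reuse.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # ... host/name/credentials ...
        "CONN_MAX_AGE": 300,         # keep connections open, but recycle periodically
        "CONN_HEALTH_CHECKS": True,  # Django 4.1+: ping a connection before reusing it
    }
}

# Celery side: engine options forwarded to SQLAlchemy's create_engine()
# for the database result backend.
database_engine_options = {
    "pool_pre_ping": True,  # test connections before handing them out
    "pool_recycle": 300,    # drop connections older than typical PgBouncer/server timeouts
}
```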

hari-kuriakose · Sep 10 '25 08:09