argo-workflows icon indicating copy to clipboard operation
argo-workflows copied to clipboard

feat: retry connections on network errors. Fixes #15011

Open isubasinghe opened this issue 2 months ago โ€ข 1 comments

Fixes #15011

Motivation

Modifications

Verification

Documentation

isubasinghe avatar Nov 05 '25 01:11 isubasinghe

Not sure how to test this unfortunately, any recommendations here? Maybe spin up test containers, spin down postgres and then spin it up again?

EDIT: I have a testing solution, going to write up a test logger to collect logs, spin down postgres, witness the logs, spin up postgres and witness the logs indicating recovery.

isubasinghe avatar Nov 17 '25 09:11 isubasinghe

@coderabbitai full review

Joibel avatar Dec 10 '25 13:12 Joibel

โœ… Actions performed

Full review triggered.

coderabbitai[bot] avatar Dec 10 '25 13:12 coderabbitai[bot]

[!NOTE]

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This pull request implements database query resilience by introducing a SessionProxy abstraction that automatically reconnects and retries failed queries on network errors. The feature includes configurable retry parameters (max retries, backoff delays, multiplier), network error detection, and integration across persistence, sync, and controller layers.

Changes

Cohort / File(s) Summary
Session Proxy Core Infrastructure
util/sqldb/session.go, util/sqldb/session_test.go
Introduces SessionProxy type with automatic reconnection support, exponential backoff retry logic, network error detection, and methods (Tx, TxWith, With, Reconnect, Session, Close) for safe DB operations. Includes test coverage for PostgreSQL container-based reconnection scenarios.
Configuration & Documentation
.features/pending/feat-refresh-connection.md, config/config.go, docs/workflow-controller-configmap.md
Adds DBReconnectConfig struct with retry parameters (MaxRetries, BaseDelaySeconds, MaxDelaySeconds, RetryMultiple) to DBConfig, PersistConfig, and SyncConfig. Includes feature documentation describing reconnect and retry behavior.
Persistence Layer Migration
persist/sqldb/{workflow_archive.go, offload_node_status_repo.go, archived_workflow_labels.go, migrate.go}
Migrates all persistence repository operations from direct db.Session to SessionProxy, updating constructor signatures and internal DB operations to use sessionProxy.With(...) patterns while preserving existing business logic.
Sync Database Layer Migration
util/sync/db/{queries.go, config.go, migrate.go, mocks/SyncQueries.go}
Converts SyncQueries and related database operations from db.Session to *sqldb.SessionProxy, updating all query methods (GetCurrentState, GetPendingInQueue, UpdateStateToHeld, InsertHeldState, etc.) with new session proxy signatures and mock definitions.
Workflow Sync Components Migration
workflow/sync/{sync_manager.go, database_semaphore.go, database_mutex.go, transaction.go}
Replaces db.Session with SessionProxy across lock manager, semaphore, and mutex implementations. NewLockManager now accepts ensureDBConnection flag and returns error; all internal database operations updated to use sessionProxy-based flows.
Server & Controller Integration
server/apiserver/argoserver.go, server/sync/sync_server.go, workflow/controller/{config.go, controller.go}
Integrates SessionProxy creation and usage at initialization points; replaces direct session creation with sqldb.NewSessionProxy configured with KubectlConfig, Namespace, and DBConfig.
Test Updates & Utilities
hack/db/main.go, test/e2e/fixtures/persistence.go, workflow/controller/{operator_test.go, operator_concurrency_test.go}, workflow/sync/{*_test.go}
Updates test fixtures and all test files to use SessionProxy instead of direct sessions; adds error handling for NewLockManager updates; adjusts test database setup and assertions to work with proxy-based session access.
Dependency Management
go.mod
Converts github.com/docker/docker v28.1.1+incompatible and github.com/docker/go-connections v0.5.0 from indirect to direct dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Application<br/>(Query)
    participant SP as SessionProxy
    participant DB as Database
    participant Net as Network

    rect rgb(220, 240, 250)
    Note over Client,Net: Normal Query Flow
    Client->>SP: With(ctx, func(session))
    SP->>SP: Check if in transaction
    SP->>DB: Execute query via session
    DB-->>SP: Result
    SP-->>Client: Success
    end

    rect rgb(250, 240, 220)
    Note over Client,Net: Network Error & Reconnect Flow
    Client->>SP: With(ctx, func(session))
    SP->>DB: Execute query via session
    DB-xNet: Network error occurs
    Net-->>SP: Error (timeout/connection lost)
    SP->>SP: isNetworkError(err) โ†’ true
    SP->>SP: Reconnect(ctx)<br/>with exponential backoff
    SP->>SP: Attempt 1: wait baseDelay ms
    SP->>Net: New connection attempt
    alt Connection succeeds
        Net-->>DB: Connected
        SP->>DB: Retry query
        DB-->>SP: Result
        SP-->>Client: Success (after reconnect)
    else Max retries exhausted
        SP->>SP: Close old session
        SP-->>Client: Error (retry limit reached)
    end
    end

Estimated code review effort

๐ŸŽฏ 4 (Complex) | โฑ๏ธ ~60 minutes

Areas requiring extra attention:

  • util/sqldb/session.go โ€“ Core reconnection logic with network error detection, backoff calculations, and transaction safety; verify exponential backoff formula and network error classification comprehensiveness
  • workflow/sync/sync_manager.go โ€“ Critical control flow change to NewLockManager signature (added ensureDBConnection flag and error return); validate error propagation paths and transaction handling via sessionProxy.TxWith
  • workflow/sync/database_semaphore.go โ€“ Extensive refactoring of semaphore acquire/release logic to use sessionProxy; verify state consistency in concurrent scenarios and query parameter flow through proxy
  • persist/sqldb/workflow_archive.go โ€“ Large surface migration touching ArchiveWorkflow, GetWorkflow, ListWorkflows, and related methods; ensure backward-compatibility labels and error handling are preserved across all paths
  • util/sync/db/queries.go โ€“ All SyncQueries methods updated to use sessionProxy.With(...); cross-check that each database operation maintains proper error propagation and transaction semantics
  • Integration points โ€“ Verify sessionProxy initialization in argoserver.go, sync_server.go, and workflow/controller/config.go handle nil sessions correctly and propagate errors appropriately

Poem

๐Ÿฐ A proxy spun from careful code,
Retries when the network road
Grows weary, backs off with a sigh,
Then reconnects to reach the sky.
Where queries once would simply fail,
Now patience writes a better tale. ๐Ÿ”„โœจ

Pre-merge checks and finishing touches

โŒ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check โš ๏ธ Warning The PR description includes only the template boilerplate with 'Fixes #15011' filled in. All substantive sections (Motivation, Modifications, Verification, Documentation) are empty placeholders marked with TODO comments. Complete the Motivation, Modifications, Verification, and Documentation sections with concrete details about the implementation, testing approach, and user-facing changes.
Docstring Coverage โš ๏ธ Warning Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
โœ… Passed checks (3 passed)
Check name Status Explanation
Title check โœ… Passed The PR title 'feat: retry connections on network errors. Fixes #15011' clearly and concisely describes the main feature addition: implementing automatic retry logic for database connections on network errors.
Linked Issues check โœ… Passed The code changes comprehensively implement the core requirement from #15011: automatic database query retry on network errors via reconnection. SessionProxy with configurable retry parameters, exponential backoff, network error detection, and integration across all persistence layers demonstrates full compliance with the linked objective.
Out of Scope Changes check โœ… Passed Changes are tightly scoped to implement connection retry logic. All modifications (SessionProxy abstraction, retry configuration, error detection, migration updates) directly support the network error resilience objective. No unrelated refactoring or feature creep detected.
โœจ Finishing touches
  • [ ] ๐Ÿ“ Generate docstrings
๐Ÿงช Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Dec 10 '25 13:12 coderabbitai[bot]

@coderabbitai generate docstrings

Joibel avatar Dec 10 '25 13:12 Joibel

โœ… Actions performed

Initiated docstring generation; will generate only if new commits exist.

coderabbitai[bot] avatar Dec 10 '25 13:12 coderabbitai[bot]

[!NOTE] Docstrings generation - SUCCESS Generated docstrings for this pull request at https://github.com/argoproj/argo-workflows/pull/15142

coderabbitai[bot] avatar Dec 10 '25 13:12 coderabbitai[bot]