feat: retry connections on network errors. Fixes #15011
Fixes #15011
Motivation
Modifications
Verification
Documentation
Not sure how to test this unfortunately, any recommendations here? Maybe spin up test containers, spin down postgres and then spin it up again?
EDIT: I have a testing solution, going to write up a test logger to collect logs, spin down postgres, witness the logs, spin up postgres and witness the logs indicating recovery.
@coderabbitai full review
โ Actions performed
Full review triggered.
[!NOTE]
Other AI code review bot(s) detected
CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
This pull request implements database query resilience by introducing a SessionProxy abstraction that automatically reconnects and retries failed queries on network errors. The feature includes configurable retry parameters (max retries, backoff delays, multiplier), network error detection, and integration across persistence, sync, and controller layers.
Changes
| Cohort / File(s) | Summary |
|---|---|
Session Proxy Core Infrastructure util/sqldb/session.go, util/sqldb/session_test.go |
Introduces SessionProxy type with automatic reconnection support, exponential backoff retry logic, network error detection, and methods (Tx, TxWith, With, Reconnect, Session, Close) for safe DB operations. Includes test coverage for PostgreSQL container-based reconnection scenarios. |
Configuration & Documentation .features/pending/feat-refresh-connection.md, config/config.go, docs/workflow-controller-configmap.md |
Adds DBReconnectConfig struct with retry parameters (MaxRetries, BaseDelaySeconds, MaxDelaySeconds, RetryMultiple) to DBConfig, PersistConfig, and SyncConfig. Includes feature documentation describing reconnect and retry behavior. |
Persistence Layer Migration persist/sqldb/{workflow_archive.go, offload_node_status_repo.go, archived_workflow_labels.go, migrate.go} |
Migrates all persistence repository operations from direct db.Session to SessionProxy, updating constructor signatures and internal DB operations to use sessionProxy.With(...) patterns while preserving existing business logic. |
Sync Database Layer Migration util/sync/db/{queries.go, config.go, migrate.go, mocks/SyncQueries.go} |
Converts SyncQueries and related database operations from db.Session to *sqldb.SessionProxy, updating all query methods (GetCurrentState, GetPendingInQueue, UpdateStateToHeld, InsertHeldState, etc.) with new session proxy signatures and mock definitions. |
Workflow Sync Components Migration workflow/sync/{sync_manager.go, database_semaphore.go, database_mutex.go, transaction.go} |
Replaces db.Session with SessionProxy across lock manager, semaphore, and mutex implementations. NewLockManager now accepts ensureDBConnection flag and returns error; all internal database operations updated to use sessionProxy-based flows. |
Server & Controller Integration server/apiserver/argoserver.go, server/sync/sync_server.go, workflow/controller/{config.go, controller.go} |
Integrates SessionProxy creation and usage at initialization points; replaces direct session creation with sqldb.NewSessionProxy configured with KubectlConfig, Namespace, and DBConfig. |
Test Updates & Utilities hack/db/main.go, test/e2e/fixtures/persistence.go, workflow/controller/{operator_test.go, operator_concurrency_test.go}, workflow/sync/{*_test.go} |
Updates test fixtures and all test files to use SessionProxy instead of direct sessions; adds error handling for NewLockManager updates; adjusts test database setup and assertions to work with proxy-based session access. |
Dependency Management go.mod |
Converts github.com/docker/docker v28.1.1+incompatible and github.com/docker/go-connections v0.5.0 from indirect to direct dependencies. |
Sequence Diagram(s)
sequenceDiagram
participant Client as Application<br/>(Query)
participant SP as SessionProxy
participant DB as Database
participant Net as Network
rect rgb(220, 240, 250)
Note over Client,Net: Normal Query Flow
Client->>SP: With(ctx, func(session))
SP->>SP: Check if in transaction
SP->>DB: Execute query via session
DB-->>SP: Result
SP-->>Client: Success
end
rect rgb(250, 240, 220)
Note over Client,Net: Network Error & Reconnect Flow
Client->>SP: With(ctx, func(session))
SP->>DB: Execute query via session
DB-xNet: Network error occurs
Net-->>SP: Error (timeout/connection lost)
SP->>SP: isNetworkError(err) โ true
SP->>SP: Reconnect(ctx)<br/>with exponential backoff
SP->>SP: Attempt 1: wait baseDelay ms
SP->>Net: New connection attempt
alt Connection succeeds
Net-->>DB: Connected
SP->>DB: Retry query
DB-->>SP: Result
SP-->>Client: Success (after reconnect)
else Max retries exhausted
SP->>SP: Close old session
SP-->>Client: Error (retry limit reached)
end
end
Estimated code review effort
๐ฏ 4 (Complex) | โฑ๏ธ ~60 minutes
Areas requiring extra attention:
- util/sqldb/session.go โ Core reconnection logic with network error detection, backoff calculations, and transaction safety; verify exponential backoff formula and network error classification comprehensiveness
- workflow/sync/sync_manager.go โ Critical control flow change to NewLockManager signature (added ensureDBConnection flag and error return); validate error propagation paths and transaction handling via sessionProxy.TxWith
- workflow/sync/database_semaphore.go โ Extensive refactoring of semaphore acquire/release logic to use sessionProxy; verify state consistency in concurrent scenarios and query parameter flow through proxy
- persist/sqldb/workflow_archive.go โ Large surface migration touching ArchiveWorkflow, GetWorkflow, ListWorkflows, and related methods; ensure backward-compatibility labels and error handling are preserved across all paths
- util/sync/db/queries.go โ All SyncQueries methods updated to use sessionProxy.With(...); cross-check that each database operation maintains proper error propagation and transaction semantics
- Integration points โ Verify sessionProxy initialization in argoserver.go, sync_server.go, and workflow/controller/config.go handle nil sessions correctly and propagate errors appropriately
Poem
๐ฐ A proxy spun from careful code,
Retries when the network road
Grows weary, backs off with a sigh,
Then reconnects to reach the sky.
Where queries once would simply fail,
Now patience writes a better tale. ๐โจ
Pre-merge checks and finishing touches
โ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | โ ๏ธ Warning | The PR description includes only the template boilerplate with 'Fixes #15011' filled in. All substantive sections (Motivation, Modifications, Verification, Documentation) are empty placeholders marked with TODO comments. | Complete the Motivation, Modifications, Verification, and Documentation sections with concrete details about the implementation, testing approach, and user-facing changes. |
| Docstring Coverage | โ ๏ธ Warning | Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
โ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | โ Passed | The PR title 'feat: retry connections on network errors. Fixes #15011' clearly and concisely describes the main feature addition: implementing automatic retry logic for database connections on network errors. |
| Linked Issues check | โ Passed | The code changes comprehensively implement the core requirement from #15011: automatic database query retry on network errors via reconnection. SessionProxy with configurable retry parameters, exponential backoff, network error detection, and integration across all persistence layers demonstrates full compliance with the linked objective. |
| Out of Scope Changes check | โ Passed | Changes are tightly scoped to implement connection retry logic. All modifications (SessionProxy abstraction, retry configuration, error detection, migration updates) directly support the network error resilience objective. No unrelated refactoring or feature creep detected. |
โจ Finishing touches
- [ ] ๐ Generate docstrings
๐งช Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
Comment @coderabbitai help to get the list of available commands and usage tips.
@coderabbitai generate docstrings
โ Actions performed
Initiated docstring generation; will generate only if new commits exist.
[!NOTE] Docstrings generation - SUCCESS Generated docstrings for this pull request at https://github.com/argoproj/argo-workflows/pull/15142