milvus

fix: Prevent Close from hanging on etcd reconnection

Open weiliu1031 opened this issue 2 weeks ago • 13 comments

issue: #45623

When etcd reconnects, DataCoord rewatches DataNodes and calls ChannelManager.Startup again without closing the previous instance. Each repeated call adds another context and more goroutines, so Close hangs indefinitely waiting for goroutines it no longer tracks.

Root cause:

  • Etcd reconnection triggers rewatch flow and calls Startup again
  • Startup was not idempotent, allowing repeated calls
  • Contexts (with their cancel functions) and goroutines accumulated across calls
  • Close would wait indefinitely for untracked goroutines
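
The failure mode described above can be modeled in isolation. The sketch below is a simplified illustration, not the Milvus code: `leakyManager`, `closeWithin`, and `closesAfter` are hypothetical names, and it assumes that the close path cancels only the most recent context while waiting on every goroutine that was ever started.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// leakyManager is a hypothetical model of the pre-fix lifecycle.
type leakyManager struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// Startup is not idempotent: a second call overwrites m.cancel, so the
// context created by the first call can no longer be cancelled.
func (m *leakyManager) Startup(ctx context.Context) {
	ctx, cancel := context.WithCancel(ctx)
	m.cancel = cancel
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		<-ctx.Done() // stands in for a background loop
	}()
}

// closeWithin cancels the latest context and reports whether every
// goroutine exited before the timeout. A real Close without a timeout
// would block forever in the failing case.
func (m *leakyManager) closeWithin(d time.Duration) bool {
	m.cancel()
	done := make(chan struct{})
	go func() {
		m.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// closesAfter calls Startup n times and then tries to close.
func closesAfter(n int) bool {
	m := &leakyManager{}
	for i := 0; i < n; i++ {
		m.Startup(context.Background())
	}
	return m.closeWithin(200 * time.Millisecond)
}

func main() {
	fmt.Println(closesAfter(1), closesAfter(2))
}
```

With a single Startup the close completes; with two, the goroutine from the first call is never cancelled and the wait times out.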

Changes:

  • Add started field to ChannelManagerImpl
  • Refactor Startup to check and handle restart scenario
  • Add state check in Close to prevent hanging
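
A minimal sketch of the idea behind these changes, under stated assumptions: `manager`, `stopLocked`, `active`, and `restartThenClose` are hypothetical names used for illustration; only `started`, `Startup`, and `Close` come from this PR, and the actual ChannelManagerImpl logic may differ.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// manager is a hypothetical stand-in for ChannelManagerImpl; only the
// lifecycle bookkeeping is modeled.
type manager struct {
	mu      sync.Mutex
	started bool
	cancel  context.CancelFunc
	wg      sync.WaitGroup
	active  atomic.Int32 // number of live background loops
}

// Startup can be called again after a reconnection. If a previous run is
// still active it is stopped first, so only one loop exists at a time.
func (m *manager) Startup(ctx context.Context) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.started {
		m.stopLocked()
	}
	ctx, cancel := context.WithCancel(ctx)
	m.cancel = cancel
	m.started = true
	m.wg.Add(1)
	m.active.Add(1)
	go func() {
		defer m.wg.Done()
		defer m.active.Add(-1)
		<-ctx.Done() // stands in for a background loop
	}()
}

// Close returns immediately when the manager was never started or has
// already been closed.
func (m *manager) Close() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.started {
		return
	}
	m.stopLocked()
}

// stopLocked cancels the current run and waits for it; m.mu must be held.
func (m *manager) stopLocked() {
	m.cancel()
	m.wg.Wait()
	m.started = false
}

// restartThenClose reports the live loop count after a repeated Startup
// and after Close.
func restartThenClose() (afterRestart, afterClose int32) {
	m := &manager{}
	m.Close() // before Startup: no-op
	m.Startup(context.Background())
	m.Startup(context.Background()) // simulated rewatch
	afterRestart = m.active.Load()
	m.Close()
	m.Close() // second Close: no-op
	return afterRestart, m.active.Load()
}

func main() {
	fmt.Println(restartThenClose()) // prints "1 0"
}
```

Holding the mutex across the wait keeps the sketch short; it is safe here only because the background goroutine never takes that mutex.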

Nov 17 '25 08:11 weiliu1031

@weiliu1031 Please associate the related pr of master to the body of your Pull Request. (eg. "pr: #")

Nov 17 '25 08:11 mergify[bot]

[ci-v2-notice] Notice: We are gradually rolling out the new ci-v2 system.

  • Legacy CI jobs remain unaffected, you can just ignore ci-v2 if you don't want to run it.
  • Additional "ci-v2/*" checkers will run for this PR to ensure the new ci-v2 system is working as expected.
  • For tests that exist in both v1 and v2, passing in either system is considered PASS.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-ut-integration // for ci-v2/ut-integration
  • /ci-rerun-ut-go // for ci-v2/ut-go
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm

If you have any questions or requests, please contact @zhikunyao.

Nov 17 '25 08:11 sre-ci-robot

[INFO] PR Label Summary by Default

[WARNING] No dependent PR reference found

  • Target branch '2.5' requires a PR merged to master first
  • Please add reference in format 'pr: #number'

[WARNING] Milestone not set

  • PR: #45622
  • Title: fix: Prevent Close from hanging on etcd reconnection

Please set a milestone for better release tracking.

You can set a milestone by commenting: /set-milestone
Example: /set-milestone 2.5.0

Use /refresh-label to update related check and label manually

Nov 17 '25 08:11 sre-ci-robot

@weiliu1031 Please associate the related issue to the body of your Pull Request. (eg. "issue: #")

Nov 17 '25 08:11 mergify[bot]

/kind branch-feature

Nov 17 '25 08:11 weiliu1031

/set-milestone 2.5.23

Nov 17 '25 08:11 weiliu1031

[INFO] Set milestone to: 2.5.23

Nov 17 '25 08:11 sre-ci-robot

/refresh-label

Nov 17 '25 09:11 weiliu1031

[INFO] PR Label Summary by Refresh-Label

  • Title: fix: Prevent Close from hanging on etcd reconnection
  • Target: 2.5
  • Labels: kind/bug, size/L, dco-passed, kind/branch-feature, do-not-merge/need-merge-master-first

[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)

Use /refresh-label to update related check and label manually

Nov 17 '25 09:11 sre-ci-robot

/ci-rerun-ut-go

Nov 17 '25 10:11 weiliu1031

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 82.05%. Comparing base (3a7a08f) to head (0822d4e).
:warning: Report is 59 commits behind head on 2.5.

Additional details and impacted files

@@            Coverage Diff             @@
##              2.5   #45622      +/-   ##
==========================================
- Coverage   82.10%   82.05%   -0.05%     
==========================================
  Files        1128     1587     +459     
  Lines      179181   248710   +69529     
==========================================
+ Hits       147110   204087   +56977     
- Misses      26099    38618   +12519     
- Partials     5972     6005      +33     

| Components | Coverage Δ |
|---|---|
| Client | 78.90% <22.22%> (-0.06%) :arrow_down: |
| Core | 84.56% <79.54%> (∅) |
| Go | 82.38% <79.16%> (+<0.01%) :arrow_up: |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/datacoord/channel_manager.go | 89.59% <100.00%> (+0.58%) :arrow_up: |
| internal/datacoord/server.go | 74.16% <ø> (+0.25%) :arrow_up: |

... and 509 files with indirect coverage changes

Nov 17 '25 10:11 codecov[bot]

[INFO] PR Label Summary by Default

[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)

Use /refresh-label to update related check and label manually

Nov 18 '25 06:11 sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, weiliu1031

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Nov 19 '25 04:11 sre-ci-robot

[INFO] PR Label Summary by Default

[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)

Use /refresh-label to update related check and label manually

Nov 19 '25 04:11 sre-ci-robot