milvus
fix: Prevent Close from hanging on etcd reconnection
issue: #45623
When etcd reconnects, DataCoord rewatches the DataNodes and calls ChannelManager.Startup again without closing the previous instance. Each repeated call leaves another context and another set of goroutines behind, so Close hangs indefinitely while waiting for goroutines that are no longer tracked.
Root cause:
- Etcd reconnection triggers rewatch flow and calls Startup again
- Startup was not idempotent, allowing repeated calls
- Each repeated call added another context (with its cancel function) and more goroutines
- Close would wait indefinitely for untracked goroutines
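The failure mode in the list above can be reproduced with a small sketch. This is not the Milvus code: `leakyManager`, its argument-free `Startup`, and the `closesWithin` helper are hypothetical names, and the assumption is that each `Startup` call overwrote the stored cancel function while adding to a shared WaitGroup.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// leakyManager is a hypothetical stand-in for the pre-fix behaviour.
// It is not the real ChannelManagerImpl.
type leakyManager struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// Startup is not idempotent: every call overwrites cancel and starts
// another goroutine that only the overwritten cancel func could stop.
func (m *leakyManager) Startup() {
	ctx, cancel := context.WithCancel(context.Background())
	m.cancel = cancel // the previous cancel func, if any, is lost here
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		<-ctx.Done() // stands in for a background loop
	}()
}

// Close cancels only the most recent context, then waits for every goroutine.
func (m *leakyManager) Close() {
	if m.cancel != nil {
		m.cancel()
	}
	m.wg.Wait()
}

// closesWithin reports whether Close returns within ms milliseconds.
func closesWithin(m *leakyManager, ms int) bool {
	done := make(chan struct{})
	go func() {
		m.Close()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Duration(ms) * time.Millisecond):
		return false
	}
}

func main() {
	once := &leakyManager{}
	once.Startup()
	fmt.Println("single Startup, Close returns:", closesWithin(once, 1000))

	twice := &leakyManager{}
	twice.Startup()
	twice.Startup() // simulates the rewatch after an etcd reconnection
	fmt.Println("double Startup, Close returns:", closesWithin(twice, 200))
}
```

With a single `Startup`, `Close` returns promptly. After a second `Startup`, the first goroutine's context can no longer be cancelled, so `wg.Wait()` never returns and the helper times out.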
Changes:
- Add started field to ChannelManagerImpl
- Refactor Startup to check and handle restart scenario
- Add state check in Close to prevent hanging
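A minimal sketch of the pattern these changes describe, assuming a boolean `started` field guarded by a mutex. The `manager` type, the argument-free `Startup`, and the `stopLocked` and `Started` helpers are hypothetical simplifications, and the actual change in `internal/datacoord/channel_manager.go` may differ in detail.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// manager is a hypothetical, simplified stand-in for ChannelManagerImpl.
type manager struct {
	mu      sync.Mutex
	started bool
	cancel  context.CancelFunc
	wg      sync.WaitGroup
}

// Startup is safe to call repeatedly: a repeated call first shuts down the
// previous run, so only one context and one set of goroutines exist at a time.
func (m *manager) Startup() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.started {
		m.stopLocked() // restart scenario, e.g. after an etcd reconnection
	}
	ctx, cancel := context.WithCancel(context.Background())
	m.cancel = cancel
	m.started = true
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		<-ctx.Done() // stands in for a background loop
	}()
}

// Close returns immediately unless the manager is currently running.
func (m *manager) Close() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.started {
		return
	}
	m.stopLocked()
}

// stopLocked cancels the current context and waits for its goroutines.
// The caller must hold m.mu; the background goroutine never takes m.mu.
func (m *manager) stopLocked() {
	m.cancel()
	m.wg.Wait()
	m.started = false
}

// Started reports whether the manager is currently running.
func (m *manager) Started() bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.started
}

func main() {
	m := &manager{}
	m.Startup()
	m.Startup() // repeated call tears down the first run before starting again
	m.Close()
	m.Close() // second Close is a no-op
	fmt.Println("started after Close:", m.Started())
}
```

Because every goroutine counted by the WaitGroup belongs to the one live context, `Close` always has a cancel function that reaches them, and the state check keeps a `Close` before `Startup` (or a second `Close`) from waiting at all.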
@weiliu1031 Please associate the related pr of master to the body of your Pull Request. (eg. "pr: #
[ci-v2-notice] Notice: We are gradually rolling out the new ci-v2 system.
- Legacy CI jobs remain unaffected, you can just ignore ci-v2 if you don't want to run it.
- Additional "ci-v2/*" checkers will run for this PR to ensure the new ci-v2 system is working as expected.
- For tests that exist in both v1 and v2, passing in either system is considered PASS.
To rerun ci-v2 checks, comment with:
- /ci-rerun-code-check // for ci-v2/code-check
- /ci-rerun-build // for ci-v2/build
- /ci-rerun-ut-integration // for ci-v2/ut-integration
- /ci-rerun-ut-go // for ci-v2/ut-go
- /ci-rerun-ut-cpp // for ci-v2/ut-cpp
- /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp
- /ci-rerun-e2e-arm // for ci-v2/e2e-arm
If you have any questions or requests, please contact @zhikunyao.
[INFO] PR Label Summary by Default
[WARNING] No dependent PR reference found
- Target branch '2.5' requires a PR merged to master first
- Please add reference in format 'pr: #number'
[WARNING] Milestone not set
- PR: #45622
- Title: fix: Prevent Close from hanging on etcd reconnection
Please set a milestone for better release tracking
You can set milestone by commenting:
/set-milestone
Use /refresh-label to update related check and label manually
@weiliu1031 Please associate the related issue to the body of your Pull Request. (eg. "issue: #
/kind branch-feature
/set-milestone 2.5.23
[INFO] Set milestone to: 2.5.23
/refresh-label
[INFO] PR Label Summary by Refresh-Label
- Title: fix: Prevent Close from hanging on etcd reconnection
- Target: 2.5
- Labels: kind/bug, size/L, dco-passed, kind/branch-feature, do-not-merge/need-merge-master-first
[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)
/ci-rerun-ut-go
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 82.05%. Comparing base (3a7a08f) to head (0822d4e).
:warning: Report is 59 commits behind head on 2.5.
Additional details and impacted files
@@ Coverage Diff @@
## 2.5 #45622 +/- ##
==========================================
- Coverage 82.10% 82.05% -0.05%
==========================================
Files 1128 1587 +459
Lines 179181 248710 +69529
==========================================
+ Hits 147110 204087 +56977
- Misses 26099 38618 +12519
- Partials 5972 6005 +33
| Components | Coverage Δ | |
|---|---|---|
| Client | 78.90% <22.22%> (-0.06%) | :arrow_down: |
| Core | 84.56% <79.54%> (∅) | |
| Go | 82.38% <79.16%> (+<0.01%) | :arrow_up: |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| internal/datacoord/channel_manager.go | 89.59% <100.00%> (+0.58%) | :arrow_up: |
| internal/datacoord/server.go | 74.16% <ø> (+0.25%) | :arrow_up: |
[INFO] PR Label Summary by Default
[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: congqixia, weiliu1031
- ~~internal/datacoord/OWNERS~~ [congqixia]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment