fix: resolve SessionWatcher goroutine leak and unstable UT in querycoordv2

Open congqixia opened this issue 1 week ago • 10 comments

Related to #44620 and to the unstable unit test "internal/querycoordv2 TestServer/TestNodeUp".

Introduce a SessionWatcher interface to fix the race condition and goroutine leak that made the unit test TestServer/TestNodeUp unstable.

Changes:

  • Add SessionWatcher interface with EventChannel() and Stop() methods
  • Refactor WatchServices() to return SessionWatcher instead of raw channel
  • Fix cleanup order in QueryCoordV2: stop watcher before session
  • Update DataCoord, ConnectionManager to use SessionWatcher
  • Add MockSessionWatcher for testing

Fixes the race condition between session context cancellation and the internal loop exit, and eliminates the goroutine leak by giving the watcher explicit lifecycle management.
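For readers who have not opened the diff, the idea can be illustrated with a minimal sketch. This is not the code from this PR: only the interface name and its two methods, EventChannel() and Stop(), come from the description above. The event type `SessionEvent`, the `chanWatcher` struct, and `newChanWatcher` are hypothetical names chosen for illustration, and the event source is assumed to be a plain channel rather than an etcd watch.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionEvent is a hypothetical stand-in for the event type the real watcher delivers.
type SessionEvent struct {
	NodeID int64
	Up     bool
}

// SessionWatcher mirrors the two methods named in the PR description.
type SessionWatcher interface {
	EventChannel() <-chan SessionEvent
	Stop()
}

// chanWatcher is an illustrative implementation that owns its forwarding goroutine.
type chanWatcher struct {
	events   chan SessionEvent
	stopCh   chan struct{}
	stopOnce sync.Once
	wg       sync.WaitGroup
}

func newChanWatcher(src <-chan SessionEvent) *chanWatcher {
	w := &chanWatcher{
		events: make(chan SessionEvent),
		stopCh: make(chan struct{}),
	}
	w.wg.Add(1)
	go w.loop(src)
	return w
}

// loop forwards events until the source closes or Stop is called.
func (w *chanWatcher) loop(src <-chan SessionEvent) {
	defer w.wg.Done()
	defer close(w.events) // runs before wg.Done, so Stop returns with the channel closed
	for {
		select {
		case <-w.stopCh:
			return
		case ev, ok := <-src:
			if !ok {
				return
			}
			select {
			case w.events <- ev:
			case <-w.stopCh: // do not block forever if nobody is reading
				return
			}
		}
	}
}

func (w *chanWatcher) EventChannel() <-chan SessionEvent { return w.events }

// Stop is idempotent and waits for the goroutine to exit, so no goroutine outlives the watcher.
func (w *chanWatcher) Stop() {
	w.stopOnce.Do(func() { close(w.stopCh) })
	w.wg.Wait()
}

func main() {
	src := make(chan SessionEvent, 1)
	var w SessionWatcher = newChanWatcher(src)

	src <- SessionEvent{NodeID: 1, Up: true}
	fmt.Println(<-w.EventChannel())

	// Stop the watcher first; only then would the owning session be torn down.
	w.Stop()
	_, open := <-w.EventChannel()
	fmt.Println("event channel open after Stop:", open)
}
```

Because Stop() blocks until the forwarding goroutine has returned, a caller that stops the watcher before stopping the session can be sure nothing is still reading session state afterwards, which is the ordering the "stop watcher before session" change describes.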

congqixia avatar Nov 17 '25 11:11 congqixia

[ci-v2-notice] Notice: We are gradually rolling out the new ci-v2 system.

  • Legacy CI jobs remain unaffected, you can just ignore ci-v2 if you don't want to run it.
  • Additional "ci-v2/*" checkers will run for this PR to ensure the new ci-v2 system is working as expected.
  • For tests that exist in both v1 and v2, passing in either system is considered PASS.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-ut-integration // for ci-v2/ut-integration
  • /ci-rerun-ut-go // for ci-v2/ut-go
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm

If you have any questions or requests, please contact @zhikunyao.

sre-ci-robot avatar Nov 17 '25 11:11 sre-ci-robot

/ci-rerun-ut-go

congqixia avatar Nov 17 '25 11:11 congqixia

@congqixia cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 17 '25 11:11 mergify[bot]

/run-cpu-e2e

congqixia avatar Nov 17 '25 11:11 congqixia

Codecov Report

Patch coverage is 95.65217% with 2 lines in your changes missing coverage. Please review.
Project coverage is 76.50%. Comparing base (caed0fe) to head (f6a9728).
Report is 31 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/tools/migration/migration/runner.go | 0.00% | 2 Missing |

@@             Coverage Diff             @@
##           master   #45627       +/-   ##
===========================================
- Coverage   83.18%   76.50%    -6.68%     
===========================================
  Files         521     1875     +1354     
  Lines       81313   292178   +210865     
===========================================
+ Hits        67642   223539   +155897     
- Misses      13671    61240    +47569     
- Partials        0     7399     +7399     
| Components | Coverage Δ |
|---|---|
| Client | 78.17% <ø> (∅) |
| Core | 83.19% <98.38%> (+0.01%) |
| Go | 74.62% <95.52%> (∅) |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/datacoord/server.go | 68.00% <100.00%> (ø) |
| internal/distributed/connection_manager.go | 71.27% <100.00%> (ø) |
| internal/querycoordv2/server.go | 76.03% <100.00%> (ø) |
| internal/util/sessionutil/session_util.go | 75.59% <100.00%> (ø) |
| cmd/tools/migration/migration/runner.go | 0.00% <0.00%> (ø) |

... and 1349 files with indirect coverage changes

codecov[bot] avatar Nov 17 '25 11:11 codecov[bot]

@congqixia cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 17 '25 12:11 mergify[bot]

@congqixia cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 17 '25 18:11 mergify[bot]

@congqixia cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 18 '25 04:11 mergify[bot]

[ci-v2-notice] Notice: We are gradually rolling out the new ci-v2 system.

sre-ci-robot avatar Nov 20 '25 05:11 sre-ci-robot

/ci-rerun-ut-integration

congqixia avatar Nov 21 '25 03:11 congqixia

@congqixia cpu-e2e job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 21 '25 04:11 mergify[bot]

/run-cpu-e2e

congqixia avatar Nov 21 '25 06:11 congqixia

/ci-rerun-ut-integration

congqixia avatar Nov 21 '25 06:11 congqixia

/ci-rerun-ut-integration

congqixia avatar Nov 21 '25 07:11 congqixia

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, liliu-z

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [congqixia,liliu-z]

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

sre-ci-robot avatar Nov 21 '25 10:11 sre-ci-robot