enhance: optimize CPU usage for CheckHealth requests

Open jaime0815 opened this issue 1 year ago • 10 comments

issue: #35563

  1. Use an internal health checker to monitor the cluster's health state and store the latest state on the coordinator node. A CheckHealth request issued from the proxy side retrieves the cluster's health from this latest state, which improves cluster stability.
  2. Each health check assesses all collections and channels; detailed failure messages are kept temporarily in the latest state.
  3. Use the CheckHealth request instead of the heavy GetMetrics request on the querynode and datanode.
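
As a rough illustration of the first item, the sketch below shows one way a background checker could cache the latest health state so that request handlers only read a snapshot. It is not the code from this PR: `State`, `CheckFunc`, `Checker`, `Refresh`, `Start`, and `Latest` are hypothetical names chosen for the example, and the check function is a stand-in for whatever per-collection and per-channel checks the coordinator actually runs.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// State is a hypothetical snapshot of the most recent health check.
type State struct {
	Healthy   bool
	Reasons   []string // failure messages from the last check
	CheckedAt time.Time
}

// CheckFunc runs one full (expensive) health pass and returns failure messages.
type CheckFunc func() []string

// Checker runs CheckFunc in the background and caches the latest State.
type Checker struct {
	mu       sync.RWMutex
	latest   State
	check    CheckFunc
	interval time.Duration
}

func NewChecker(interval time.Duration, check CheckFunc) *Checker {
	return &Checker{check: check, interval: interval}
}

// Refresh runs the expensive check once and stores the result.
func (c *Checker) Refresh() {
	reasons := c.check()
	s := State{Healthy: len(reasons) == 0, Reasons: reasons, CheckedAt: time.Now()}
	c.mu.Lock()
	c.latest = s
	c.mu.Unlock()
}

// Start refreshes immediately, then on every tick until ctx is cancelled.
func (c *Checker) Start(ctx context.Context) {
	c.Refresh()
	go func() {
		ticker := time.NewTicker(c.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				c.Refresh()
			}
		}
	}()
}

// Latest returns the cached state without running any check.
// Before the first Refresh it returns the zero State (Healthy == false).
func (c *Checker) Latest() State {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.latest
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Stand-in check that always reports one failing channel.
	c := NewChecker(time.Second, func() []string {
		return []string{"channel ch-1 is lagging"}
	})
	c.Start(ctx)

	s := c.Latest()
	fmt.Println(s.Healthy, s.Reasons)
}
```

In this sketch the cost of a health request no longer depends on the number of collections or channels, since a handler would only call `Latest()`; the trade-off is that the reported state can be up to one interval old.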

jaime0815 avatar Aug 20 '24 08:08 jaime0815

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Aug 20 '24 10:08 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Aug 23 '24 04:08 mergify[bot]

/run-cpu-e2e

jaime0815 avatar Aug 23 '24 06:08 jaime0815

Codecov Report

Attention: Patch coverage is 86.36364% with 54 lines in your changes missing coverage. Please review.

Project coverage is 80.92%. Comparing base (9c8c1b3) to head (d6f6ebf). Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/util/healthcheck/checker.go | 74.46% | 29 Missing and 7 partials :warning: |
| internal/querycoordv2/session/cluster.go | 75.00% | 2 Missing and 1 partial :warning: |
| internal/querynodev2/metrics_info.go | 62.50% | 2 Missing and 1 partial :warning: |
| pkg/util/merr/utils.go | 40.00% | 2 Missing and 1 partial :warning: |
| internal/datacoord/session/datanode_manager.go | 92.59% | 2 Missing :warning: |
| internal/util/mock/grpc_datanode_client.go | 0.00% | 2 Missing :warning: |
| internal/util/mock/grpc_querynode_client.go | 0.00% | 2 Missing :warning: |
| internal/util/wrappers/qn_wrapper.go | 0.00% | 2 Missing :warning: |
| internal/querycoordv2/utils/util.go | 87.50% | 1 Missing :warning: |

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #35589      +/-   ##
==========================================
+ Coverage   80.89%   80.92%   +0.03%     
==========================================
  Files        1373     1374       +1     
  Lines      193162   193362     +200     
==========================================
+ Hits       156264   156485     +221     
+ Misses      31369    31361       -8     
+ Partials     5529     5516      -13     

| Components | Coverage Δ |
|---|---|
| Client | 74.58% <ø> (ø) |
| Core | 68.97% <ø> (ø) |
| Go | 83.02% <86.36%> (+0.03%) :arrow_up: |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/datacoord/server.go | 73.40% <100.00%> (+0.17%) :arrow_up: |
| internal/datacoord/services.go | 85.49% <100.00%> (+0.03%) :arrow_up: |
| internal/datacoord/util.go | 98.68% <100.00%> (ø) |
| internal/datanode/metrics_info.go | 96.20% <100.00%> (ø) |
| internal/datanode/services.go | 85.48% <100.00%> (+0.47%) :arrow_up: |
| internal/distributed/datanode/client/client.go | 89.93% <100.00%> (+0.25%) :arrow_up: |
| internal/distributed/datanode/service.go | 82.64% <100.00%> (+0.14%) :arrow_up: |
| internal/distributed/querynode/client/client.go | 91.70% <100.00%> (+0.14%) :arrow_up: |
| internal/distributed/querynode/service.go | 83.71% <100.00%> (+0.14%) :arrow_up: |
| ...nternal/flushcommon/pipeline/flow_graph_manager.go | 92.07% <100.00%> (+0.87%) :arrow_up: |

... and 19 more

... and 23 files with indirect coverage changes

codecov[bot] avatar Aug 24 '24 14:08 codecov[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Aug 24 '24 17:08 mergify[bot]

/run-cpu-e2e

jaime0815 avatar Aug 25 '24 01:08 jaime0815

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Sep 03 '24 09:09 mergify[bot]

/run-cpu-e2e

jaime0815 avatar Sep 03 '24 12:09 jaime0815

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Oct 05 '24 08:10 stale[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Oct 28 '24 12:10 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Oct 31 '24 06:10 mergify[bot]

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Oct 31 '24 06:10 mergify[bot]

@jaime0815 cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

mergify[bot] avatar Oct 31 '24 06:10 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Oct 31 '24 09:10 mergify[bot]

/run-cpu-e2e

jaime0815 avatar Oct 31 '24 13:10 jaime0815

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 01 '24 01:11 mergify[bot]

/run-cpu-e2e

jaime0815 avatar Nov 01 '24 04:11 jaime0815

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 01 '24 04:11 mergify[bot]

/run-cpu-e2e

jaime0815 avatar Nov 04 '24 08:11 jaime0815

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Nov 04 '24 08:11 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Dec 04 '24 11:12 mergify[bot]

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Dec 04 '24 12:12 mergify[bot]

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Dec 04 '24 13:12 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Dec 09 '24 03:12 mergify[bot]

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Dec 13 '24 09:12 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Dec 13 '24 09:12 mergify[bot]

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Dec 16 '24 02:12 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Dec 16 '24 02:12 mergify[bot]

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify[bot] avatar Dec 16 '24 08:12 mergify[bot]

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

mergify[bot] avatar Dec 16 '24 08:12 mergify[bot]