etcd
etcd copied to clipboard
etcdserver: terminate recvLoop on serverWatchStream.close()
Under some conditions serverWatchStream.close() leaves recvLoop goroutine blocked on sending data to ctrlStream channel:
goroutine profile: total 177241
43832 @ 0x43fe6e 0x40a6a5 0x40a2f7 0xd9bb8d 0xd9af66 0x473181
# 0xd9bb8c go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*serverWatchStream).recvLoop+0x70c external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:348
# 0xd9af65 go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*watchServer).Watch.func2+0x45 external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:191
corresponding code: https://github.com/etcd-io/etcd/blob/40b4715ca371c950c1965cd79519f5f0f9a57921/server/etcdserver/api/v3rpc/watch.go#L349-L360
Reading from the ctrlStream channel is implemented in sendLoop, which is terminated on closec closure:
https://github.com/etcd-io/etcd/blob/40b4715ca371c950c1965cd79519f5f0f9a57921/server/etcdserver/api/v3rpc/watch.go#L529-L531
Fixes https://github.com/etcd-io/etcd/issues/18704
Hi @veshij. Thanks for your PR.
I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
:warning: Please install the to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
Attention: Patch coverage is 46.51163% with 46 lines in your changes missing coverage. Please review.
Project coverage is 68.75%. Comparing base (
04efee2) to head (1168125). Report is 35 commits behind head on main.
:exclamation: Current head 1168125 differs from pull request most recent head 7760735
Please upload reports for the commit 7760735 to get more accurate results.
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
| Files with missing lines | Coverage Δ | |
|---|---|---|
| client/v3/lease.go | 90.87% <100.00%> (ø) |
|
| client/v3/retry.go | 79.48% <100.00%> (+0.17%) |
:arrow_up: |
| client/v3/retry_interceptor.go | 65.61% <100.00%> (ø) |
|
| client/v3/watch.go | 94.23% <100.00%> (+0.39%) |
:arrow_up: |
| pkg/expect/expect.go | 79.12% <100.00%> (ø) |
|
| pkg/featuregate/feature_gate.go | 87.66% <100.00%> (ø) |
|
| pkg/flags/flag.go | 68.57% <100.00%> (ø) |
|
| server/etcdserver/api/v3rpc/watch.go | 84.06% <100.00%> (-1.07%) |
:arrow_down: |
| client/v3/client.go | 84.93% <80.00%> (+0.04%) |
:arrow_up: |
| client/v3/experimental/recipes/key.go | 75.34% <50.00%> (ø) |
|
| ... and 12 more |
... and 22 files with indirect coverage changes
@@ Coverage Diff @@
## main #18739 +/- ##
==========================================
+ Coverage 68.74% 68.75% +0.01%
==========================================
Files 420 420
Lines 35488 35508 +20
==========================================
+ Hits 24395 24415 +20
- Misses 9659 9664 +5
+ Partials 1434 1429 -5
Continue to review full report in Codecov by Sentry.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 04efee2...7760735. Read the comment docs.
/ok-to-test
Can you provide a test?
Can you provide a test?
tbh I'n not quite sure yet how to trigger this particular issue in a test env.
Let me look into preparing a test.
Can you provide a test?
tbh I'n not quite sure yet how to trigger this particular issue in a test env.
I think the e2e test should just mimic how your production run etcd, and verify the result by comparing the metrics pointed out in https://github.com/etcd-io/etcd/issues/18704#issuecomment-2419885372
@veshij since this issue is a little hard to reproduce & verify in normal e2e or integration test, and the fix is simple & safe, so it's accepted to approve & merge the fix firstly.
The solution of adding two metrics sendLoopCount and recvLoopCount as pointed out in https://github.com/etcd-io/etcd/issues/18704#issuecomment-2419885372 is doable, but it's a little weird to expose such two internal metrics to users, so we may not want to proceed with that approach. The other solution is to calculate the count of goroutine using pprof, but we need to investigate how to implement it and ensure it's more generic and reusable. We can revisit the test later.
Please signoff the commit. And also can you please backport the fix to 3.5 and 3.4? Thanks
Please signoff the commit.
Please read https://github.com/etcd-io/etcd/pull/18739/checks?check_run_id=31525554356
signed off. Will backport to 3.4 and 3.5
I did couple of attempts on writing a test, but they all had some major issue. I think this change by itself is correct as was confirmed in https://github.com/etcd-io/etcd/issues/18704#issuecomment-2417418078 to prevent goroutine leak.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahrtr, serathius, veshij
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [ahrtr,serathius]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
backport to 3.4: https://github.com/etcd-io/etcd/pull/18785 backport to 3.5: https://github.com/etcd-io/etcd/pull/18784
Thanks. Please also update the changelogs for both 3.4 and 3.5