apiserver-network-proxy
                        fix: set timeout on sendmsg to avoid memory leak
- This PR tries to fix https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/255
- I have found that if the server-agent tunnel hangs, new requests cause a memory leak in the server
- I made the change based on the following discussion: https://github.com/grpc/grpc-go/issues/1229#issuecomment-302755717
- With this change in place, the memory leak no longer happens
- Note: this change does not solve the underlying issue that makes the tunnel hang
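For reference, a minimal sketch of the pattern from that grpc-go discussion, with a placeholder stream type, message type, and timeout value; this is the general idea, not the exact diff in this PR:

```go
// Minimal sketch of a send-with-timeout wrapper, assuming a gRPC server stream
// such as the konnectivity agent tunnel. Not the actual PR code.
package sketch

import (
	"fmt"
	"time"

	"google.golang.org/grpc"
)

// sendWithTimeout runs a potentially blocking SendMsg in its own goroutine and
// returns an error if it does not complete within the given timeout.
func sendWithTimeout(stream grpc.ServerStream, msg interface{}, timeout time.Duration) error {
	// Buffered so the goroutine can always deliver its result, even after the
	// caller has already timed out and stopped listening.
	errCh := make(chan error, 1)

	go func() {
		errCh <- stream.SendMsg(msg)
	}()

	select {
	case err := <-errCh:
		return err
	case <-time.After(timeout):
		// The caller stops waiting; the goroutine blocked in SendMsg only ends
		// once the underlying stream unblocks or is torn down.
		return fmt.Errorf("SendMsg timed out after %v", timeout)
	}
}
```

The leak targeted here comes from frontend goroutines piling up behind a send that never returns; with a timeout they return an error instead of accumulating.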
Hi @mihivagyok. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@mihivagyok are you able to share steps you used to reproduce the memory leak and test your fix? I'd like to test/validate your patch against one of my development clusters
@andrewsykim Will share the steps but let me re-validate the scenario. I let you know once I'm ready!
Thanks, Adam
@andrewsykim
Configuration:
- konnectivity-server is configured with --mode=http-connect and --proxy-strategies=destHost,default
- there is one konnectivity-agent with --agent-identifiers=ipv4=<NODE_IP>
- konnectivity-server has a memory limit of 1 GB
Test:
- use kubectl cp to copy huge files to a test pod (for kubectl cp, the Konnectivity server picks the agent via the destHost strategy - the request matches the agent identifier)
- eventually the kubectl cp fails to finish
- after that, kubectl logs starts failing (like kubectl cp, kubectl logs also makes the Konnectivity server pick the agent via the destHost strategy - the request matches the agent identifier)
- alongside the kubectl cp/logs traffic, there is metrics-server API traffic towards the cluster, for which the Konnectivity server picks a random agent (default strategy)
- to trigger the leak, run thousands of kubectl logs commands - they will fail, but they heavily increase memory/socket usage
- after some time, the server reaches its memory limit and Kubernetes restarts konnectivity-server (a rough shell sketch of these steps is below)
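For completeness, a rough shell sketch of the steps above; the pod name, namespace, label selector, and file size are illustrative and will differ per installation:

```sh
# Server/agent flags used in this scenario (abbreviated; certificates, ports,
# and the 1 GB memory limit are configured elsewhere):
#   konnectivity-server: --mode=http-connect --proxy-strategies=destHost,default
#   konnectivity-agent:  --agent-identifiers=ipv4=<NODE_IP>

# Copy a huge file into a test pod, then hammer kubectl logs to drive the leak.
dd if=/dev/urandom of=/tmp/huge.bin bs=1M count=1024
kubectl cp /tmp/huge.bin default/test-pod:/tmp/huge.bin

for i in $(seq 1 5000); do
  kubectl logs -n default test-pod --tail=10 >/dev/null 2>&1
done

# Watch konnectivity-server memory climb until it hits the limit and restarts.
kubectl top pod -n kube-system -l k8s-app=konnectivity-server
```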
https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/255#issuecomment-947801403
off topic:
I think we have some idea why kubectl cp eventually fails: the issue appears when multiple proxy strategies are configured. Each backend (agent) connection is guarded by a mutex, but with multiple strategies the same backend, with the same connection, gets registered in multiple backend managers. To illustrate this:
BackendManager - DestHost
    - Backend1 (agent identifier, conn, mutex)
BackendManager - Default
    - Backend1* (conn, mutex)
We end up with two backend instances (one registered in the DestHost backend manager, one in the Default backend manager) wrapping the same conn. Different goroutines can then write to the same connection without any protection across the instances (the mutex only guards within one instance), so errors can happen. I could open an issue about this and discuss the theory there.
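A short illustrative sketch of this theory (made-up names, not the actual backend_manager.go code):

```go
// Illustrative only: two per-strategy backend wrappers share one agent stream
// while each owns a separate mutex, so their locks never serialize writes
// against each other.
package sketch

import (
	"sync"

	"google.golang.org/grpc"
)

type backend struct {
	mu   sync.Mutex        // guards Send, but only for this wrapper
	conn grpc.ServerStream // the shared tunnel to the agent
}

func (b *backend) Send(msg interface{}) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.conn.SendMsg(msg) // two wrappers can interleave writes here
}

// register mimics the DestHost and Default backend managers each wrapping the
// same agent stream in their own backend instance.
func register(agentStream grpc.ServerStream) (destHostBackend, defaultBackend *backend) {
	return &backend{conn: agentStream}, &backend{conn: agentStream}
}
```

If one request writes through the DestHost wrapper while another writes through the Default wrapper, the two Sends can interleave on the shared stream even though each wrapper holds its own mutex.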
Thanks, Adam
@mihivagyok I think I was able to reproduce this issue, will try to test your patch to see if it resolves the issue
note: this change does not solve the underlying issue that makes the tunnel hang
Are you able to check the konnectivity_network_proxy_server_grpc_connections metric on the proxy server with this patch?
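For example, something along these lines should surface it; the namespace, resource name, and admin port below are guesses and depend on your deployment:

```sh
# Assumes konnectivity-server runs in kube-system and serves /metrics on its
# admin port; adjust the resource name, namespace, and port to your deployment.
kubectl -n kube-system port-forward deployment/konnectivity-server 8095:8095 &
curl -s http://localhost:8095/metrics | grep konnectivity_network_proxy_server_grpc_connections
```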
@mihivagyok I think I'm able to reproduce this issue now, and it was possible while just using the default proxy strategy. From the goroutine stacktrace, this was the semaphore lock that was blocking goroutines:
goroutine 192765 [semacquire, 125 minutes]:
sync.runtime_SemacquireMutex(0xc00038c094, 0x1428700, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00038c090)
        /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send(0xc00038c090, 0xc003b6fb00, 0x0, 0x0)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:86 +0xb9
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*ProxyServer).serveRecvFrontend(0xc000283b00, 0x190d458, 0xc003a5d930, 0xc0072ddc20)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/server.go:432 +0xacc
created by sigs.k8s.io/apiserver-network-proxy/pkg/server.(*ProxyServer).Proxy
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/server.go:367 +0x318
My steps to reproduce were to mimic some level of backend unavailability.
@andrewsykim That's great news. I think the semaphore is working as designed: one send is blocked, another request comes in, the mutex guards the connection, and the new goroutine gets blocked too.
I think my concern regarding multiple proxy strategies / backend managers is still valid here.
To fix this, I think a single backend manager is needed that can be configured with multiple strategies, so each connection is owned by only one manager. That manager would select from its agents based on the strategies. I think the code could be changed fairly easily to achieve this.
Do you think this is feasible? Thank you!
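A rough sketch of what I have in mind; the names and shapes are made up for illustration, nothing is taken from the current code:

```go
// Rough sketch of the proposal: one backend manager owns every agent
// connection exactly once and walks an ordered list of strategies when
// choosing a backend, so no connection is wrapped by two managers.
package sketch

import "errors"

// backend stands in for the per-agent wrapper (agent ID, stream, mutex).
type backend struct{ agentID string }

// Strategy picks a backend for a request, e.g. by destination host or at random.
type Strategy interface {
	Pick(destHost string, backends map[string]*backend) (*backend, bool)
}

// CombinedBackendManager replaces the per-strategy managers with one registry.
type CombinedBackendManager struct {
	strategies []Strategy          // e.g. destHost first, then default
	backends   map[string]*backend // single registry keyed by agent ID
}

func (m *CombinedBackendManager) Backend(destHost string) (*backend, error) {
	for _, s := range m.strategies {
		if b, ok := s.Pick(destHost, m.backends); ok {
			return b, nil
		}
	}
	return nil, errors.New("no available backend")
}
```

The point is that each agent connection is registered exactly once, so only one mutex ever guards it.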
To fix this, I think a single backend manager is needed that can be configured with multiple strategies, so each connection is owned by only one manager. That manager would select from its agents based on the strategies. I think the code could be changed fairly easily to achieve this.
Do you think this is feasible?
I'm still getting familiar with the codebase so I'm not 100% sure yet. But if you're willing to open the PR it would be helpful for me to understand your proposal better
Btw -- it seems like the goroutine leak caused by the backend connection mutex can happen for multiple reasons. In my specific test it was due to the write quota on the stream; I created a separate issue for that here: https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/335
/cc @cheftako
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mihivagyok
Once this PR has been reviewed and has the lgtm label, please assign anfernee for approval by writing /assign @anfernee in a comment. For more information see: The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@mihivagyok: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.