
fix: set timeout on sendmsg to avoid memory leak

Open mihivagyok opened this pull request 3 years ago • 17 comments

  • This PR tries to fix https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/255
  • I have found that if the server-agent tunnel hangs, new requests cause a memory leak in the server
  • I made the change based on the following discussion: https://github.com/grpc/grpc-go/issues/1229#issuecomment-302755717 (a rough sketch of the idea is included below)
  • once this change is in place, the memory leak no longer happens
  • note: this change does not solve the underlying issue that makes the tunnel hang
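
For illustration only, here is a minimal Go sketch of the timeout idea, not the exact patch in this PR; the sender interface and Packet type are stand-ins for the real gRPC stream and proto types:

package main

import (
	"errors"
	"fmt"
	"time"
)

// Packet and sender are stand-ins for the real proto Packet and gRPC stream
// types; they exist only to keep this sketch self-contained.
type Packet struct{ Payload []byte }

type sender interface {
	Send(pkt *Packet) error
}

var errSendTimeout = errors.New("send timed out; backend tunnel appears hung")

// sendWithTimeout runs Send in its own goroutine and gives up after timeout,
// so a hung tunnel cannot pin the calling goroutine (and the packet it holds)
// forever. The buffered channel lets the helper goroutine finish even if the
// caller has already timed out.
func sendWithTimeout(s sender, pkt *Packet, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- s.Send(pkt) }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errSendTimeout
	}
}

// hungSender simulates a backend whose underlying SendMsg never returns.
type hungSender struct{}

func (hungSender) Send(*Packet) error { select {} }

func main() {
	err := sendWithTimeout(hungSender{}, &Packet{Payload: []byte("data")}, 2*time.Second)
	fmt.Println(err) // prints the timeout error after ~2s instead of blocking forever
}

With this pattern the caller is freed after the timeout and can tear down the backend, while the helper goroutine exits on its own whenever the underlying send finally returns.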

mihivagyok avatar Feb 10 '22 08:02 mihivagyok

Hi @mihivagyok. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 10 '22 08:02 k8s-ci-robot

@mihivagyok are you able to share steps you used to reproduce the memory leak and test your fix? I'd like to test/validate your patch against one of my development clusters

andrewsykim avatar Feb 10 '22 19:02 andrewsykim

@andrewsykim Will share the steps, but let me re-validate the scenario first. I'll let you know once I'm ready!

Thanks, Adam

mihivagyok avatar Feb 11 '22 19:02 mihivagyok

@andrewsykim

Configuration:

  • konnectivity-server is configured with --mode=http-connect and --proxy-strategies=destHost,default
  • there is one konnectivity-agent with --agent-identifiers=ipv4=<NODE_IP>
  • the konnectivity-server has a memory limit of 1 GB

Test:

  • use kubectl cp to copy huge files to a test pod (for kubectl cp, the konnectivity-server picks the agent based on the destHost strategy - the request matches the agent identifier)
  • eventually the kubectl cp fails to finish
  • after that, kubectl logs starts failing (for kubectl logs, the konnectivity-server also picks the agent based on the destHost strategy - the request matches the agent identifier)
  • alongside kubectl cp/logs, there is also metrics-server traffic from the API server towards the cluster, and for that the konnectivity-server picks a random agent (default strategy)
  • to trigger the leak, run thousands of kubectl logs commands - they fail, but they heavily increase memory/socket usage
  • after some time, the server reaches its memory limit and Kubernetes restarts the konnectivity-server

https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/255#issuecomment-947801403

off topic: I think we have some idea why kubectl cp eventually fails: the issue arises when we have multiple proxy strategies. The backend (agent) connection is guarded by a mutex. But if there are multiple strategies, the same backend with the same connection gets registered to multiple backend managers. To illustrate this:

BackendManager - DestHost
    - Backend1 (agent identifier, conn, mutex)
BackendManager - Default
    - Backend1* (conn, mutex)

We have two backend instances for the same agent (one registered in the DestHost backend manager, one in the Default one), but with the same conn. In this case, different goroutines may write to the same connection at the same time, because there is no protection between the instances (the mutex only exists within each instance), so errors can happen. I could open an issue about this and discuss the theory there.
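
To make the concern concrete, here is a self-contained Go sketch (the types are illustrative, not the real backend_manager.go ones): two backend wrappers each guard the same underlying stream with their own mutex, so the locks never actually serialize the writes, and go run -race reports a data race.

package main

import (
	"fmt"
	"sync"
)

// stream stands in for the single shared gRPC connection to the agent.
// Its Send is not safe for concurrent use, just like a real gRPC stream.
type stream struct{ writes int }

func (s *stream) Send(msg string) { s.writes++ }

// backend mirrors the per-manager wrapper: it guards conn with its own mutex.
type backend struct {
	mu   sync.Mutex
	conn *stream
}

func (b *backend) Send(msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.conn.Send(msg)
}

func main() {
	shared := &stream{}
	destHostBackend := &backend{conn: shared} // Backend1 registered by the DestHost manager
	defaultBackend := &backend{conn: shared}  // Backend1* registered by the Default manager

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); destHostBackend.Send("a") }()
		go func() { defer wg.Done(); defaultBackend.Send("b") }()
	}
	wg.Wait()

	// Each Send held *a* mutex, but never the *same* mutex, so the shared
	// stream saw concurrent writes; run with -race to see the data race.
	fmt.Println("writes recorded:", shared.writes)
}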

Thanks, Adam

mihivagyok avatar Feb 14 '22 15:02 mihivagyok

@mihivagyok I think I was able to reproduce this issue, will try to test your patch to see if it resolves the issue

andrewsykim avatar Feb 16 '22 19:02 andrewsykim

note: this change does not solve the underlying issue that makes the tunnel hang

Are you able to check the konnectivity_network_proxy_server_grpc_connections metric on the proxy server with this patch?

andrewsykim avatar Feb 16 '22 19:02 andrewsykim

@mihivagyok I think I'm able to reproduce this issue now, and it was possible using just the default proxy strategy. From the goroutine stack trace, this was the semaphore lock blocking the goroutines:

goroutine 192765 [semacquire, 125 minutes]:
sync.runtime_SemacquireMutex(0xc00038c094, 0x1428700, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00038c090)
        /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send(0xc00038c090, 0xc003b6fb00, 0x0, 0x0)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:86 +0xb9
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*ProxyServer).serveRecvFrontend(0xc000283b00, 0x190d458, 0xc003a5d930, 0xc0072ddc20)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/server.go:432 +0xacc
created by sigs.k8s.io/apiserver-network-proxy/pkg/server.(*ProxyServer).Proxy
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/server.go:367 +0x318

My steps to reproduce were to mimic some level of backend unavailability.
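
A minimal, self-contained Go sketch of that failure mode (names are illustrative, not the real server code): Send holds a mutex across a blocking stream write, so once one write hangs, every later request goroutine parks in mu.Lock(), which is the semacquire state in the dump above.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// backend mimics a mutex-guarded Send; the real stream write is replaced by
// an indefinite block to simulate a hung tunnel.
type backend struct {
	mu sync.Mutex
}

func (b *backend) Send() {
	b.mu.Lock()
	defer b.mu.Unlock()
	select {} // stand-in for conn.Send(pkt) never returning
}

func main() {
	b := &backend{}
	for i := 0; i < 1000; i++ { // 1000 proxied requests against the hung backend
		go b.Send()
	}
	time.Sleep(time.Second)
	// One goroutine is stuck inside the fake stream write; the other 999 are
	// parked on the mutex, and none of that memory is ever released.
	fmt.Println("blocked goroutines:", runtime.NumGoroutine()-1)
}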

andrewsykim avatar Feb 24 '22 01:02 andrewsykim

@andrewsykim That's great news. I think the semaphore is working as designed: one send blocks, then another request comes in, the mutex guards the connection, and that goroutine gets blocked as well.

I think my concern regarding multiple proxy strategies / backend managers is still valid here.

To fix this, I think a single backend manager is needed, one that can be configured with multiple strategies - so each connection would be used in only one manager. That single backend manager would select from its agents based on the strategies. I think the code could be changed fairly easily to achieve this.
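
A rough sketch of what that could look like (the interfaces and names here are hypothetical, not the existing BackendManager API): a single manager owns each agent connection exactly once, and the configured strategies only select among them.

package main

import (
	"errors"
	"fmt"
)

// Backend is the single per-agent object owned by the manager.
type Backend struct{ AgentID string }

// strategy picks a backend for a request, or reports that it cannot.
type strategy interface {
	Pick(destHost string, backends []*Backend) (*Backend, bool)
}

// destHostStrategy matches a backend whose agent identifier equals the
// request's destination host.
type destHostStrategy struct{}

func (destHostStrategy) Pick(destHost string, backends []*Backend) (*Backend, bool) {
	for _, b := range backends {
		if b.AgentID == destHost {
			return b, true
		}
	}
	return nil, false
}

// defaultStrategy falls back to any available backend.
type defaultStrategy struct{}

func (defaultStrategy) Pick(_ string, backends []*Backend) (*Backend, bool) {
	if len(backends) == 0 {
		return nil, false
	}
	return backends[0], true
}

// manager is the single owner of the backends; strategies only select, they
// never hold their own wrapper around a connection.
type manager struct {
	backends   []*Backend
	strategies []strategy
}

func (m *manager) Backend(destHost string) (*Backend, error) {
	for _, s := range m.strategies {
		if b, ok := s.Pick(destHost, m.backends); ok {
			return b, nil
		}
	}
	return nil, errors.New("no backend available")
}

func main() {
	m := &manager{
		backends:   []*Backend{{AgentID: "10.0.0.7"}, {AgentID: "10.0.0.8"}},
		strategies: []strategy{destHostStrategy{}, defaultStrategy{}}, // mirrors --proxy-strategies=destHost,default
	}
	b, _ := m.Backend("10.0.0.8")
	fmt.Println("picked agent:", b.AgentID) // destHost match; default is used only when there is none
}

Because strategies never hold their own copy of a connection, one mutex per backend would then be enough to serialize writes to that connection.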

Do you think this is feasible or not? Thank you!

mihivagyok avatar Feb 25 '22 12:02 mihivagyok

To fix this, I think a single backend manager is needed, one that can be configured with multiple strategies - so each connection would be used in only one manager. That single backend manager would select from its agents based on the strategies. I think the code could be changed fairly easily to achieve this.

Do you think this is feasible or not?

I'm still getting familiar with the codebase, so I'm not 100% sure yet. But if you're willing to open a PR, it would help me understand your proposal better.

andrewsykim avatar Feb 25 '22 17:02 andrewsykim

Btw -- it seems like the goroutine leak caused by the backend connection mutex can happen for multiple reasons. In my specific test, it was due to the write quota on the stream; I created a separate issue for that here: https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/335

andrewsykim avatar Feb 25 '22 17:02 andrewsykim

/cc @cheftako

andrewsykim avatar Feb 25 '22 21:02 andrewsykim

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 05 '22 16:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 05 '22 17:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Aug 04 '22 17:08 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 04 '22 17:08 k8s-ci-robot

/remove-lifecycle rotten

cheftako avatar Sep 09 '22 22:09 cheftako

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mihivagyok. Once this PR has been reviewed and has the lgtm label, please assign anfernee for approval by writing /assign @anfernee in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Sep 09 '22 22:09 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 08 '22 22:12 k8s-triage-robot

@mihivagyok: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 09 '22 10:12 k8s-ci-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 08 '23 10:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Feb 07 '23 11:02 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 07 '23 11:02 k8s-ci-robot