apiserver-network-proxy Proxy() goroutine leaks when serveRecvFrontend() finishes without receiving EOF

While working on #276 I saw some possible issues that weren't hit in practice with my repro, but that seems worth highlighting and probably fixing. I'm opening this issue to know what maintainers think of this.

The thing is that Proxy starts a goroutine for serveRecvFrontend() and just wait until it receives something on the stopCh. Only the goroutine created in-place sends to that stop channel, and that happens only when we receive an error or EOF on the stream. If that doesn't happen, we just wait indefinitely.

I have not made a detailed analysis yet, but it seems likely that if we hit this return in serveRecvFrontend (which is not triggered by a stream EOF or read error), Proxy will run indefinitely. I lack the knowledge to know if some higher-level thing will close this connection if the DIAL_REQ fails after some time or something. But if you can shed some light here, it will be great.

Then, there are other similar cases that might result in a similar "Proxy() not finishing" behavior. I had some notes to investigate this in the past (but left it for later, as it was only theoretical and not the leak we were trying to fix in #276 ) but PR #329 might make it more pressing to see if that is the case now. See the comment I left there: https://github.com/kubernetes-sigs/apiserver-network-proxy/pull/329#discussion_r813817988. Note this is the case not just for that line, but for the other in the PR too. I also, here, lack the knowledge about the protocol to know if we can expect some retry from a higher level or something. Do you know? Or do you have any pointers to see where the apiserver is using this and see what actually happens? There are so many abstractions in the apiserver, that to verify this is not as obvious as I'd like :-/

I'll review some notes to see if I had found other cases like this when looking at the code before, but just raising this to know what you think. Update with some random notes to take a look:

Do we want to do something on connections not recognized for CLOSE_RSP or DATA ? Just silently ignore is fine? Why is that?

cc @andrewsykim as you opened PR #329 :)

Mar 01 '22 15:03 rata

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jun 01 '22 11:06 k8s-triage-robot

/remove-lifecycle stale

Jun 01 '22 11:06 rata

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Aug 30 '22 12:08 k8s-triage-robot

/remove-lifecycle stale

Sep 05 '22 13:09 rata

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Dec 04 '22 14:12 k8s-triage-robot

/remove-lifecycle stale

Dec 05 '22 10:12 rata

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Mar 05 '23 10:03 k8s-triage-robot

/remove-lifecycle stale

Mar 06 '23 11:03 rata

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jun 04 '23 11:06 k8s-triage-robot

/remove-lifecycle stale

Jun 23 '23 10:06 rata

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jan 23 '24 05:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Feb 22 '24 05:02 k8s-triage-robot

This is not fixed, AFAIK. But I'm unsure what is the value of keeping it open, as I don't plan to work on this in the near future and anyone else didn't jump in either. I'll not continue to push to keep it open, even though it might be relevant in the future.

Feb 22 '24 21:02 rata

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Mar 23 '24 21:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Mar 23 '24 21:03 k8s-ci-robot

apiserver-network-proxy apiserver-network-proxy copied to clipboard

Proxy() goroutine leaks when serveRecvFrontend() finishes without receiving EOF

apiserver-network-proxy
apiserver-network-proxy copied to clipboard