
DIAL_CLS / DIAL_RSP race leading to connection leak

Open · tallclair opened this issue 3 years ago · 5 comments

There is a potential race condition in which a DIAL_CLS packet from the frontend is received at the same time as a DIAL_RSP from the backend, leading to the backend connection being leaked.

This can happen if the following events occur in this order:

  1. A DIAL_RSP is received from the backend.
  2. The pending dial is still present at the check in https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/b5e5436b2fbeaa03657ff2381cef0d46f18ce267/pkg/server/server.go#L755
  3. The frontend starts shutting down and sends a DIAL_CLS (prior to https://github.com/kubernetes-sigs/apiserver-network-proxy/pull/398 it wouldn't even send a close request).
  4. The server sends the dial response to the frontend. The frontend's gRPC stream is still open, so the packet is received, but the frontend doesn't process it: https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/b5e5436b2fbeaa03657ff2381cef0d46f18ce267/pkg/server/server.go#L767
  5. At this point, the server thinks the connection is established, but the frontend is not aware of that and is in the process of shutting down, leaving a leaked backend connection.

This seems fairly unlikely (at least once https://github.com/kubernetes-sigs/apiserver-network-proxy/issues/403 is fixed), but worth tracking.
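
To make the interleaving concrete, here is a minimal, self-contained Go sketch of the two racing handlers. Everything in it (the `server` struct, `pending`, `conns`, `frontendOpen`, and the handler names) is a hypothetical stand-in for the server's internal dial-tracking state, not the actual code at the links above:

```go
// A minimal sketch of the DIAL_RSP / DIAL_CLS race, NOT the actual server code.
// All names here are hypothetical stand-ins for the proxy server's state.
package main

import (
	"fmt"
	"sync"
)

type server struct {
	mu           sync.Mutex
	pending      map[int64]bool // dials awaiting DIAL_RSP, keyed by dial random ID
	conns        map[int64]bool // backend connections the server considers established
	frontendOpen bool           // whether the frontend gRPC stream is still open
}

// onDialResponse models steps 1, 2, and 4: the DIAL_RSP handler finds the
// pending dial, records the backend connection as established, and forwards
// the response to the frontend.
func (s *server) onDialResponse(random, connID int64) {
	s.mu.Lock()
	if !s.pending[random] {
		// Dial was already cancelled; the backend connection can be closed here.
		s.mu.Unlock()
		return
	}
	delete(s.pending, random)
	s.conns[connID] = true // server now believes the connection is established
	s.mu.Unlock()

	// Step 4: the frontend stream is still open, so this "succeeds", but a
	// frontend that is shutting down never processes the packet.
	if s.frontendOpen {
		fmt.Printf("DIAL_RSP for conn %d forwarded; frontend may silently drop it\n", connID)
	}
}

// onDialClose models step 3: the frontend cancels the dial. If onDialResponse
// already removed the pending entry, nothing here cleans up the established
// backend connection, so it leaks (step 5).
func (s *server) onDialClose(random int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending[random] {
		delete(s.pending, random) // normal cancellation: dial never completed
		return
	}
	// Pending dial was already promoted to an established connection keyed
	// by connID, which this handler doesn't know about -- the leak.
}

func main() {
	s := &server{
		pending:      map[int64]bool{42: true},
		conns:        map[int64]bool{},
		frontendOpen: true,
	}
	// The interleaving that triggers the leak:
	s.onDialResponse(42, 7) // DIAL_RSP wins the race; conn 7 marked established
	s.onDialClose(42)       // DIAL_CLS finds no pending dial; conn 7 is orphaned
	fmt.Printf("established conns after frontend shutdown: %v\n", s.conns)
}
```

In this sketch, the crux is that the close path is keyed by the dial's random ID while an established connection is keyed by its connID, so once the DIAL_RSP handler wins the race, the DIAL_CLS handler has no handle on the backend connection.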

tallclair · Sep 16 '22 21:09

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Dec 15 '22 22:12

/lifecycle frozen

tallclair · Dec 16 '22 00:12

/assign @jkh52

jkh52 · Feb 14 '23 19:02

/unassign @jkh52 /assign @azimjohn

tallclair · May 16 '24 16:05

@tallclair: GitHub didn't allow me to assign the following users: azimjohn.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/unassign @jkh52 /assign @azimjohn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · May 16 '24 16:05