Ensure connections persist until done on queue-proxy drain

Open elijah-rou opened this issue 3 months ago • 15 comments

Fixes: Websockets (and some HTTP) closing abruptly when queue-proxy undergoes drain.

Because net/http does not wait for hijacked connections when server.Shutdown is called, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See https://github.com/gorilla/websocket/issues/448 and https://github.com/golang/go/issues/17721 for details. This patch fixes the issue by introducing an atomic counter of active requests, which is incremented when a request comes in and decremented when its handler terminates. During drain, the queue-proxy calls .Shutdown only once this counter has reached zero or the revision timeout has elapsed.

Further, this prevents premature closing of connections in the user container caused by misconfigured SIGTERM handling, by delaying the SIGTERM until the queue-proxy has verified that it has fully drained.

elijah-rou avatar Sep 12 '25 15:09 elijah-rou

Hi @elijah-rou. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow[bot] avatar Sep 12 '25 15:09 knative-prow[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou

Once this PR has been reviewed and has the lgtm label, please assign skonto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

knative-prow[bot] avatar Sep 12 '25 15:09 knative-prow[bot]

Codecov Report

:x: Patch coverage is 81.28655% with 32 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 80.76%. Comparing base (26a8cec) to head (1437e6f).
:warning: Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/queue/sharedmain/main.go | 0.00% | 23 Missing :warning: |
| pkg/queue/breaker.go | 68.75% | 5 Missing :warning: |
| pkg/activator/net/throttler.go | 71.42% | 3 Missing and 1 partial :warning: |

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16080      +/-   ##
==========================================
+ Coverage   80.20%   80.76%   +0.55%     
==========================================
  Files         214      215       +1     
  Lines       16887    17038     +151     
==========================================
+ Hits        13544    13760     +216     
+ Misses       2987     2914      -73     
- Partials      356      364       +8     

codecov[bot] avatar Sep 12 '25 15:09 codecov[bot]

/retest

elijah-rou avatar Sep 12 '25 17:09 elijah-rou

@elijah-rou: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

knative-prow[bot] avatar Sep 12 '25 17:09 knative-prow[bot]

/ok-to-test

dprotaso avatar Sep 15 '25 17:09 dprotaso

There are empty aliases in OWNER_ALIASES, cleanup is advised.

knative-prow[bot] avatar Sep 30 '25 16:09 knative-prow[bot]

/retest

dprotaso avatar Oct 01 '25 00:10 dprotaso

I'm going to drop some of the extra commits in this PR - it makes the diff a bit confusing

dprotaso avatar Oct 01 '25 00:10 dprotaso

I cherry-picked the commits into this PR: https://github.com/knative/serving/pull/16104. That drops all the extra vendor changes in this PR. If you feel I have dropped anything important, feel free to update this PR by force-pushing over it.

One observation I have is that the upgrade tests seem to be failing due to the changes.

You can see it here: https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=16080

The ProbeTest ensures we don't drop traffic when updating Knative components including the Revision Pods.

prober.go:171: "http://upgrade-probe.serving-tests.example.com" status = 502, want: 200
prober.go:172: response: status: 502, body: dial tcp 127.0.0.1:8080: connect: connection refused
..
prober.go:186: Stopping all probers
probe.go:63: CheckSLO() error SLI for "TestServingUpgrades/Run/ProbeTest" = 0.999738, wanted >= 1.000000

In CI this test is pretty stable - https://testgrid.k8s.io/r/knative-own-testgrid/serving#continuous&width=90&include-filter-by-regex=ProbeTest
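
To put the SLI in the log above in perspective: an SLO of 1.0 tolerates no failed probes at all. The sketch below assumes the SLI is simply the ratio of successful probes to total probes (the actual computation lives in the Knative test tooling and may differ), and the request counts are hypothetical, chosen only because one failure in 3817 probes gives a ratio of roughly 0.999738.

```go
package main

// sli returns the fraction of successful probes.
// Assumed definition, for illustration only.
func sli(success, total int) float64 {
	if total == 0 {
		return 1
	}
	return float64(success) / float64(total)
}

// meetsSLO reports whether the measured SLI satisfies the objective.
func meetsSLO(measured, objective float64) bool {
	return measured >= objective
}
```

With these hypothetical counts, `meetsSLO(sli(3816, 3817), 1.0)` is false: under this definition a single 502 during the upgrade is enough to fail the check.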

dprotaso avatar Oct 01 '25 01:10 dprotaso

@elijah-rou: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| upgrade-tests_serving_main | 1437e6f19fc04d793d6888789c33c99cc6465024 | link | true | /test upgrade-tests |
| istio-latest-no-mesh_serving_main | 1437e6f19fc04d793d6888789c33c99cc6465024 | link | true | /test istio-latest-no-mesh |

Your PR dashboard.

knative-prow[bot] avatar Oct 01 '25 02:10 knative-prow[bot]

/hold

Does your websocket e2e drain test reliably fail? I'm inclined to merge it via a separate PR, but skip the test by default until someone introduces a fix.

dprotaso avatar Oct 01 '25 02:10 dprotaso

I'm wondering if we need to do something like this for websocket handling in the queue proxy

https://go.dev/play/p/RwdLe7OXaPj

dprotaso avatar Oct 01 '25 14:10 dprotaso

> I'm wondering if we need to do something like this for websocket handling in the queue proxy
>
> go.dev/play/p/RwdLe7OXaPj

I'll take a look

elijah-rou avatar Oct 02 '25 15:10 elijah-rou

PR needs rebase.

knative-prow-robot avatar Oct 10 '25 05:10 knative-prow-robot