Ensure connections persist until done on queue-proxy drain
Fixes: websockets (and some HTTP connections) closing abruptly when the queue-proxy undergoes drain.
Because net/http does not respect hijacked connections when server.Shutdown is called, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See https://github.com/gorilla/websocket/issues/448 and https://github.com/golang/go/issues/17721 for details. This patch fixes the issue by introducing an atomic counter of active requests, which is incremented as a request comes in and decremented when its handler terminates. During drain, this counter must reach zero (or the revision timeout must elapse) before .Shutdown is called.
Further, this prevents premature closing of connections in the user container due to misconfigured SIGTERM handling, by delaying the SIGTERM until the queue-proxy has verified it has fully drained.
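A minimal sketch of that approach, for illustration only (this is not the actual queue-proxy code; `drainingHandler`, `waitForDrain`, and the 10-minute timeout are made up):

```go
// Sketch: count in-flight requests atomically, and on drain wait for the
// count to reach zero (or for the revision timeout) before calling Shutdown.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

type drainingHandler struct {
	inflight atomic.Int64
	next     http.Handler
}

func (d *drainingHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.inflight.Add(1)
	// The deferred decrement only runs when the handler returns, so hijacked
	// (websocket) connections keep the counter positive until they finish.
	defer d.inflight.Add(-1)
	d.next.ServeHTTP(w, r)
}

// waitForDrain blocks until no requests are in flight or the timeout elapses.
func (d *drainingHandler) waitForDrain(timeout time.Duration) {
	deadline := time.Now().Add(timeout)
	for d.inflight.Load() > 0 && time.Now().Before(deadline) {
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	h := &drainingHandler{next: http.DefaultServeMux}
	srv := &http.Server{Addr: ":8080", Handler: h}

	go func() {
		// Wait for the drain signal, let in-flight requests finish
		// (up to the revision timeout), then shut the server down.
		sig := make(chan os.Signal, 1)
		signal.Notify(sig, syscall.SIGTERM)
		<-sig
		h.waitForDrain(10 * time.Minute) // stand-in for the revision timeout
		srv.Shutdown(context.Background())
	}()
	srv.ListenAndServe()
}
```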
Hi @elijah-rou. Thanks for your PR.
I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: elijah-rou. Once this PR has been reviewed and has the lgtm label, please assign skonto for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
:x: Patch coverage is 81.28655% with 32 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 80.76%. Comparing base (26a8cec) to head (1437e6f).
:warning: Report is 16 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #16080 +/- ##
==========================================
+ Coverage 80.20% 80.76% +0.55%
==========================================
Files 214 215 +1
Lines 16887 17038 +151
==========================================
+ Hits 13544 13760 +216
+ Misses 2987 2914 -73
- Partials 356 364 +8
:umbrella: View full report in Codecov by Sentry.
/retest
@elijah-rou: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
In response to this:
/retest
/ok-to-test
There are empty aliases in OWNER_ALIASES; cleanup is advised.
/retest
I'm going to drop some of the extra commits in this PR - they make the diff a bit confusing.
I cherry-picked the commits into this PR https://github.com/knative/serving/pull/16104 - it drops all the extra vendor changes in this PR. If you feel I've dropped anything important, feel free to update this PR by force pushing over it.
One observation I have is that the upgrade tests seem to be failing due to the changes.
You can see it here: https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=16080
The ProbeTest ensures we don't drop traffic when updating Knative components including the Revision Pods.
prober.go:171: "http://upgrade-probe.serving-tests.example.com" status = 502, want: 200
prober.go:172: response: status: 502, body: dial tcp 127.0.0.1:8080: connect: connection refused
..
prober.go:186: Stopping all probers
probe.go:63: CheckSLO() error SLI for "TestServingUpgrades/Run/ProbeTest" = 0.999738, wanted >= 1.000000
In CI it's pretty stable - https://testgrid.k8s.io/r/knative-own-testgrid/serving#continuous&width=90&include-filter-by-regex=ProbeTest
@elijah-rou: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| upgrade-tests_serving_main | 1437e6f19fc04d793d6888789c33c99cc6465024 | link | true | /test upgrade-tests |
| istio-latest-no-mesh_serving_main | 1437e6f19fc04d793d6888789c33c99cc6465024 | link | true | /test istio-latest-no-mesh |
/hold
Does your websocket e2e drain test reliably fail? I'm inclined to merge it in via a separate PR, but skip the test by default until someone introduces a fix.
I'm wondering if we need to do something like this for websocket handling in the queue proxy
https://go.dev/play/p/RwdLe7OXaPj
> I'm wondering if we need to do something like this for websocket handling in the queue proxy
I'll take a look
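The linked playground snippet isn't reproduced here, so the sketch below is only a guess at the kind of handling being suggested: track hijacked (websocket) connections via the server's ConnState hook and force-close them after Shutdown returns, since Shutdown does not wait for hijacked connections. `hijackTracker` and the wiring in `main` are illustrative, not queue-proxy code:

```go
// Sketch only: record hijacked (websocket) connections so they can be closed
// explicitly on drain, because server.Shutdown does not wait for them.
package main

import (
	"context"
	"net"
	"net/http"
	"sync"
	"time"
)

type hijackTracker struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
}

func (t *hijackTracker) connState(c net.Conn, s http.ConnState) {
	// Hijacked connections never transition to StateClosed, so everything
	// recorded here stays tracked; closing an already-closed conn is harmless.
	if s == http.StateHijacked {
		t.mu.Lock()
		t.conns[c] = struct{}{}
		t.mu.Unlock()
	}
}

// closeAll force-closes every hijacked connection the server has seen.
func (t *hijackTracker) closeAll() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for c := range t.conns {
		c.Close()
	}
}

func main() {
	t := &hijackTracker{conns: map[net.Conn]struct{}{}}
	srv := &http.Server{Addr: ":8080", ConnState: t.connState}

	go func() {
		time.Sleep(30 * time.Second)       // stand-in for the drain signal/deadline
		srv.Shutdown(context.Background()) // waits for non-hijacked requests only
		t.closeAll()                       // then close any remaining websockets
	}()
	srv.ListenAndServe()
}
```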
PR needs rebase.