apiserver-network-proxy
(WIP - Do not review) Terminate Idle connections by setting a timeout to free up resources
This PR fixes #276 by addressing issue #311.
We introduce a new flag, --grpc-max-idle-time, after which any idle connections from the frontend (kube-apiserver) to the proxy-server are terminated. The sketch below illustrates the general idea.
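The flag is meant to work roughly as in the following sketch: each frontend connection gets a timer that is reset whenever traffic flows, and the connection is closed if the timer ever fires. This is only an illustration of the idea under that assumption; idleTracker, Touch, and the callback are hypothetical names, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// idleTracker wraps an onIdle callback that fires once the connection has been
// quiet for the full timeout; Touch resets the countdown.
type idleTracker struct {
	mu      sync.Mutex
	timer   *time.Timer
	timeout time.Duration
}

func newIdleTracker(timeout time.Duration, onIdle func()) *idleTracker {
	t := &idleTracker{timeout: timeout}
	t.timer = time.AfterFunc(timeout, onIdle)
	return t
}

// Touch is called whenever a packet flows on the frontend connection.
func (t *idleTracker) Touch() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.timer.Reset(t.timeout)
}

// Stop cancels the idle timer when the connection is closed normally.
func (t *idleTracker) Stop() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.timer.Stop()
}

func main() {
	done := make(chan struct{})
	tr := newIdleTracker(2*time.Second, func() {
		fmt.Println("frontend connection idle for too long; closing it")
		close(done)
	})
	tr.Touch() // simulate a packet; the proxy-server would do this per received packet
	<-done
	tr.Stop()
}
```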
To test this PR, reproduce the following steps:
- Check out this PR/branch locally and build the proxy-server binary.
- Follow the steps described in issue #311.
Start the proxy-server with:
./bin/proxy-server --server-port=0 --uds-name=/tmp/uds-proxy --cluster-ca-cert=certs/agent/issued/ca.crt --cluster-cert=certs/agent/issued/proxy-frontend.crt --cluster-key=certs/agent/private/proxy-frontend.key --warn-on-channel-limit --v 5 --keepalive-time 1m --frontend-keepalive-time 1m --enable-profiling --grpc-max-idle-time 3m 2>&1 | tee server.log
Before running the stress test, record the following metrics:
$ curl -s localhost:8095/metrics | grep -E "grpc"
# HELP konnectivity_network_proxy_server_grpc_connections Number of current grpc connections, partitioned by service method.
# TYPE konnectivity_network_proxy_server_grpc_connections gauge
konnectivity_network_proxy_server_grpc_connections{service_method="Connect"} 1
konnectivity_network_proxy_server_grpc_connections{service_method="Proxy"} 0
$ curl -s localhost:8095/metrics | grep -E "open_fds"
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13
Take note of the metric values during the stress test.
Once the stress test is complete and enough time has passed to trigger the --grpc-max-idle-time timeout, you'll notice that the metric values return to their pre-stress-test levels.
/cc @rata @iaguis
@ipochi: GitHub didn't allow me to request PR reviews from the following users: rata, iaguis.
Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Welcome @ipochi!
It looks like this is your first PR to kubernetes-sigs/apiserver-network-proxy 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes-sigs/apiserver-network-proxy has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Hi @ipochi. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
I'm curious what Walter Fender and Joseph Anttila Hall think about this "big hammer" hack. This is what is mentioned in the 4th item here, where I explain why I'd like some big hammer like that: when the leak is in the konn-client, we need to backport fixes to several k8s versions, and a fix in the proxy-server is way easier to deploy, besides helping with future regressions until they are fixed. Currently we had to disable konnectivity until we fix these leaks; with a hammer, we expect disabling it won't be needed and we can rely on the hammer until the leak is fixed.
As long as this is an opt-in change (which it seems to be) I am pro the approach. I do have a couple of minor concerns. One is that this introduces yet another ID for us to track. I would like to see if it corresponds to something we already have; we already have a lot of IDs and things to track, and adding another one just makes debugging the system even harder.
I also would like to see if we can fix the frontend connection to be mux'd on a single gRPC connection between each KAS <-> KonnServer pair (when gRPC is configured). I don't think this prevents that, but it may complicate it slightly.
# sigs.k8s.io/apiserver-network-proxy/tests [sigs.k8s.io/apiserver-network-proxy/tests.test]
tests/proxy_test.go:427:33: not enough arguments in call to server.NewProxyServer
have (string, []server.ProxyStrategy, int, *server.AgentTokenAuthenticationOptions, bool)
want (string, []server.ProxyStrategy, int, *server.AgentTokenAuthenticationOptions, bool, time.Duration)
tests/proxy_test.go:466:28: not enough arguments in call to server.NewProxyServer
have (string, []server.ProxyStrategy, number, *server.AgentTokenAuthenticationOptions, bool)
want (string, []server.ProxyStrategy, int, *server.AgentTokenAuthenticationOptions, bool, time.Duration)
# sigs.k8s.io/apiserver-network-proxy/pkg/server [sigs.k8s.io/apiserver-network-proxy/pkg/server.test]
pkg/server/server_test.go:162:23: not enough arguments in call to NewProxyServer
have (string, []ProxyStrategy, number, *AgentTokenAuthenticationOptions, bool)
want (string, []ProxyStrategy, int, *AgentTokenAuthenticationOptions, bool, time.Duration)
pkg/server/server_test.go:192:21: not enough arguments in call to NewProxyServer
have (string, []ProxyStrategy, number, nil, bool)
want (string, []ProxyStrategy, int, *AgentTokenAuthenticationOptions, bool, time.Duration)
pkg/server/server_test.go:200:20: not enough arguments in call to NewProxyServer
have (string, []ProxyStrategy, number, nil, bool)
want (string, []ProxyStrategy, int, *AgentTokenAuthenticationOptions, bool, time.Duration)
FAIL sigs.k8s.io/apiserver-network-proxy/pkg/server [build failed]
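For reference, the build failures above are the mechanical fallout of the signature change: every existing call to NewProxyServer needs the new trailing time.Duration. A hedged sketch of the shape of the fix, using placeholder arguments rather than the actual values in proxy_test.go and server_test.go:

```go
package tests

import (
	"time"

	"sigs.k8s.io/apiserver-network-proxy/pkg/server"
)

func exampleUpdatedCall() {
	// Placeholder arguments; the real tests keep their existing values and
	// simply append the idle-timeout duration introduced by this PR.
	server.NewProxyServer(
		"test-server-id",         // string
		[]server.ProxyStrategy{}, // []server.ProxyStrategy
		1,                        // int
		nil,                      // *server.AgentTokenAuthenticationOptions
		false,                    // bool
		3*time.Minute,            // time.Duration: the new idle timeout
	)
}
```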
@cheftako cool, good to know this approach looks fine if it is opt-in (as it is, yes :)). Thanks!
And yeah, I'd like this to be simpler too. We will see with @ipochi whether this can be simplified, or at least whether the complexity can be reduced when this opt-in flag is not used.
UPDATE:
Found an issue with this PR: memory usage continuously increases after 5 or 6 hours. Still trying to debug the cause.
While the PR terminates idle connections, which decreases both the open_fds metric and the usage reported under vendor/google.golang.org/grpc/internal/transport/http2_server.go, this surprisingly results in only a ~10% memory usage reduction despite closing all open gRPC connections.
The same behaviour of only a ~10% memory usage reduction after the test was also observed with the out-of-the-box gRPC keepalive parameters (MaxConnectionAge and MaxConnectionAgeGrace). Currently weighing the pros and cons of both approaches to terminating idle connections (see the sketch below):
- This PR: terminates frontend->proxy-server connections based on inactivity, after a specified duration.
- gRPC params: terminates all frontend->proxy-server connections regardless of activity.
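For comparison, a minimal sketch of those built-in gRPC server keepalive knobs (not code from this PR; the durations and listen address are illustrative, and MaxConnectionIdle is shown only as the nearest built-in analogue to an inactivity-based timeout):

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:8090")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle:     3 * time.Minute,  // close connections with no outstanding RPCs
		MaxConnectionAge:      30 * time.Minute, // recycle connections by age, active or not
		MaxConnectionAgeGrace: 1 * time.Minute,  // grace period for in-flight RPCs to finish
	}))
	// In the real proxy-server the Proxy/Agent gRPC services are registered before serving.
	log.Fatal(srv.Serve(lis))
}
```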
/cc: @rata
@ipochi: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle rotten
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ipochi
Once this PR has been reviewed and has the lgtm label, please assign mcrute for approval by writing /assign @mcrute in a comment. For more information see: The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
@ipochi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
pull-apiserver-network-proxy-make-lint | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-make-lint |
pull-apiserver-network-proxy-docker-build-arm64 | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-docker-build-arm64 |
pull-apiserver-network-proxy-docker-build-amd64 | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-docker-build-amd64 |
pull-apiserver-network-proxy-test | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-test |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.