apiserver-network-proxy

(WIP - Do not review) Terminate Idle connections by setting a timeout to free up resources

Open: ipochi opened this pull request 3 years ago • 16 comments

This PR fixes #276 by addressing the issue #311.

We introduce a new flag, --grpc-max-idle-time, after which any idle connections from the frontend (kube-apiserver) to the proxy-server are terminated.
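
For intuition only, here is a minimal Go sketch of the kind of mechanism such a flag could drive: a per-connection last-activity timestamp and a reaper goroutine that closes connections once they exceed the configured idle time. This is not the PR's actual implementation; all type and function names below are hypothetical.

```go
package server

import (
	"sync"
	"time"
)

// trackedConn pairs a frontend connection's close function with the time
// of its last observed activity. All names here are illustrative.
type trackedConn struct {
	lastSeen time.Time
	close    func()
}

// idleReaper closes tracked connections that stay idle longer than
// maxIdleTime (the value --grpc-max-idle-time would carry).
type idleReaper struct {
	mu          sync.Mutex
	conns       map[int64]*trackedConn
	maxIdleTime time.Duration
}

// touch records activity for a connection, resetting its idle timer.
func (r *idleReaper) touch(id int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.conns[id]; ok {
		c.lastSeen = time.Now()
	}
}

// run wakes up periodically and closes any connection whose idle time
// exceeds maxIdleTime; it returns when stop is closed.
func (r *idleReaper) run(stop <-chan struct{}) {
	ticker := time.NewTicker(r.maxIdleTime / 2)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			r.mu.Lock()
			for id, c := range r.conns {
				if time.Since(c.lastSeen) > r.maxIdleTime {
					c.close()
					delete(r.conns, id)
				}
			}
			r.mu.Unlock()
		}
	}
}
```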

To test this PR, follow these steps:

  1. Check out this PR/branch locally and build the proxy-server binary.
  2. Follow the steps described in issue #311.

Start the proxy-server with:

./bin/proxy-server --server-port=0 --uds-name=/tmp/uds-proxy --cluster-ca-cert=certs/agent/issued/ca.crt --cluster-cert=certs/agent/issued/proxy-frontend.crt --cluster-key=certs/agent/private/proxy-frontend.key --warn-on-channel-limit --v 5 --keepalive-time 1m --frontend-keepalive-time 1m  --enable-profiling --grpc-max-idle-time 3m 2>&1 | tee server.log

Before running the stress test, record the following metrics:

$ curl -s localhost:8095/metrics | grep -E "grpc"
# HELP konnectivity_network_proxy_server_grpc_connections Number of current grpc connections, partitioned by service method.
# TYPE konnectivity_network_proxy_server_grpc_connections gauge
konnectivity_network_proxy_server_grpc_connections{service_method="Connect"} 1
konnectivity_network_proxy_server_grpc_connections{service_method="Proxy"} 0


$ curl -s localhost:8095/metrics | grep -E "open_fds"
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13

Take note of these metrics during the stress test.

Once the stress test is complete and enough time has passed to trigger a --grpc-max-idle-time timeout, you'll notice that the metric values return to their pre-stress-test levels.

/cc @rata @iaguis

ipochi avatar Dec 15 '21 12:12 ipochi

@ipochi: GitHub didn't allow me to request PR reviews from the following users: rata, iaguis.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

This PR fixes #276 by addressing the issue #311.

We introduce a new flag --grpc-max-idle-time after which any idle connections will be terminated.

To test this PR, follow these steps:

  1. Check out this PR/branch locally and build the proxy-server binary.
  2. Follow the steps described in issue #311.

Start the proxy-server with:

./bin/proxy-server --server-port=0 --uds-name=/tmp/uds-proxy --cluster-ca-cert=certs/agent/issued/ca.crt --cluster-cert=certs/agent/issued/proxy-frontend.crt --cluster-key=certs/agent/private/proxy-frontend.key --warn-on-channel-limit --v 5 --keepalive-time 1m --frontend-keepalive-time 1m  --enable-profiling --grpc-max-idle-time 3m 2>&1 | tee server.log

Before running the stress test, record the following metrics:

$ curl -s localhost:8095/metrics | grep -E "grpc"
# HELP konnectivity_network_proxy_server_grpc_connections Number of current grpc connections, partitioned by service method.
# TYPE konnectivity_network_proxy_server_grpc_connections gauge
konnectivity_network_proxy_server_grpc_connections{service_method="Connect"} 1
konnectivity_network_proxy_server_grpc_connections{service_method="Proxy"} 0


$ curl -s localhost:8095/metrics | grep -E "open_fds"
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13

Take note of these metrics during the stress test.

Once the stress test is complete and enough time has passed to trigger a --grpc-max-idle-time timeout, you'll notice that the metric values return to their pre-stress-test levels.

/cc @rata @iaguis

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 15 '21 12:12 k8s-ci-robot

Welcome @ipochi!

It looks like this is your first PR to kubernetes-sigs/apiserver-network-proxy 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/apiserver-network-proxy has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Dec 15 '21 12:12 k8s-ci-robot

Hi @ipochi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 15 '21 12:12 k8s-ci-robot

/ok-to-test

invidian avatar Dec 15 '21 13:12 invidian

I'm curious what Walter Fender and Joseph Anttila Hall think about this "big hammer" hack. This is what is mentioned in the 4th item here, where I explain why I'd like some big hammer like this: when the leak is in the konn-client, we need to backport the fix to several k8s versions, and having a fix in the proxy-server is much easier to deploy; it also helps with future regressions until they are fixed. Currently we had to disable konnectivity until we fix these leaks; with a hammer, we expect we wouldn't need to disable it and could rely on the hammer until the leak is fixed.

As long as this is an opt-in change (which it seems to be), I am in favor of the approach. I do have a couple of minor concerns. One is that this introduces yet another ID for us to track; I would like to see if it corresponds to something we already have. We already have a lot of IDs and things to track, and adding another one just makes debugging the system even harder.

I would also like to see if we can fix the frontend connections to be muxed over a single gRPC connection between each KAS/KonnServer pair (when gRPC is configured). I don't think this change prevents that, but it may complicate it slightly.

cheftako avatar Dec 22 '21 00:12 cheftako

# sigs.k8s.io/apiserver-network-proxy/tests [sigs.k8s.io/apiserver-network-proxy/tests.test]
tests/proxy_test.go:427:33: not enough arguments in call to server.NewProxyServer
	have (string, []server.ProxyStrategy, int, *server.AgentTokenAuthenticationOptions, bool)
	want (string, []server.ProxyStrategy, int, *server.AgentTokenAuthenticationOptions, bool, time.Duration)
tests/proxy_test.go:466:28: not enough arguments in call to server.NewProxyServer
	have (string, []server.ProxyStrategy, number, *server.AgentTokenAuthenticationOptions, bool)
	want (string, []server.ProxyStrategy, int, *server.AgentTokenAuthenticationOptions, bool, time.Duration)
# sigs.k8s.io/apiserver-network-proxy/pkg/server [sigs.k8s.io/apiserver-network-proxy/pkg/server.test]
pkg/server/server_test.go:162:23: not enough arguments in call to NewProxyServer
	have (string, []ProxyStrategy, number, *AgentTokenAuthenticationOptions, bool)
	want (string, []ProxyStrategy, int, *AgentTokenAuthenticationOptions, bool, time.Duration)
pkg/server/server_test.go:192:21: not enough arguments in call to NewProxyServer
	have (string, []ProxyStrategy, number, nil, bool)
	want (string, []ProxyStrategy, int, *AgentTokenAuthenticationOptions, bool, time.Duration)
pkg/server/server_test.go:200:20: not enough arguments in call to NewProxyServer
	have (string, []ProxyStrategy, number, nil, bool)
	want (string, []ProxyStrategy, int, *AgentTokenAuthenticationOptions, bool, time.Duration)
FAIL	sigs.k8s.io/apiserver-network-proxy/pkg/server [build failed]
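
The failures above indicate that NewProxyServer gained a trailing time.Duration parameter on this branch. Below is a hedged sketch of how a call site in the tests would be updated, assuming the return type and the strategy/auth arguments match the upstream code at the time; the concrete values are placeholders, not taken from this PR.

```go
package tests

import (
	"time"

	"sigs.k8s.io/apiserver-network-proxy/pkg/server"
)

// newTestProxyServer shows the updated call shape: the trailing
// time.Duration is the new idle timeout. Values are placeholders.
func newTestProxyServer() *server.ProxyServer {
	return server.NewProxyServer(
		"test-server-id", // server ID
		[]server.ProxyStrategy{server.ProxyStrategyDefault}, // proxy strategies
		1, // server count
		&server.AgentTokenAuthenticationOptions{}, // agent auth options
		false,         // existing bool option, unchanged
		3*time.Minute, // NEW: idle timeout (--grpc-max-idle-time)
	)
}
```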

cheftako avatar Dec 22 '21 00:12 cheftako

@cheftako cool, good to know this approach looks fine if it is opt-in (as it is, yes :)). Thanks!

And yeah, I'd like this to be simpler too. We will see with @ipochi whether this can be simplified, or at least whether the complexity can be reduced when this opt-in flag is not used.

rata avatar Dec 22 '21 15:12 rata

UPDATE:

Found an issue with this PR: memory usage continuously increases after 5 or 6 hours. Still trying to debug the cause.

While the PR terminates idle connections, which decreases the open_fds and vendor/google.golang.org/grpc/internal/transport/http2_server.go metrics, this surprisingly results in only a ~10% memory usage reduction despite closing all open gRPC connections.

The same behaviour of only a ~10% memory usage reduction after the test was observed when using the out-of-the-box gRPC keepalive parameters (MaxConnectionAge and MaxConnectionAgeGrace). I'm currently weighing the pros and cons of both approaches to terminating idle connections:

  • This PR: terminates frontend->proxy-server connections based on inactivity after a specified duration.
  • gRPC keepalive params: terminate all frontend->proxy-server connections regardless of activity (see the sketch below).
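
For comparison, here is a minimal sketch (not code from this PR) of how the out-of-the-box gRPC server keepalive parameters mentioned above are wired into a gRPC server; the durations and listen address are illustrative.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// With server-side keepalive, every connection is closed (gracefully)
	// once it reaches MaxConnectionAge + MaxConnectionAgeGrace, whether or
	// not it is still actively proxying traffic.
	kasp := keepalive.ServerParameters{
		MaxConnectionAge:      30 * time.Minute, // illustrative value
		MaxConnectionAgeGrace: 5 * time.Minute,  // illustrative value
	}
	srv := grpc.NewServer(grpc.KeepaliveParams(kasp))

	lis, err := net.Listen("tcp", ":8090")
	if err != nil {
		log.Fatal(err)
	}
	// Service registration omitted; this only shows the keepalive wiring.
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```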

/cc: @rata

ipochi avatar Jan 03 '22 14:01 ipochi

@ipochi: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 04 '22 05:03 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 02 '22 06:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 02 '22 06:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Aug 01 '22 07:08 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 01 '22 07:08 k8s-ci-robot

/remove-lifecycle rotten

cheftako avatar Sep 09 '22 22:09 cheftako

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ipochi. Once this PR has been reviewed and has the lgtm label, please assign mcrute for approval by writing /assign @mcrute in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Sep 09 '22 22:09 k8s-ci-robot

@ipochi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-apiserver-network-proxy-make-lint | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-make-lint
pull-apiserver-network-proxy-docker-build-arm64 | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-docker-build-arm64
pull-apiserver-network-proxy-docker-build-amd64 | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-docker-build-amd64
pull-apiserver-network-proxy-test | d9e5e96f12ed014f0ee031305ef5e63cfa3df458 | link | true | /test pull-apiserver-network-proxy-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Sep 09 '22 22:09 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 08 '22 22:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 07 '23 23:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Feb 07 '23 00:02 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 07 '23 00:02 k8s-ci-robot