gateway icon indicating copy to clipboard operation
gateway copied to clipboard

Add e2e test to validate hitless reload of Envoy Gateway

Open arkodg opened this issue 1 year ago • 24 comments

Description: Add a test case in our e2e framework https://github.com/envoyproxy/gateway/tree/main/test/e2e where

  • a client continuously sends traffic to a backend
  • Envoy Gateway is restarted multiple times
  • client successfully sends traffic using existing connection

arkodg avatar Jun 08 '23 00:06 arkodg

Some background on this, I hit an issue where after restarting the controller or after a resync period envoy begins draining the existing listeners. I believe this is caused by a re-ordering of the routes sent over xDS.

In my case I had configured 2 TLSRoute resources and without any changes envoy began draining and eventually sent TCP resets to the long lived connections. I think a key part of this test is having several routes configured in order to reproduce the undesired behavior.

dboslee avatar Jun 08 '23 19:06 dboslee

@dboslee long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)

arkodg avatar Jun 08 '23 20:06 arkodg

long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)

To clarify in my case the connections were hitting the default 600 second drain timeout (which can be configured in envoy via --drain-time-s). But this only happens when a given filter chain gets drained. So 10 minutes after restarting the controller all the connections get reset.

dboslee avatar Jun 09 '23 01:06 dboslee

@dboslee I would like to understand more about the new e2e test requirement.

What would be the short name and description for this new test? How about these:

Short name: GatewayRestartDraining
Description: Verify that the client can successfully send traffic using an existing connection even after the Envoy gateway/controller restarts.

Do we expect the client to fail to send requests after 10 minutes following the restart, as the connection is reset after 10 minutes?

Thanks!

Ronnie-personal avatar Jul 06 '23 02:07 Ronnie-personal

@arkodg Do we have any sample test code that I can mimic? I see two existing tests, ratelimit and accesslog. Which one would be a better baseline code to copy for this specific requirement? Thanks!

Ronnie-personal avatar Jul 06 '23 03:07 Ronnie-personal

Hi @Ronnie-personal thanks for the looking into this issue, assigning this you for now

  • a good test to add would be a Reload test where we reload Envoy Gateway multiple times but the the /config_dump output from Envoy Proxy https://www.envoyproxy.io/docs/envoy/latest/operations/admin#operations-admin-interface-config-dump remains unchanged thereby proving that after every restart Envoy Gateway generates identical config
  • a good example is the ratelimit test https://github.com/envoyproxy/gateway/blob/main/test/e2e/tests/ratelimit.go

arkodg avatar Jul 06 '23 03:07 arkodg

@arkodg Would you help me on one question? When we say reload Envoy gateway, we mean delete the gateway resource and recreate it? For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?

Thanks!

Ronnie-personal avatar Jul 07 '23 02:07 Ronnie-personal

@arkodg I have some ideas on /config_dump, would you mind checking if I'm on the right track?

Here are the steps in the reload.go test code,

  • Obtain Kubernetes client from cSuite
  • Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system' (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
  • Get admin url (host/ip and port) from the service
  • Send http get request to the url/config_dump, save the response.
  • Send http get request again and compare the config dump with previous response.

Thanks!

Ronnie-personal avatar Jul 07 '23 03:07 Ronnie-personal

hey @Ronnie-personal that looks right ! updating some steps

Obtain Kubernetes client from cSuite

  1. Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system' (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
  2. port forward to envoy proxy pod, since the admin host is 127.0.0.1
  3. Send http get request to the url/config_dump, save the response.
  4. restart Envoy Gateway using kubernetes client
  5. Send http get request again and compare the config dump with previous response.

arkodg avatar Jul 07 '23 03:07 arkodg

@arkodg Would you help me on one question? When we say reload Envoy gateway, we mean delete the gateway resource and recreate it? For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?

Thanks!

@arkodg Thanks for the prompt reply! Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

Thanks!

Ronnie-personal avatar Jul 07 '23 14:07 Ronnie-personal

Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.

dboslee avatar Jul 07 '23 15:07 dboslee

Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.

This happens here https://github.com/envoyproxy/gateway/blob/708287e53af5146873c91c9a4cda2a9ed8126f03/internal/xds/translator/translator.go#L230 where ir.TCPListener.Name is actually the name of a specific TCPRoute or TLSRoute. 

dboslee avatar Jul 07 '23 15:07 dboslee

Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.

Thanks for the clarification. Correct me if I'm wrong, so I need to locate the envoy gateway service, delete it and recreate? If you happen to know where the envoy gateway gets created in the code stack, please let me know. I have looked at https://github.com/envoyproxy/gateway/blob/main/test/config/gatewayclass.yaml and https://github.com/envoyproxy/gateway/blob/main/test/e2e/base/manifests.yaml. But none of them defines envoy gateway.

Thanks!

Ronnie-personal avatar Jul 07 '23 16:07 Ronnie-personal

Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.

This happens here

https://github.com/envoyproxy/gateway/blob/708287e53af5146873c91c9a4cda2a9ed8126f03/internal/xds/translator/translator.go#L230

where ir.TCPListener.Name is actually the name of a specific TCPRoute or TLSRoute.

Do we consider it as a bug? What would be the options to fix this root issue?

Ronnie-personal avatar Jul 07 '23 16:07 Ronnie-personal

thanks for highlighting this @dboslee ! as @Ronnie-personal pointed out, we do incorporate the route name in the tcp listener (refer to https://github.com/envoyproxy/gateway/pull/696) since the tcp listener directly maps to a cluster / destination https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/tcp_proxy/v3/tcp_proxy.proto#envoy-v3-api-msg-extensions-filters-network-tcp-proxy-v3-tcpproxy . So as you've shared, if the routes are processed in a different order, the order for the tcp listeners will also change causing the hot restart https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/hot_restart .

  • Sorting pre or post in the gateway-api layer might be a good enough solution here

  • Another thing to ensure mainly for performance reasons is to prevent retranslation all together in gateway-api by ensuring the the watchable message for resources can skip updating to subscriber (gateway-api here) published by the producer (provider) if the contents of the resource list hasn't changed, even if the order of the resource list has.

will raise GH issues for each of these so this GH issue around E2E is not forgotten :)

arkodg avatar Jul 07 '23 19:07 arkodg

I did a quick manual test locally using following steps,

  • I created envoy gateway and quick-start resources.
  • I saved the envoy proxy config_dump from 127.0.0.1:19000/config_dump
  • Delete the envoy gateway pod, kubectl get pods --selector=control-plane=envoy-gateway -n envoy-gateway-system
  • Another envoy gateway pod is up and running
  • Obtain config_dump and compare it with the previous one
  • The fields in the 'metadata' section are reordered, even though the actual data is exactly the same. (please see screenshot) image

My questions are: Is this an expected behavior? Should we ignore the 'metadata' section when comparing multiple config_dumps? Which sections of the config_dump really matter for the comparison?

Ronnie-personal avatar Jul 08 '23 02:07 Ronnie-personal

  • this is the right approach @Ronnie-personal, you can skip comparison of some fields using cmp.Diff which we use in this project, here's an example. from the above image, it looks like content is same, some config fields order has changed (might be due to usage of a map which lacks order), unsure if cmp.Diff handles this w/o any extra work or might require additional options like cmpopts.SortMaps to be passed to it
  • its fine to compare the entire config_dump

arkodg avatar Jul 10 '23 20:07 arkodg

@arkodg I have completed the initial coding work, I'm able to test the major logic as a standalone go code locally. I'm not sure how to test this new e2e test under the entire envoy gateway code stack.

Here is the 'reload' test code https://github.com/envoyproxy/gateway/compare/main...Ronnie-personal:gateway:e2ereload1503 Please feel free to give a quick review.

Thanks, Ronnie

Ronnie-personal avatar Jul 15 '23 21:07 Ronnie-personal

hey @Ronnie-personal thanks for continuing to work on this, can you raise this as a PR so others in the community can also review, tia !

arkodg avatar Jul 18 '23 20:07 arkodg

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Aug 31 '23 00:08 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Nov 20 '23 00:11 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Jan 06 '24 20:01 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Mar 09 '24 08:03 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Apr 27 '24 08:04 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Jun 22 '24 08:06 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Aug 30 '24 20:08 github-actions[bot]