gateway Add e2e test to validate hitless reload of Envoy Gateway

Description: Add a test case in our e2e framework https://github.com/envoyproxy/gateway/tree/main/test/e2e where

a client continuously sends traffic to a backend
Envoy Gateway is restarted multiple times
client successfully sends traffic using existing connection

Jun 08 '23 00:06 arkodg

Some background on this, I hit an issue where after restarting the controller or after a resync period envoy begins draining the existing listeners. I believe this is caused by a re-ordering of the routes sent over xDS.

In my case I had configured 2 TLSRoute resources and without any changes envoy began draining and eventually sent TCP resets to the long lived connections. I think a key part of this test is having several routes configured in order to reproduce the undesired behavior.

Jun 08 '23 19:06 dboslee

@dboslee long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)

Jun 08 '23 20:06 arkodg

long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)

To clarify in my case the connections were hitting the default 600 second drain timeout (which can be configured in envoy via --drain-time-s). But this only happens when a given filter chain gets drained. So 10 minutes after restarting the controller all the connections get reset.

Jun 09 '23 01:06 dboslee

@dboslee I would like to understand more about the new e2e test requirement.

What would be the short name and description for this new test? How about these:

Short name: GatewayRestartDraining
Description: Verify that the client can successfully send traffic using an existing connection even after the Envoy gateway/controller restarts.

Do we expect the client to fail to send requests after 10 minutes following the restart, as the connection is reset after 10 minutes?

Thanks!

Jul 06 '23 02:07 Ronnie-personal

@arkodg Do we have any sample test code that I can mimic? I see two existing tests, ratelimit and accesslog. Which one would be a better baseline code to copy for this specific requirement? Thanks!

Jul 06 '23 03:07 Ronnie-personal

Hi @Ronnie-personal thanks for the looking into this issue, assigning this you for now

a good test to add would be a Reload test where we reload Envoy Gateway multiple times but the the /config_dump output from Envoy Proxy https://www.envoyproxy.io/docs/envoy/latest/operations/admin#operations-admin-interface-config-dump remains unchanged thereby proving that after every restart Envoy Gateway generates identical config
a good example is the ratelimit test https://github.com/envoyproxy/gateway/blob/main/test/e2e/tests/ratelimit.go

Jul 06 '23 03:07 arkodg

@arkodg Would you help me on one question? When we say reload Envoy gateway, we mean delete the gateway resource and recreate it? For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?

Thanks!

Jul 07 '23 02:07 Ronnie-personal

@arkodg I have some ideas on /config_dump, would you mind checking if I'm on the right track?

Here are the steps in the reload.go test code,

Obtain Kubernetes client from cSuite
Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system' (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
Get admin url (host/ip and port) from the service
Send http get request to the url/config_dump, save the response.
Send http get request again and compare the config dump with previous response.

Thanks!

Jul 07 '23 03:07 Ronnie-personal

hey @Ronnie-personal that looks right ! updating some steps

Obtain Kubernetes client from cSuite

Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system' (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
port forward to envoy proxy pod, since the admin host is 127.0.0.1
Send http get request to the url/config_dump, save the response.
restart Envoy Gateway using kubernetes client
Send http get request again and compare the config dump with previous response.

Jul 07 '23 03:07 arkodg

@arkodg Would you help me on one question? When we say reload Envoy gateway, we mean delete the gateway resource and recreate it? For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?

Thanks!

@arkodg Thanks for the prompt reply! Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

Thanks!

Jul 07 '23 14:07 Ronnie-personal

Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.

Jul 07 '23 15:07 dboslee

Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.

This happens here https://github.com/envoyproxy/gateway/blob/708287e53af5146873c91c9a4cda2a9ed8126f03/internal/xds/translator/translator.go#L230 where ir.TCPListener.Name is actually the name of a specific TCPRoute or TLSRoute.

Jul 07 '23 15:07 dboslee

Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.

This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.

Thanks for the clarification. Correct me if I'm wrong, so I need to locate the envoy gateway service, delete it and recreate? If you happen to know where the envoy gateway gets created in the code stack, please let me know. I have looked at https://github.com/envoyproxy/gateway/blob/main/test/config/gatewayclass.yaml and https://github.com/envoyproxy/gateway/blob/main/test/e2e/base/manifests.yaml. But none of them defines envoy gateway.

Thanks!

Jul 07 '23 16:07 Ronnie-personal

Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.

This happens here

https://github.com/envoyproxy/gateway/blob/708287e53af5146873c91c9a4cda2a9ed8126f03/internal/xds/translator/translator.go#L230

where ir.TCPListener.Name is actually the name of a specific TCPRoute or TLSRoute.

Do we consider it as a bug? What would be the options to fix this root issue?

Jul 07 '23 16:07 Ronnie-personal

thanks for highlighting this @dboslee ! as @Ronnie-personal pointed out, we do incorporate the route name in the tcp listener (refer to https://github.com/envoyproxy/gateway/pull/696) since the tcp listener directly maps to a cluster / destination https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/tcp_proxy/v3/tcp_proxy.proto#envoy-v3-api-msg-extensions-filters-network-tcp-proxy-v3-tcpproxy . So as you've shared, if the routes are processed in a different order, the order for the tcp listeners will also change causing the hot restart https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/hot_restart .

Sorting pre or post in the gateway-api layer might be a good enough solution here
Another thing to ensure mainly for performance reasons is to prevent retranslation all together in gateway-api by ensuring the the watchable message for resources can skip updating to subscriber (gateway-api here) published by the producer (provider) if the contents of the resource list hasn't changed, even if the order of the resource list has.

will raise GH issues for each of these so this GH issue around E2E is not forgotten :)

Jul 07 '23 19:07 arkodg

I did a quick manual test locally using following steps,

I created envoy gateway and quick-start resources.
I saved the envoy proxy config_dump from 127.0.0.1:19000/config_dump
Delete the envoy gateway pod, kubectl get pods --selector=control-plane=envoy-gateway -n envoy-gateway-system
Another envoy gateway pod is up and running
Obtain config_dump and compare it with the previous one
The fields in the 'metadata' section are reordered, even though the actual data is exactly the same. (please see screenshot)

My questions are: Is this an expected behavior? Should we ignore the 'metadata' section when comparing multiple config_dumps? Which sections of the config_dump really matter for the comparison?

Jul 08 '23 02:07 Ronnie-personal

this is the right approach @Ronnie-personal, you can skip comparison of some fields using cmp.Diff which we use in this project, here's an example. from the above image, it looks like content is same, some config fields order has changed (might be due to usage of a map which lacks order), unsure if cmp.Diff handles this w/o any extra work or might require additional options like cmpopts.SortMaps to be passed to it
its fine to compare the entire config_dump

Jul 10 '23 20:07 arkodg

@arkodg I have completed the initial coding work, I'm able to test the major logic as a standalone go code locally. I'm not sure how to test this new e2e test under the entire envoy gateway code stack.

Here is the 'reload' test code https://github.com/envoyproxy/gateway/compare/main...Ronnie-personal:gateway:e2ereload1503 Please feel free to give a quick review.

Thanks, Ronnie

Jul 15 '23 21:07 Ronnie-personal

hey @Ronnie-personal thanks for continuing to work on this, can you raise this as a PR so others in the community can also review, tia !

Jul 18 '23 20:07 arkodg