gateway
gateway copied to clipboard
Add e2e test to validate hitless reload of Envoy Gateway
Description: Add a test case in our e2e framework https://github.com/envoyproxy/gateway/tree/main/test/e2e where
- a client continuously sends traffic to a backend
- Envoy Gateway is restarted multiple times
- client successfully sends traffic using existing connection
Some background on this, I hit an issue where after restarting the controller or after a resync period envoy begins draining the existing listeners. I believe this is caused by a re-ordering of the routes sent over xDS.
In my case I had configured 2 TLSRoute resources and without any changes envoy began draining and eventually sent TCP resets to the long lived connections. I think a key part of this test is having several routes configured in order to reproduce the undesired behavior.
@dboslee long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)
long lived connections might be a separate sub issue related to keep alive timeouts (currently disabled by default)
To clarify in my case the connections were hitting the default 600 second
drain timeout (which can be configured in envoy via --drain-time-s
). But this only happens when a given filter chain gets drained. So 10 minutes after restarting the controller all the connections get reset.
@dboslee I would like to understand more about the new e2e test requirement.
What would be the short name and description for this new test? How about these:
Short name: GatewayRestartDraining
Description: Verify that the client can successfully send traffic using an existing connection even after the Envoy gateway/controller restarts.
Do we expect the client to fail to send requests after 10 minutes following the restart, as the connection is reset after 10 minutes?
Thanks!
@arkodg Do we have any sample test code that I can mimic? I see two existing tests, ratelimit and accesslog. Which one would be a better baseline code to copy for this specific requirement? Thanks!
Hi @Ronnie-personal thanks for the looking into this issue, assigning this you for now
- a good test to add would be a
Reload
test where we reload Envoy Gateway multiple times but the the/config_dump
output from Envoy Proxy https://www.envoyproxy.io/docs/envoy/latest/operations/admin#operations-admin-interface-config-dump remains unchanged thereby proving that after every restart Envoy Gateway generates identical config - a good example is the ratelimit test https://github.com/envoyproxy/gateway/blob/main/test/e2e/tests/ratelimit.go
@arkodg Would you help me on one question? When we say reload Envoy gateway, we mean delete the gateway resource and recreate it? For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?
Thanks!
@arkodg I have some ideas on /config_dump, would you mind checking if I'm on the right track?
Here are the steps in the reload.go test code,
- Obtain Kubernetes client from cSuite
- Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system' (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
- Get admin url (host/ip and port) from the service
- Send http get request to the url/config_dump, save the response.
- Send http get request again and compare the config dump with previous response.
Thanks!
hey @Ronnie-personal that looks right ! updating some steps
Obtain Kubernetes client from cSuite
- Use the client to get Envoy Proxy service by name 'proxy-config' and namespace 'envoy-gateway-system' (e.g --selector=gateway.envoyproxy.io/owning-gateway-namespace=envoy-gateway-system,gateway.envoyproxy.io/owning-gateway-name=proxy-config)
-
port forward to envoy proxy pod, since the admin host is
127.0.0.1
- Send http get request to the url/config_dump, save the response.
- restart Envoy Gateway using kubernetes client
- Send http get request again and compare the config dump with previous response.
@arkodg Would you help me on one question? When we say reload Envoy gateway, we mean delete the gateway resource and recreate it? For example, delete gateway named 'same-namespace' in namespace 'gateway-conformance-infra' and recreate it?
Thanks!
@arkodg Thanks for the prompt reply! Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.
Thanks!
Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.
This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.
Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.
This happens here https://github.com/envoyproxy/gateway/blob/708287e53af5146873c91c9a4cda2a9ed8126f03/internal/xds/translator/translator.go#L230 where ir.TCPListener.Name
is actually the name of a specific TCPRoute or TLSRoute.
Could you confirm my understanding of 'restart Envoy Gateway'? This is for for the step #4.
This means restarting the envoy gateway controller. Which can be done by deleting the controller pods or updating the envoy gateway controller deployment in such a way that it triggers a rollout of new pods for you.
Thanks for the clarification. Correct me if I'm wrong, so I need to locate the envoy gateway service, delete it and recreate? If you happen to know where the envoy gateway gets created in the code stack, please let me know. I have looked at https://github.com/envoyproxy/gateway/blob/main/test/config/gatewayclass.yaml and https://github.com/envoyproxy/gateway/blob/main/test/e2e/base/manifests.yaml. But none of them defines envoy gateway.
Thanks!
Also update on the issue which sparked this conversation, while I initially thought the issue was caused by route order changing this was just a trigger for the actual root cause. Which is that the TCP listener name is set to name of the first route processed by the controller so when routes get processed in a different order or the route the TCP listener is named after is deleted the TCP listener name changes and the listener is drained.
This happens here
https://github.com/envoyproxy/gateway/blob/708287e53af5146873c91c9a4cda2a9ed8126f03/internal/xds/translator/translator.go#L230
where
ir.TCPListener.Name
is actually the name of a specific TCPRoute or TLSRoute.
Do we consider it as a bug? What would be the options to fix this root issue?
thanks for highlighting this @dboslee !
as @Ronnie-personal pointed out, we do incorporate the route name in the tcp listener (refer to https://github.com/envoyproxy/gateway/pull/696) since the tcp listener directly maps to a cluster
/ destination
https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/tcp_proxy/v3/tcp_proxy.proto#envoy-v3-api-msg-extensions-filters-network-tcp-proxy-v3-tcpproxy .
So as you've shared, if the routes are processed in a different order, the order for the tcp listeners will also change causing the hot restart https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/hot_restart .
-
Sorting
pre
orpost
in thegateway-api
layer might be a good enough solution here -
Another thing to ensure mainly for performance reasons is to prevent retranslation all together in
gateway-api
by ensuring the the watchable message for resources can skip updating to subscriber (gateway-api
here) published by the producer (provider
) if the contents of the resource list hasn't changed, even if the order of the resource list has.
will raise GH issues for each of these so this GH issue around E2E is not forgotten :)
I did a quick manual test locally using following steps,
- I created envoy gateway and quick-start resources.
- I saved the envoy proxy config_dump from 127.0.0.1:19000/config_dump
- Delete the envoy gateway pod, kubectl get pods --selector=control-plane=envoy-gateway -n envoy-gateway-system
- Another envoy gateway pod is up and running
- Obtain config_dump and compare it with the previous one
- The fields in the 'metadata' section are reordered, even though the actual data is exactly the same. (please see screenshot)
My questions are: Is this an expected behavior? Should we ignore the 'metadata' section when comparing multiple config_dumps? Which sections of the config_dump really matter for the comparison?
- this is the right approach @Ronnie-personal, you can skip comparison of some fields using
cmp.Diff
which we use in this project, here's an example. from the above image, it looks like content is same, some config fields order has changed (might be due to usage of a map which lacks order), unsure ifcmp.Diff
handles this w/o any extra work or might require additional options likecmpopts.SortMaps
to be passed to it - its fine to compare the entire config_dump
@arkodg I have completed the initial coding work, I'm able to test the major logic as a standalone go code locally. I'm not sure how to test this new e2e test under the entire envoy gateway code stack.
Here is the 'reload' test code https://github.com/envoyproxy/gateway/compare/main...Ronnie-personal:gateway:e2ereload1503 Please feel free to give a quick review.
Thanks, Ronnie
hey @Ronnie-personal thanks for continuing to work on this, can you raise this as a PR so others in the community can also review, tia !
This issue has been automatically marked as stale because it has not had activity in the last 30 days.
This issue has been automatically marked as stale because it has not had activity in the last 30 days.
This issue has been automatically marked as stale because it has not had activity in the last 30 days.
This issue has been automatically marked as stale because it has not had activity in the last 30 days.
This issue has been automatically marked as stale because it has not had activity in the last 30 days.
This issue has been automatically marked as stale because it has not had activity in the last 30 days.
This issue has been automatically marked as stale because it has not had activity in the last 30 days.