gateway
gateway copied to clipboard
fix e2e EnvoyShutdown
What type of PR is this?
What this PR does / why we need it:
- add
testinto codecov ignore path - trying to fix #3262 by increasing the HTTPRoute Duration and move load generation after the deployment has been restarted
- fix the order of merge-gateways e2e test in makefile
Which issue(s) this PR fixes:
Fixes #3262
/retest
move load generation after the deployment has been restarted
Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..
move load generation after the deployment has been restarted
Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..
Thanks for the clarification, by root causing this e2e test case, sometimes after the restart process, the url that one request send to may be a 404 path, meaning there are some loss requests.
Maybe by sending another request to detect the url path right after the restart process can avoid this situation.
move load generation after the deployment has been restarted
Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..
Thanks for the clarification, by root causing this e2e test case, sometimes after the restart process, the url that one request send to may be a 404 path, meaning there are some loss requests.
Maybe by sending another request to detect the url path right after the restart process can avoid this situation.
Maybe the upgrade process isn't hitless anymore, I hadn't noticed the 404 issue before, so perhaps that's a new thing. Perhaps it's worth implementing a temporary workaround to prevent instability, but we should also address the underlying issue that causes some requests to result in a 404 error during an upgrade.
Hi @shawnh2 , @alexwo - that's interesting input. Yes, I assume that if failures are 404s, then we have a problem with new envoy proxies receiving traffic before they are programmed.
In that case, I think we can split this activity to two parts:
- Run load and log load stats, but only assert on success of converging single-user requests. Basically, only make sure that restarted proxies are eventually functional and stop this test from being flaky.
- Investigate how to fix the underlying issue causing 404s.
WDYT?
Hi @shawnh2 , @alexwo - that's interesting input. Yes, I assume that if failures are 404s, then we have a problem with new envoy proxies receiving traffic before they are programmed.
In that case, I think we can split this activity to two parts:
- Run load and log load stats, but only assert on success of converging single-user requests. Basically, only make sure that restarted proxies are eventually functional and stop this test from being flaky.
- Investigate how to fix the underlying issue causing 404s.
WDYT?
It's also possible that existing proxies fail, as they may refresh their cache when attempting to obtain a new cache snapshot version from newly started EGs. Maybe something as initialDelaySeconds suggested here can help: https://github.com/envoyproxy/gateway/issues/2810#issuecomment-1981979019
Or potentially: https://github.com/envoyproxy/gateway/pull/2918
I'm not sure that EG infra controller, will re-deploy proxies in every EG upgrade test, perhaps only if there are changes that require re-deployment?
this test skiped in https://github.com/envoyproxy/gateway/pull/3306, can you remove this SkipTests?
https://github.com/envoyproxy/gateway/blob/d32256c3febf2469dc6b5f70ac30bb7ee61bc579/test/e2e/upgrade/eg_upgrade_test.go#L56
hey @shawnh2 is this PR still needed ?
close this for now, will raise a fix if we have a clue.