gateway icon indicating copy to clipboard operation
gateway copied to clipboard

fix e2e EnvoyShutdown

Open shawnh2 opened this issue 1 year ago • 7 comments

What type of PR is this?

What this PR does / why we need it:

  • add test into codecov ignore path
  • trying to fix #3262 by increasing the HTTPRoute Duration and move load generation after the deployment has been restarted
  • fix the order of merge-gateways e2e test in makefile

Which issue(s) this PR fixes:

Fixes #3262

shawnh2 avatar Apr 26 '24 08:04 shawnh2

/retest

shawnh2 avatar Apr 28 '24 10:04 shawnh2

move load generation after the deployment has been restarted

Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..

guydc avatar Apr 29 '24 11:04 guydc

move load generation after the deployment has been restarted

Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..

Thanks for the clarification, by root causing this e2e test case, sometimes after the restart process, the url that one request send to may be a 404 path, meaning there are some loss requests.

Maybe by sending another request to detect the url path right after the restart process can avoid this situation.

shawnh2 avatar Apr 30 '24 02:04 shawnh2

move load generation after the deployment has been restarted

Hi @shawnh2! The purpose of the test is to run load during the restart and assert that there's no loss of requests... Running it after restart means that we're not really testing for hitless shutdown anymore..

Thanks for the clarification, by root causing this e2e test case, sometimes after the restart process, the url that one request send to may be a 404 path, meaning there are some loss requests.

Maybe by sending another request to detect the url path right after the restart process can avoid this situation.

Maybe the upgrade process isn't hitless anymore, I hadn't noticed the 404 issue before, so perhaps that's a new thing. Perhaps it's worth implementing a temporary workaround to prevent instability, but we should also address the underlying issue that causes some requests to result in a 404 error during an upgrade.

alexwo avatar Apr 30 '24 08:04 alexwo

Hi @shawnh2 , @alexwo - that's interesting input. Yes, I assume that if failures are 404s, then we have a problem with new envoy proxies receiving traffic before they are programmed.

In that case, I think we can split this activity to two parts:

  • Run load and log load stats, but only assert on success of converging single-user requests. Basically, only make sure that restarted proxies are eventually functional and stop this test from being flaky.
  • Investigate how to fix the underlying issue causing 404s.

WDYT?

guydc avatar Apr 30 '24 12:04 guydc

Hi @shawnh2 , @alexwo - that's interesting input. Yes, I assume that if failures are 404s, then we have a problem with new envoy proxies receiving traffic before they are programmed.

In that case, I think we can split this activity to two parts:

  • Run load and log load stats, but only assert on success of converging single-user requests. Basically, only make sure that restarted proxies are eventually functional and stop this test from being flaky.
  • Investigate how to fix the underlying issue causing 404s.

WDYT?

It's also possible that existing proxies fail, as they may refresh their cache when attempting to obtain a new cache snapshot version from newly started EGs. Maybe something as initialDelaySeconds suggested here can help: https://github.com/envoyproxy/gateway/issues/2810#issuecomment-1981979019

Or potentially: https://github.com/envoyproxy/gateway/pull/2918

I'm not sure that EG infra controller, will re-deploy proxies in every EG upgrade test, perhaps only if there are changes that require re-deployment?

alexwo avatar Apr 30 '24 12:04 alexwo

this test skiped in https://github.com/envoyproxy/gateway/pull/3306, can you remove this SkipTests?

https://github.com/envoyproxy/gateway/blob/d32256c3febf2469dc6b5f70ac30bb7ee61bc579/test/e2e/upgrade/eg_upgrade_test.go#L56

zirain avatar Apr 30 '24 14:04 zirain

hey @shawnh2 is this PR still needed ?

arkodg avatar May 16 '24 22:05 arkodg

close this for now, will raise a fix if we have a clue.

shawnh2 avatar May 17 '24 00:05 shawnh2