agones icon indicating copy to clipboard operation
agones copied to clipboard

Flaky: TestGameServerRestartBeforeReadyCrash

Open markmandel opened this issue 3 years ago • 2 comments

What happened:

The test TestGameServerRestartBeforeReadyCrash is quite flaky!

Short version:

--- FAIL: TestGameServerRestartBeforeReadyCrash (213.29s)
    gameserver_test.go:353: 
        	Error Trace:	gameserver_test.go:353
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestGameServerRestartBeforeReadyCrash

What you expected to happen:

The test to pass consistently.

How to reproduce it (as minimally and precisely as possible):

Run builds.

Anything else we need to know?:

Long log:

time="2022-01-21 05:46:12.582" level=info msg="Waiting for us to have an address to send things to" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="Waiting for states to match" awaitingState=Scheduled currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="GameServer created" gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="Dialing UDP message to address" address="35.247.116.52:7735" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="crashing, and waiting to see restart" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:18.285" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:21.030" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:23.832" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:26.831" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:29.633" level=info msg="successfully crashed. Moving on!" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:32.769" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:36.168" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:39.369" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:41.567" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:41.568" level=info msg="marking GameServer as ready" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:44.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:47.431" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:50.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:53.035" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:56.237" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:59.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:01.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:04.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:07.232" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:10.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:13.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:17.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:19.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:22.634" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:26.032" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:29.035" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:31.449" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:34.447" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:37.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:40.233" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:43.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:46.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:49.451" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:52.232" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:54.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:57.469" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:00.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:02.641" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:05.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:07.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:10.233" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:13.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:15.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:18.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:21.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:24.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:27.236" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:30.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:32.832" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:35.635" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:38.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:41.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:44.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:47.436" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:50.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:52.833" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:55.067" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:57.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:00.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:03.030" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:05.236" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:07.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:09.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:11.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:13.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:15.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:17.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:19.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:21.032" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:22.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:24.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:25.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:27.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:28.832" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:30.232" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:31.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:32.831" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:33.634" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:34.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:35.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:36.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:36.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:37.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:37.835" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:38.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:38.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:39.233" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:39.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:40.162" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:40.635" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:41.032" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:41.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:41.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:42.161" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:42.632" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:43.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:43.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:43.837" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:44.161" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:44.634" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.037" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.037" level=info msg="crashing again, should be unhealthy" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.260" level=info msg="checking final crash state" gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.260" level=info msg="Unhealthy! We are done!" test=TestGameServerRestartBeforeReadyCrash
--- FAIL: TestGameServerRestartBeforeReadyCrash (213.29s)
        	Test:       	TestGameServerRestartBeforeReadyCrash

Environment: N/A

  • Agones version: dev
  • Kubernetes version (use kubectl version): 1.21.0
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s): See above
  • Others: N/A

markmandel avatar Jan 21 '22 15:01 markmandel

This has happened in a few builds lately, but I think the line numbers may have changed - or not

22: e2e-feature-gates https://github.com/googleforgames/agones/pull/2547#issuecomment-1104473512

--- FAIL: TestGameServerRestartBeforeReadyCrash (223.08s)
    gameserver_test.go:353: 
        	Error Trace:	gameserver_test.go:353
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestGameServerRestartBeforeReadyCrash

markmandel avatar Apr 23 '22 02:04 markmandel

Just tested this locally, looks like this is still an issue.

root@8308eef5623c:/go/src/agones.dev/agones/test/e2e# go test -race -run TestGameServerRestartBeforeReadyCrash -count 100
....
INFO[2022-08-26 19:45:35.856] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.057] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.258] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.461] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.666] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.862] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.060] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.260] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.452] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.658] successfully crashed. Moving on!              test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.691] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:38.728] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:39.731] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:40.732] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:41.733] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.729] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.772] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.772] marking GameServer as ready                   test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.846] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.129] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.333] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.536] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.740] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.930] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.138] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.330] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.531] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.730] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.930] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.133] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.334] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.539] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.720] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.941] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.136] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.342] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.536] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.733] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.940] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.120] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.325] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.528] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.737] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.937] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.130] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.336] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.531] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.744] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.931] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.135] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.333] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.531] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.722] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.937] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.148] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.344] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.539] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.737] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.935] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.138] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.337] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.533] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.728] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.933] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.133] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.342] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.533] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.731] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.936] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.126] sending message                               fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.330] sending message                               fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.533] sending message                               fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.720] ready! Moving On!                             test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.720] crashing again, should be unhealthy           test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.806] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.806] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.086] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.087] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.292] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.293] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.496] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.497] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.694] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.694] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.899] checking final crash state                    gs=game-serverj55pk state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.900] Unhealthy! We are done!                       test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:55.016] Waiting for us to have an address to send things to  test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:55.082] Waiting for states to match                   awaitingState=Scheduled currentState=Creating gs=game-serverw57fc test=TestGameServerRestartBeforeReadyCrash
panic: test timed out after 10m0s

goroutine 4655 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/go/src/testing/testing.go:1788 +0xbb
created by time.goFunc
        /usr/local/go/src/time/sleep.go:180 +0x4a

goroutine 1 [chan receive]:
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1225 +0x635
testing.tRunner(0xc00017eb60, 0xc000157980)
        /usr/local/go/src/testing/testing.go:1265 +0x269
testing.runTests(0xc000490680, {0x3299400, 0x46, 0x46}, {0x1204310, 0xc00052a1c0, 0x32abd00})
        /usr/local/go/src/testing/testing.go:1596 +0x7cb
testing.(*M).Run(0xc000490680)
        /usr/local/go/src/testing/testing.go:1504 +0x9d2
agones.dev/agones/test/e2e.TestMain(0x400)
        /go/src/agones.dev/agones/test/e2e/main_test.go:94 +0x9e8
main.main()
        _testmain.go:183 +0x265

goroutine 4 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/klog/v2/klog.go:1181 +0x8b
created by k8s.io/klog/v2.init.0
        /go/src/agones.dev/agones/vendor/k8s.io/klog/v2/klog.go:420 +0x1c5

goroutine 30 [IO wait]:
internal/poll.runtime_pollWait(0x7f5b3133a7d8, 0x72)
        /usr/local/go/src/runtime/netpoll.go:229 +0x89
internal/poll.(*pollDesc).wait(0xc00011e118, 0xc000012000, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00011e100, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x419
net.(*netFD).Read(0xc00011e100, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/net/fd_posix.go:56 +0x51
net.(*conn).Read(0xc000306010, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/net/net.go:183 +0xb1
crypto/tls.(*atLeastReader).Read(0xc0001319b0, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/crypto/tls/conn.go:777 +0x86
bytes.(*Buffer).ReadFrom(0xc0000ae278, {0x255b700, 0xc0001319b0})
        /usr/local/go/src/bytes/buffer.go:204 +0x113
crypto/tls.(*Conn).readFromUntil(0xc0000ae000, {0x255e220, 0xc000306010}, 0x5)
        /usr/local/go/src/crypto/tls/conn.go:799 +0x1df
crypto/tls.(*Conn).readRecordOrCCS(0xc0000ae000, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:606 +0x3fe
crypto/tls.(*Conn).readRecord(...)
        /usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc0000ae000, {0xc00028b000, 0x1000, 0x0})
        /usr/local/go/src/crypto/tls/conn.go:1277 +0x29c
bufio.(*Reader).Read(0xc000700540, {0xc00027e3c0, 0x9, 0x9})
        /usr/local/go/src/bufio/bufio.go:227 +0x4db
io.ReadAtLeast({0x255b560, 0xc000700540}, {0xc00027e3c0, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:328 +0xde
io.ReadFull(...)
        /usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc00027e3c0, 0x9, 0x9}, {0x255b560, 0xc000700540})
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/frame.go:237 +0x96
golang.org/x/net/http2.(*Framer).ReadFrame(0xc00027e380)
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/frame.go:498 +0x108
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0005d3f78)
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:2101 +0x1f3
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0004b1080)
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:1997 +0xb5
created by golang.org/x/net/http2.(*Transport).newClientConn
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:725 +0x14cb

goroutine 4703 [select]:
k8s.io/apimachinery/pkg/util/wait.poller.func1.1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:708 +0x2ed
created by k8s.io/apimachinery/pkg/util/wait.poller.func1
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:691 +0x12c

goroutine 4701 [select]:
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext({0x2589408, 0xc0000420c8}, 0xc000323260, 0x18)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:658 +0x189
k8s.io/apimachinery/pkg/util/wait.poll({0x2589408, 0xc0000420c8}, 0x1, 0x198f601, 0x4a1045)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594 +0xe6
k8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext({0x2589408, 0xc0000420c8}, 0xc0003f7540, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:526 +0x66
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0xc0002edab0, 0x22a41e0, 0x4)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:512 +0x71
agones.dev/agones/test/e2e/framework.(*Framework).WaitForGameServerState(0xc000491f00, 0xc00017ed00, 0xc0000d5400, {0x22a968d, 0x9}, 0x0)
        /go/src/agones.dev/agones/test/e2e/framework/framework.go:257 +0x2dc
agones.dev/agones/test/e2e.TestGameServerRestartBeforeReadyCrash(0xc00017ed00)
        /go/src/agones.dev/agones/test/e2e/gameserver_test.go:278 +0x705
testing.tRunner(0xc00017ed00, 0x2366730)
        /usr/local/go/src/testing/testing.go:1259 +0x230
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1306 +0x727
exit status 2
FAIL    agones.dev/agones/test/e2e      600.514s

markmandel avatar Aug 26 '22 20:08 markmandel

I saw this fail recently as well. Grabbing as a good first issue.

zmerlynn avatar Nov 02 '22 18:11 zmerlynn

Re: https://github.com/googleforgames/agones/issues/2445#issuecomment-1228927983, the problem here (and I ran into it as well), is that -count <X> is still constrained by -timeout, as evidenced by:

INFO[2022-08-26 19:45:55.082] Waiting for states to match                   awaitingState=Scheduled currentState=Creating gs=game-serverw57fc test=TestGameServerRestartBeforeReadyCrash
panic: test timed out after 10m0s

I'm trying this locally with -count 100 -timeout 0 now, but I saw a flake here yesterday, too: https://github.com/googleforgames/agones/pull/2782#issuecomment-1298755291 .

zmerlynn avatar Nov 02 '22 21:11 zmerlynn

I feel like this one is a nightmare to replicate.

markmandel avatar Nov 02 '22 21:11 markmandel

Breadcrumbs: In the case of https://github.com/googleforgames/agones/issues/2445#issue-1110631394 and https://github.com/googleforgames/agones/pull/2782#issuecomment-1298755291 both, we see this kind of odd pattern:

Step #23 - "e2e-stable": time="2022-11-01 15:59:22.597" level=info msg="checking final crash state" gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:59:22.598" level=info msg="Unhealthy! We are done!" test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": --- FAIL: TestGameServerRestartBeforeReadyCrash (205.04s)
Step #23 - "e2e-stable":     gameserver_test.go:354:
Step #23 - "e2e-stable":                Error Trace:    gameserver_test.go:354
Step #23 - "e2e-stable":                Error:          Received unexpected error:
Step #23 - "e2e-stable":                                timed out waiting for the condition
Step #23 - "e2e-stable":                Test:           TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:59:23.025" level=info msg="Checking Fleet Ready replicas" expected=1 fleet=simple-fleet-1.08n4sb fleetStatus="{Replicas:2 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:2 Players:<nil>}" test="TestFleetRollingUpdate/Use_fleet_Patch_true_25%_cycle_true"

note that we're saying Unhealthy! We are done!, but then failing on a timeout. I included the line after it to understand the timing, which is very tight. Ok, but the Unhealthy! We are done! is from: https://github.com/googleforgames/agones/blob/dc3eb37f05e5c567e3d642ef4a77f0e59e993446/test/e2e/gameserver_test.go#L363

but the error trace is line 354 above it: https://github.com/googleforgames/agones/blob/dc3eb37f05e5c567e3d642ef4a77f0e59e993446/test/e2e/gameserver_test.go#L354

ETA: This is presumably because assert.NoError only calls t.Errorf - so we see a timeout and then continue on.

zmerlynn avatar Nov 02 '22 21:11 zmerlynn

I think in both cases I can see this pattern (** for emphasis):

Step #23 - "e2e-stable": time="2022-11-01 15:56:08.199" level=info msg="successfully crashed. Moving on!" test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:10.391" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:14.192" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:16.592" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:19.391" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:19.391" level=info msg="marking GameServer as ready" test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:22.198" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Scheduled test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:24.399" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Scheduled test=TestGameServerRestartBeforeReadyCrash
** Step #23 - "e2e-stable": time="2022-11-01 15:56:27.397" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:29.998" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:33.199" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:36.398" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash

In particular, it looks like during the flakes, the GameServer is always transitioning Scheduled -> Unhealthy.

zmerlynn avatar Nov 02 '22 21:11 zmerlynn

I have not been able to repro locally yet (hundreds of runs over the afternoon) - this may take some logs analysis.

I scoured the build logs using the hyper-advanced technique of:

gsutil -m cp gs://agones-build-logs/* .
egrep "Test:.*TestGameServerRestartBeforeReadyCrash" *.txt

Unfortunately, build retention is short enough that I only see the failure in https://github.com/googleforgames/agones/pull/2782#issuecomment-1298755291.

Going to follow up on the CI clusters and see if I can maybe poke at logs there.

zmerlynn avatar Nov 03 '22 00:11 zmerlynn

Another dupe: https://github.com/googleforgames/agones/pull/2790#issuecomment-1301529764

zmerlynn avatar Nov 03 '22 01:11 zmerlynn

So, I've been doing some logs analysis on https://github.com/googleforgames/agones/pull/2782#issuecomment-1298755291. What I see is that we crash the container twice, and due to backoff, there's a pretty big gap in health checks:

image

It's not clear this gap matters, but it's certainly curious. But then it does get to RequestReady state:

image

.. and fails just as quickly?

FWIW, the pod update there seems to be .. possibly irrelevant, not sure: image

Anyways, still looking, and now I have another dupe to look at tomorrow.

zmerlynn avatar Nov 03 '22 01:11 zmerlynn

I have a theory as to what's happening here. I'll distill it down to a time-sequence first, then show the logs. I think what's happening is:

time event
0 container c14f starts
1 kubelet updates Pod status to show container c14f as running (restartCount=0)
2 container c14f crashes (forced by CRASH message from e2e)
3 kubelet restarts container, container 7c7c starts
4 e2e sends another CRASH
5 kubelet updates Pod status to shows container 7c7c as running (restartCount=1)
6 e2e recognizes crash has succeeded and stops sending CRASH, moves on to send READY
7 container 7c7c crashes (forced by CRASH message at t=4)
8 kubelet restarts container, container 2e5d starts
9 container 2e5d receives READY, moving GameServer to RequestReady
10 controller sees RequestReady and moves GameServer to Ready, with GameServerReadyContainerIDAnnotation as 7c7c. (Reminder, controller has not seen a Pod update yet)
11 kubelet updates Pod status to shows container 2e5d as running (restartCount=2)
12 controller sees 2e5d != 7c7c and assumes Pod died after calling Ready().

We see echoes of this in the logs for https://github.com/googleforgames/agones/pull/2790#issuecomment-1301529764 (I am eliding the first container, c14f - the only importance of that container is that the e2e moved on from CRASH to READY):

  • 2022-11-03T00:46:03.552890Z: 7c7c dies
SyncLoop (PLEG): event for pod" pod="1667436145/game-server5m66g" event=&{ID:cc19f829-800c-47b6-b93e-619e07cddd66 Type:ContainerDied Data:7c7ce1cfe5f4717e5fbf14e68806a59a8babde538834ffbb015dcbb1ecb02157}
  • 2022-11-03T00:46:14.622672Z: 2e5d started
SyncLoop (PLEG): event for pod" pod="1667436145/game-server5m66g" event=&{ID:cc19f829-800c-47b6-b93e-619e07cddd66 Type:ContainerStarted Data:2e5db813bc74279ad454d1974637aa2f4c16f32ea77c942f300e6480db0bd7b4}
  • 2022-11-03T00:46:19.171202006Z: We see a Syncing RequestReady State. Note that this is the last log in that message - there's also a SDK.Ready() complete event recorded during the same second (46:19) that appears out of order in Logs Explorer due to lack of precision in the event log.

  • 2022-11-03T00:46:19.191195Z: just after SDK.Ready we see a Pod update that still shows 7c7c as running - this is why I am guessing the sequence around t=10 above:

          {
            "ready": true,
            "lastState": {
              "terminated": {
                "exitCode": 1,
                "reason": "Error",
                "finishedAt": "2022-11-03T00:45:57Z",
                "startedAt": "2022-11-03T00:45:55Z",
                "containerID": "containerd://c14f0b62e3cff080fd9662f1478fd06eef363a878dd732c45dfabf92017e56bd"
              }
            },
            "name": "game-server",
            "image": "gcr.io/agones-images/simple-game-server:0.14",
            "imageID": "gcr.io/agones-images/simple-game-server@sha256:564578aa131ae856bdb7aef3f389a34fc73e7e396b5838ae9bc65f80909b8b96",
            "containerID": "containerd://7c7ce1cfe5f4717e5fbf14e68806a59a8babde538834ffbb015dcbb1ecb02157",
            "state": {
              "running": {
                "startedAt": "2022-11-03T00:45:59Z"
              }
            },
            "started": true,
            "restartCount": 1
          }
  • 2022-11-03T00:46:20.772518856Z: agones-controller sees the container 2e5d for the first time and that 7c7c has failed:
{
  "insertId": "7tw4phgwr0rveiqq",
  "jsonPayload": {
    "containerStatuses": [
      {
        "imageID": "us-docker.pkg.dev/agones-images/ci/agones-sdk@sha256:73413d1cf61febccf1305f22195309c01b8be402cbde7f9fb2762b0f7fea543c",
        "image": "us-docker.pkg.dev/agones-images/ci/agones-sdk:1.28.0-66f21f3",
        "name": "agones-gameserver-sidecar",
        "started": true,
        "lastState": {},
        "ready": true,
        "restartCount": 0,
        "state": {
          "running": {
            "startedAt": "2022-11-03T00:45:55Z"
          }
        },
        "containerID": "containerd://197488cfc52f5cc6683c0c96d7c92bdc517675627b3e465c64ad02a5fb602216"
      },
      {
        "imageID": "gcr.io/agones-images/simple-game-server@sha256:564578aa131ae856bdb7aef3f389a34fc73e7e396b5838ae9bc65f80909b8b96",
        "containerID": "containerd://2e5db813bc74279ad454d1974637aa2f4c16f32ea77c942f300e6480db0bd7b4",
        "state": {
          "running": {
            "startedAt": "2022-11-03T00:46:14Z"
          }
        },
        "name": "game-server",
        "restartCount": 2,
        "image": "gcr.io/agones-images/simple-game-server:0.14",
        "ready": true,
        "lastState": {
          "terminated": {
            "containerID": "containerd://7c7ce1cfe5f4717e5fbf14e68806a59a8babde538834ffbb015dcbb1ecb02157",
            "reason": "Error",
            "finishedAt": "2022-11-03T00:46:00Z",
            "exitCode": 1,
            "startedAt": "2022-11-03T00:45:59Z"
          }
        },
        "started": true
      }
    ],
    "gs": "game-server5m66g",
    "message": "Container Failed",
    "container": "game-server",
    "source": "*gameservers.HealthController"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "location": "us-west1-c",
      "project_id": "agones-images",
      "namespace_name": "agones-system",
      "container_name": "agones-controller",
      "pod_name": "agones-controller-6b94677bc6-rnbwq",
      "cluster_name": "e2e-test-cluster"
    }
  },
  "timestamp": "2022-11-03T00:46:20.772518856Z",
  "severity": "DEBUG",
  "labels": {
    "k8s-pod/agones_dev/role": "controller",
    "k8s-pod/heritage": "Helm",
    "k8s-pod/release": "agones",
    "k8s-pod/app": "agones",
    "k8s-pod/pod-template-hash": "6b94677bc6",
    "compute.googleapis.com/resource_name": "gke-e2e-test-cluster-agones-system-79cf09e3-bgs7"
  },
  "logName": "projects/agones-images/logs/stderr",
  "receiveTimestamp": "2022-11-03T00:46:25.762176239Z"
}

zmerlynn avatar Nov 03 '22 16:11 zmerlynn

This is some super interesting analysis!

The thing I'm not 100% sure on with the analysis, is that this is the order of updates on the Pod first, then the GameServer for the GameServerReadyContainerIDAnnotation

https://github.com/googleforgames/agones/blob/3c38876382d505244af68489211d8916dce07596/pkg/gameservers/controller.go#L854-L864

So if the Pod is currently out of sync, the Update call should fail, since there is a newer generation in K8s.

Also, in the health controller, if the annotations are out of sync with each other (which I don't think should ever happen? Is that what you are definitely seeing here?) the health controller will return an error, to kick it into a retry, rather than move it to Unhealthy. 🤔

https://github.com/googleforgames/agones/blob/f9c333d25417fc5523583bb294ce520e7a33830b/pkg/gameservers/health.go#L255-L257

markmandel avatar Nov 08 '22 21:11 markmandel

| So if the Pod is currently out of sync, the Update call should fail, since there is a newer generation in K8s.

Hmm. I think that as long as step 10 and 11 occur in this order (per earlier comment):

time event
7 container 7c7c crashes (forced by CRASH message at t=4)
8 kubelet restarts container, container 2e5d starts
9 container 2e5d receives READY, moving GameServer to RequestReady
10 controller sees RequestReady and moves GameServer to Ready, with GameServerReadyContainerIDAnnotation as 7c7c. (Reminder, controller has not seen a Pod update yet)
11 kubelet updates Pod status to shows container 2e5d as running (restartCount=2)

then kubelet basically loses the race to update the pod (it's update in step 10, I didn't make that explicit).

| Also, in the health controller, if the annotations are out of sync with each other (which I don't think should ever happen? Is that what you are definitely seeing here?) the health controller will return an error, to kick it into a retry

I haven't seen the error from line 256, so I assume not.

zmerlynn avatar Nov 11 '22 22:11 zmerlynn

Wait! Yes, the second screenshot in https://github.com/googleforgames/agones/issues/2445#issuecomment-1301540694 had exactly that error message: pod and gameserver game-serverxsdhc data are out of sync, retrying

zmerlynn avatar Nov 11 '22 23:11 zmerlynn

~Oh, except that error is also exactly what you expect after the container restarts, that's right. (Trying to page this all back in.)~ (ignore me, I think I'm confused.)

zmerlynn avatar Nov 11 '22 23:11 zmerlynn

The health.go:256 error message is present, but I think that's expected in between the update to the Pod and the update to the GameServer.

This going to be a bit spammy with pictures, I hope they help. In order, I see:


  • game-server updates to RequestReady (I surmise this by callerSuppliedUserAgent being sdk-server/v0.0.0 [...], but I didn't include that in the screen shot) (this is step 9 in the sequence above): image

  • I think this is the updates from line 855 and line 861. Note that in the middle we see some red herring updates from other controllers, including the out of sync, retrying. This is expected between updates. This is step 10 in the sequence above: image

  • Right after, we signal SDK.Ready(): image

  • But then right after, I believe this is kubelet updating the pod (SyncLoop UPDATE is basically saying "I found an update during my sync loop"), after which we notice Container Failed. This is step 11 in the sequence above: image


I realize I am inferring a lot from the SyncLoop, but unfortunately we don't keep audit logs on Pod.Status updates, so it's hard to draw a strong conclusion. We know that line 855/861 result in synchronized (but bad) state of GameServerReadyContainerIDAnnotation because we see the skipUnhealthyGameContainer: Container crashed after Ready, returning false message that happens after the HealthController verifies the pod/GS match: https://github.com/googleforgames/agones/blob/d5cf2b0c8ed5e085cd532ae476b0dbdd15b35a29/pkg/gameservers/health.go#L279

So I'm pretty confident in this analysis. I think if kubelet loses the race, we can't actually use the container ID the way it's being used here.

I question the value of this feature and would like to return to an earlier question I asked on chat: Is this solving a real problem? Do customers often have workloads that flap before Ready(), i.e. a "settling period"? Or is this just kind of a nice to have feature? If it's the latter, I'd really recommend we just move this to "any termination => Unhealthy" and eliminate this logic. It's really complicated to get right.

If it's a really necessary feature for some usecase, I think there's a better way to handle this, but it's invasive: The game server container can generate a random cookie/nonce and communicate that with Ready() and on healthchecks. On healthcheck, the SDK checks the cookie and fail healthchecks immediately if there's any mismatch. Basically, take any of the container analysis out of it and just make this strictly a conversation between the game server and the sidecar - that should be a lot more bulletproof at detecting that the game server that called Ready() is also the one that's currently running.

ETA: Another option to keep the feature might be to implement some delay to RequestReady. We could implement a speed-bump (even 1s would probably close it) at https://github.com/googleforgames/agones/blob/d5cf2b0c8ed5e085cd532ae476b0dbdd15b35a29/pkg/sdkserver/sdkserver.go#L443

zmerlynn avatar Nov 12 '22 00:11 zmerlynn

Also, this race only really exists because the game server in this case was able to call Ready() so quickly. Another way to close the race from the unit test perspective is to add a speed bump to simple-game-server (it could even be done for only this test). With a minor delay, I expect kubelet will always win the race.

If we fix it this way, we are effectively agreeing that in a narrow set of circumstances, the GameServer may land in Unhealthy unnecessarily - i.e. we accept that the race exists but in 99.9% of cases it just doesn't matter. That's an ok compromise, I suspect.

zmerlynn avatar Nov 14 '22 18:11 zmerlynn