Complement `TestPartialStateJoin/Outgoing_device_list_updates/Device_list_updates_reach_newly_joined_servers_in_partial_state_rooms` is flaky
https://github.com/matrix-org/synapse/actions/runs/3172224388/jobs/5166469780#step:4:2964
```
client.go:604: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (125.434036ms)
federation_room_join_partial_state_test.go:1674: @t24alice:hs1's partial state join to !0:host.docker.internal:38237 completed.
client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/devices/PBJFBBTZIN => 200 OK (18.65378ms)
federation_room_join_partial_state_test.go:1677: @t24alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:1534: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"PBJFBBTZIN","org.matrix.opentracing_context":"{}","prev_id":[35],"stream_id":38,"user_id":"@t24alice:hs1"}
federation_room_join_partial_state_test.go:1534: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"PBJFBBTZIN","org.matrix.opentracing_context":"{}","prev_id":[36],"stream_id":40,"user_id":"@t24alice:hs1"}
federation_room_join_partial_state_test.go:1680: @charlie, @derek and @elsie received device list update.
federation_room_join_partial_state_test.go:1534: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"PBJFBBTZIN","org.matrix.opentracing_context":"{}","prev_id":[38],"stream_id":39,"user_id":"@t24alice:hs1"}
panic: test timed out after 1h0m0s
goroutine 3428 [running]:
testing.(*M).startAlarm.func1()
/opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1788 +0x8e
created by time.goFunc
/opt/hostedtoolcache/go/1.17.13/x64/src/time/sleep.go:180 +0x31
```
Synapse sent an extra device update, which the test wasn't interested in. I think putting the device update in a channel blocked, and then the test timed out after an hour.
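To illustrate the suspected mechanism: a send on an unbuffered Go channel blocks until a receiver takes the value. A minimal, self-contained sketch (hypothetical names, not the actual Complement test code) of how one unread update can wedge a test until `go test`'s alarm fires:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	updates := make(chan string) // unbuffered: each send blocks until read
	done := make(chan struct{})

	// Stand-in for Complement's federation server callback: deliver each
	// incoming m.device_list_update in order, then signal completion.
	go func() {
		for _, u := range []string{"update 1", "update 2", "extra update"} {
			updates <- u
		}
		close(done)
	}()

	// The test only cares about the first two updates...
	fmt.Println("got:", <-updates)
	fmt.Println("got:", <-updates)

	// ...so the delivery goroutine is now stuck sending the extra update
	// that nobody reads, and `done` is never closed. Under `go test` this
	// wait would sit until the 1h alarm panics; here we bound it to show
	// the hang.
	select {
	case <-done:
		fmt.Println("all updates delivered")
	case <-time.After(2 * time.Second):
		fmt.Println("hung: sender blocked on an unread channel send")
	}
}
```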
I suspect this has the same cause as https://github.com/matrix-org/synapse/issues/13977. Because the test did not wait for the partial state join to complete fully, the final device list update happened before the join completed and was re-sent in case it was missed.
This ought to be fixed by https://github.com/matrix-org/complement/pull/487.
It's happened again despite https://github.com/matrix-org/complement/pull/487: https://github.com/matrix-org/synapse/actions/runs/3188796603/jobs/5201917557#step:6:2936
```
client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0:host.docker.internal:39435/members => 200 OK (82.12876ms)
federation_room_join_partial_state_test.go:1681: @t24alice:hs1's partial state join to !0:host.docker.internal:39435 completed.
client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/devices/HDRGUEKYVH => 200 OK (22.403662ms)
federation_room_join_partial_state_test.go:1684: @t24alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:1541: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"HDRGUEKYVH","org.matrix.opentracing_context":"{}","prev_id":[35],"stream_id":37,"user_id":"@t24alice:hs1"}
federation_room_join_partial_state_test.go:1541: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"HDRGUEKYVH","org.matrix.opentracing_context":"{}","prev_id":[36],"stream_id":40,"user_id":"@t24alice:hs1"}
federation_room_join_partial_state_test.go:1687: @charlie, @derek and @elsie received device list update.
federation_room_join_partial_state_test.go:1541: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"HDRGUEKYVH","org.matrix.opentracing_context":"{}","prev_id":[37],"stream_id":39,"user_id":"@t24alice:hs1"}
panic: test timed out after 1h0m0s
```
Why is Synapse sending an extra device list update?
Actually, this failure mode is different from the original issue: here the 2nd device list update is duplicated, rather than the 3rd.
This is a pretty annoying failure mode, since it causes Complement to take a full hour to time out.
https://github.com/matrix-org/synapse/actions/runs/3232704311/jobs/5293638472
Note that this time the test failed after 3 seconds, so presumably this has a different cause from the 1-hour timeout Sean saw above.
https://github.com/matrix-org/synapse/actions/runs/3329668215/jobs/5507175203 failed after 5 seconds.
https://github.com/matrix-org/synapse/actions/runs/3337525454/jobs/5524055737
https://github.com/matrix-org/synapse/actions/runs/3345575678/jobs/5541376258 is another 1-hour timeout.
We have a bunch of goroutine stack dumps there. I struggled to understand them (grep for /runner), but it looked like we were blocked (deadlocked?) waiting to receive something over a channel.
Kegan advises that we should:
- buffer the channels, so that we don't block putting stuff into them, and
- then move on with our lives.
I'd hazard a guess that it's due to https://github.com/matrix-org/complement/blob/fecbe23bb835221ebefe512ae20fc7900405f061/tests/federation_room_join_partial_state_test.go#L1905: if the select times out, we stop consuming from the channel and never consume again. In that case the test has already failed anyway, so just buffer the channel and move on with your life.
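A minimal sketch of the suggested fix, under the assumption that a single test only ever receives a bounded number of EDUs (the names and the capacity of 100 are illustrative, not the actual Complement change):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Buffered with more capacity than any single test should ever
	// receive, so a send never blocks even after the test gives up.
	updates := make(chan string, 100)

	go func() {
		for _, u := range []string{"update 1", "update 2", "extra update"} {
			updates <- u // completes immediately while capacity remains
		}
	}()

	// The test still uses select with a timeout, but now a timeout just
	// fails the test; the sender is never left wedged on an unread send.
	for i := 0; i < 2; i++ {
		select {
		case u := <-updates:
			fmt.Println("got:", u)
		case <-time.After(5 * time.Second):
			fmt.Println("test failed: no device list update within 5s")
			return
		}
	}
}
```

With the buffer in place, a select timeout still fails the test, but it fails in seconds instead of leaving the sender blocked until the 1-hour alarm.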
See conversation starting here. I will PR this shortly.
https://github.com/matrix-org/synapse/actions/runs/3602619515/jobs/6070924882
(This might just be a generic failure mode with partial joins, rather than something wrong with this test.)
That failure mode should be fixed by https://github.com/matrix-org/complement/pull/582
Seen again today: https://github.com/matrix-org/synapse/actions/runs/4279170219/jobs/7449734683