synapse icon indicating copy to clipboard operation
synapse copied to clipboard

Complement `TestPartialStateJoin/Outgoing_device_list_updates/Device_list_updates_reach_newly_joined_servers_in_partial_state_rooms` is flakey

Open squahtx opened this issue 3 years ago • 3 comments

https://github.com/matrix-org/synapse/actions/runs/3172224388/jobs/5166469780#step:4:2964

      client.go:604: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (125.434036ms)
      federation_room_join_partial_state_test.go:1674: @t24alice:hs1's partial state join to !0:host.docker.internal:38237 completed.
      client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/devices/PBJFBBTZIN => 200 OK (18.65378ms)
      federation_room_join_partial_state_test.go:1677: @t24alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:1534: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"PBJFBBTZIN","org.matrix.opentracing_context":"{}","prev_id":[35],"stream_id":38,"user_id":"@t24alice:hs1"}
      federation_room_join_partial_state_test.go:1534: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"PBJFBBTZIN","org.matrix.opentracing_context":"{}","prev_id":[36],"stream_id":40,"user_id":"@t24alice:hs1"}
      federation_room_join_partial_state_test.go:1680: @charlie, @derek and @elsie received device list update.
      federation_room_join_partial_state_test.go:1534: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"PBJFBBTZIN","org.matrix.opentracing_context":"{}","prev_id":[38],"stream_id":39,"user_id":"@t24alice:hs1"}
  panic: test timed out after 1h0m0s
  
  goroutine 3428 [running]:
  testing.(*M).startAlarm.func1()
  	/opt/hostedtoolcache/go/1.17.13/x64/src/testing/testing.go:1788 +0x8e
  created by time.goFunc
  	/opt/hostedtoolcache/go/1.17.13/x64/src/time/sleep.go:180 +0x31

Synapse sent an extra device update, which the test wasn't interested in. I think putting the device update in a channel blocked, and then the test timed out after an hour.

squahtx avatar Oct 03 '22 10:10 squahtx

I suspect this has the same cause as https://github.com/matrix-org/synapse/issues/13977. Because the test did not wait for the partial state join to complete fully, the final device list update happened before the join completed and was re-sent in case it was missed.

This ought to be fixed by https://github.com/matrix-org/complement/pull/487.

squahtx avatar Oct 03 '22 15:10 squahtx

It's happened again despite https://github.com/matrix-org/complement/pull/487 https://github.com/matrix-org/synapse/actions/runs/3188796603/jobs/5201917557#step:6:2936

      client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0:host.docker.internal:39435/members => 200 OK (82.12876ms)
      federation_room_join_partial_state_test.go:1681: @t24alice:hs1's partial state join to !0:host.docker.internal:39435 completed.
      client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/devices/HDRGUEKYVH => 200 OK (22.403662ms)
      federation_room_join_partial_state_test.go:1684: @t24alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:1541: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"HDRGUEKYVH","org.matrix.opentracing_context":"{}","prev_id":[35],"stream_id":37,"user_id":"@t24alice:hs1"}
      federation_room_join_partial_state_test.go:1541: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"HDRGUEKYVH","org.matrix.opentracing_context":"{}","prev_id":[36],"stream_id":40,"user_id":"@t24alice:hs1"}
      federation_room_join_partial_state_test.go:1687: @charlie, @derek and @elsie received device list update.
      federation_room_join_partial_state_test.go:1541: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"HDRGUEKYVH","org.matrix.opentracing_context":"{}","prev_id":[37],"stream_id":39,"user_id":"@t24alice:hs1"}
  panic: test timed out after 1h0m0s

Why is Synapse sending an extra device list update?

squahtx avatar Oct 05 '22 17:10 squahtx

It's happened again despite https://github.com/matrix-org/complement/pull/487 https://github.com/matrix-org/synapse/actions/runs/3188796603/jobs/5201917557#step:6:2936

Actually this failure mode is different to the original issue. Here the 2nd device list update is duplicated, instead of the 3rd.

squahtx avatar Oct 05 '22 18:10 squahtx

This is a pretty annoying failure mode, since it causes complement to take a full hour to time out.

squahtx avatar Oct 07 '22 14:10 squahtx

https://github.com/matrix-org/synapse/actions/runs/3232704311/jobs/5293638472

Note that this time the test failed after 3s, so presumably a different cause to the 1-hour timeout Sean saw above.

DMRobertson avatar Oct 12 '22 10:10 DMRobertson

https://github.com/matrix-org/synapse/actions/runs/3329668215/jobs/5507175203 failed after 5 sec

DMRobertson avatar Oct 26 '22 14:10 DMRobertson

https://github.com/matrix-org/synapse/actions/runs/3337525454/jobs/5524055737

DMRobertson avatar Oct 27 '22 13:10 DMRobertson

https://github.com/matrix-org/synapse/actions/runs/3345575678/jobs/5541376258 is another 1-hour timeout.

We have a bunch of goroutine stack dumps here.

I struggled to understand it (grep for /runner) but it looked like we were blocked (deadlocked?) waiting to receive something over a channel.

DMRobertson avatar Oct 28 '22 14:10 DMRobertson

Kegan advises that we should

  • buffer the channels, so that we don't block putting stuff into them
  • then move on with our lives.

I'd hazard a guess it's due to https://github.com/matrix-org/complement/blob/fecbe23bb835221ebefe512ae20fc7900405f061/tests/federation_room_join_partial_state_test.go#L1905 - if the select times out, we stop consuming and never consume again

in which case it would be a failed test

in which case, just buffer the channel and move on with your life

See conversation starting here. I will PR this shortly.

DMRobertson avatar Oct 28 '22 15:10 DMRobertson

https://github.com/matrix-org/synapse/actions/runs/3602619515/jobs/6070924882

DMRobertson avatar Dec 02 '22 16:12 DMRobertson

(Might be just a generic form of failure with partial joins rather than something wrong with this test)

DMRobertson avatar Dec 02 '22 16:12 DMRobertson

That failure mode should be fixed by https://github.com/matrix-org/complement/pull/582

erikjohnston avatar Jan 12 '23 09:01 erikjohnston

https://github.com/matrix-org/synapse/actions/runs/4279170219/jobs/7449734683 seen today

DMRobertson avatar Feb 27 '23 11:02 DMRobertson