TestOutboundFederationIgnoresMissingEventWithBadJSONForRoomVersion6 causes other tests to be flaky

Open blackmad opened this issue 1 year ago • 0 comments

In my synapse fork, I was seeing consistent flakiness when running complement on GH actions.

These tests were failing:

❌ TestFederationKeyUploadQuery (580ms)
❌ TestKnockingInMSC3787Room (570ms)
❌ TestRestrictedRoomsRemoteJoinFailOverInMSC3787Room (6.73s)
❌ TestToDeviceMessagesOverFederation (7.73s)
❌ TestToDeviceMessagesOverFederation/interrupted_connectivity (6.14s)
❌ TestToDeviceMessagesOverFederation/stopped_server (20ms)

When adding debugging, I saw

"""
synapse_main | 2024-12-04 15:52:18,608 - synapse.federation.federation_base - 303 - ERROR - _process_incoming_pdus_in_room_inner-4-$Vp3-8StRMJno4kS-21Sr_LSK47Gk9HS4CIBdOjrsKbk - Invalid canonical JSON: {'auth_events': ['$3TeeAwpC6Edh_I_orHhXdCQLkztifWjjmSd78dU4qS0', '$7hX4UVoc-RUi5_agACZsqfotFvoqWkdwE0DMmiRQDL8', '$-ylJZyK-Hyxn_WOewXKLVD1XnVdAC_p2lxDLqKiG2pM'], 'content': {'bad_val': 1.1, 'body': 'Message 1'}, 'depth': 6, 'hashes': {'sha256': '+PxMZ1aox2NRluRwq0ctXEKXZ2NMsJG0yyKIBl1EQzg'}, 'origin': 'host.docker.internal:38621', 'origin_server_ts': 1733327538532, 'prev_events': ['$PO2EDaOjQOTSYMB0cN588VLhNn3JXUK2tHnLAWeKYG4'], 'room_id': '!0-1WnNO2FKvuc021cNJ3:host.docker.internal:38621', 'sender': '@charlie:host.docker.internal:38621', 'signatures': {'host.docker.internal:38621': {'ed25519:complement_aeff3b6780deb126c603cb94fcaefc9f922ad031cbab161c6f32014bac2354d1': '2Eidd/749jgA9rtwlAg54OENuddfKqY3P9YYBkvcZl5wx43ZhfvdkV9E4tLPwhbKVokwE5mOrs1l8flvjDsHDA'}}, 'type': 'm.room.message'} 400: Bad JSON value: 1.1
"""

This was happening while fetching prev_events.

I noticed that the only place in complement where bad_val comes from is the test TestOutboundFederationIgnoresMissingEventWithBadJSONForRoomVersion6 https://github.com/matrix-org/complement/blob/fc63446512261a496b794a7082a6598b6f98e925/tests/federation_room_get_missing_events_test.go#L205

After I disabled that one test, the rest of my tests started passing.

So there seem to be two questions here: 1) Should Synapse be failing when it cannot deserialize prev_events? 2) Why is this one test polluting others in the database (and, relatedly, how do we isolate it)?

blackmad, Dec 19 '24 12:12