federation catchup is slow and network-heavy
If a remote HS takes a long time to process a transaction (perhaps because of #9490), then we will enter "catchup mode" for that homeserver. In this mode, we only ever send the most recent event per room, which means that the remote homeserver has to backfill any intermediate events, which is comparatively very slow (and may require calling `/state_ids`; see https://github.com/matrix-org/synapse/issues/7893 and https://github.com/matrix-org/synapse/issues/6597).
So, we have a homeserver which is struggling to keep up, and we are making its life even harder by only sending it a subset of the traffic in the room.
I'm not at all sure why we do this. Shouldn't we send all the events that the remote server hasn't received (up to a limit)?
What does the spec say? Shouldn't it try to send all events it can queue up (within reason)?
It looks like we fetch the oldest events that the remote has not yet received: https://github.com/matrix-org/synapse/blob/0a00b7ff14890987f09112a2ae696c61001e6cf1/synapse/federation/sender/per_destination_queue.py#L461-L465
https://github.com/matrix-org/synapse/blob/0a00b7ff14890987f09112a2ae696c61001e6cf1/synapse/storage/databases/main/transactions.py#L424-L442
rather than the most recent event per room?
`destination_rooms` stores the `stream_ordering` of the most recent event in each room that we should have sent to each destination, rather than the most recent event that we actually did send. So when we join to `events`, we get the event ID of that single most recent event per room.
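To spell that out, the catch-up query amounts to something like the sketch below. The table and column names (`destination_rooms`, `stream_ordering`, `events`, `event_id`, `destination`) are the ones discussed above; the function itself and the batch limit are illustrative rather than Synapse's actual code.

```python
import sqlite3
from typing import List


def get_catch_up_event_ids_sketch(
    conn: sqlite3.Connection,
    destination: str,
    last_successful_stream_ordering: int,
) -> List[str]:
    """Illustrative only: destination_rooms holds a single stream_ordering per
    (destination, room), so joining it to events can surface at most one event
    per room, however far behind the destination is."""
    sql = """
        SELECT event_id
          FROM destination_rooms
          JOIN events USING (stream_ordering)
         WHERE destination = ?
           AND stream_ordering > ?
         ORDER BY stream_ordering ASC
         LIMIT 50
    """
    return [
        row[0]
        for row in conn.execute(sql, (destination, last_successful_stream_ordering))
    ]
```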
I wonder if another thing at play here is the interaction of all the different servers in the room. Take a busy room like Matrix HQ: after a few hours of downtime, a server will receive the latest event from every server that sent an event during that time. For each one, the receiving server will first do `/get_missing_events` to fetch up to ten missing events, then do a `/state_ids` request if there is still a gap. This means that for a single room the server can end up processing a lot more than just the latest event, and end up with many small chunks of DAG in the gap between it going offline and coming back online.
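Roughly the per-event flow I mean, as a sketch; the `FederationClient` interface and the helper names are hypothetical, only the overall shape (a `/get_missing_events` limited to ten, then `/state_ids` if the gap is still open) follows what's described above.

```python
from typing import List, Protocol, Sequence


class FederationClient(Protocol):
    """Hypothetical stand-in for the real federation client."""

    async def get_missing_events(
        self,
        origin: str,
        room_id: str,
        earliest_events: Sequence[str],
        latest_events: Sequence[str],
        limit: int,
    ) -> List[dict]:
        ...

    async def get_state_ids(
        self, origin: str, room_id: str, event_id: str
    ) -> List[str]:
        ...


def gap_remains(fetched: List[dict], known_event_ids: Sequence[str]) -> bool:
    # Crude placeholder: treat the gap as closed if any fetched event points
    # at something we already have. The real check walks prev_events properly.
    return not any(
        prev in known_event_ids
        for ev in fetched
        for prev in ev.get("prev_events", [])
    )


async def handle_pdu_with_gap(
    client: FederationClient,
    origin: str,
    room_id: str,
    new_event_id: str,
    our_forward_extremities: Sequence[str],
) -> None:
    # Cheap path first: ask the sending server for up to ten events between
    # our current extremities and the newly received event.
    missing = await client.get_missing_events(
        origin,
        room_id,
        earliest_events=our_forward_extremities,
        latest_events=[new_event_id],
        limit=10,
    )
    # ... persist/process `missing` here ...

    # If ten events weren't enough to close the gap, fall back to the
    # expensive path and fetch the full state at the new event.
    if gap_remains(missing, our_forward_extremities):
        state_ids = await client.get_state_ids(origin, room_id, new_event_id)
        # ... fetch and process the state events referenced by `state_ids` ...
```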
Maybe rooms need something similar to `ResponseCache.wrap()`, in the sense that fetching the backlog and state IDs for a room is handled by one worker, and every subsequent request is queued behind that until it is done. This might require closer and more complex coordination (even more so across workers), but it would solve that problem.
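As a rough single-process sketch of that idea (`PerRoomGapFiller` and its `wrap` method are made up for illustration; `ResponseCache.wrap()` is the existing helper I mean):

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, TypeVar

T = TypeVar("T")


class PerRoomGapFiller:
    """Run at most one gap-filling operation per room at a time; callers that
    arrive while one is already in flight just await the same result."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, asyncio.Task] = {}

    async def wrap(
        self, room_id: str, fill_gap: Callable[[], Coroutine[Any, Any, T]]
    ) -> T:
        task = self._in_flight.get(room_id)
        if task is None:
            task = asyncio.create_task(fill_gap())
            self._in_flight[room_id] = task
            # Forget the entry once done, so a later gap starts a fresh fill.
            task.add_done_callback(lambda _: self._in_flight.pop(room_id, None))
        return await task
```

With something like that, ten servers all delivering their latest Matrix HQ event at once would trigger one backlog/`/state_ids` round rather than ten; doing the same across workers would need the extra coordination mentioned above.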
I wonder if Synapse should only send out events during catch-up if they're a forward extremity, in the hope that the server that sent the subsequent events in the room will send the latest event? It's possible that the other server won't, but that feels like a rare case. Doing so should mean that the receiving server only receives the current extremities it has missed, rather than an additional smattering of events at different points in the gap.
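For illustration only (made-up helpers, not necessarily how this would actually be implemented in Synapse), the idea boils down to:

```python
from typing import Dict, List, Sequence


def forward_extremities(prev_events_by_id: Dict[str, Sequence[str]]) -> List[str]:
    """Given event_id -> prev_event_ids for the events we have in a room,
    return the events that nothing else references: the forward extremities."""
    referenced = {prev for prevs in prev_events_by_id.values() for prev in prevs}
    return [event_id for event_id in prev_events_by_id if event_id not in referenced]


def should_send_in_catch_up(
    event_id: str, prev_events_by_id: Dict[str, Sequence[str]]
) -> bool:
    """Only send the room's 'latest' event if it is still a forward extremity;
    if something has already built on top of it, assume the server that sent
    the newer event will deliver it instead."""
    return event_id in forward_extremities(prev_events_by_id)
```

e.g. with `{"$A": [], "$B": ["$A"]}`, `$B` would still be sent but `$A` would be skipped, since `$B` has already built on top of it.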
Note that on March 18th a PR was merged that implements Erik's suggestion from above: https://github.com/matrix-org/synapse/pull/9640
yes, I've updated the summary of this issue, since it's no longer a simple matter of optimisation.