Federation catchup doesn't send to_device EDUs until the remote end has caught up

Open matrixbot opened this issue 2 years ago • 2 comments

This issue has been migrated from #8691.


Description

When a remote server falls behind on federation, Synapse backs off and starts batching up requests. Usually this isn't too bad, as the remote end will only be 1 or 2 transactions behind; however, more serious occurrences can put the server behind by hundreds of transactions or thousands of events.

Many of the messages could be encrypted, which means they may be accompanied by the to_device EDUs that clients need in order to decrypt them. If those EDUs aren't sent as part of the catchup transactions, clients may be unable to decrypt messages, which makes users sad/angry.

Here's an example of this happening in real life: image

For background on this graph: t2bot.io (the server in question) runs 2 federation readers, 1 of which (03) is dedicated to just handling matrix.org's traffic. The other (04) is left to handle any other random server which might exist in the wild.

In the graph, t2bot.io was behind on matrix.org's transactions and thus had a very spiky waveform, because the 50-PDU transactions had to be retried. When it did catch up, it was also met with all the EDUs it had missed, creating a significant spike. Traffic after that is normal.

This has been observed on several catchups already, but was only noticed today (with Synapse 1.22.0) - it's unclear whether this is an issue in prior versions of Synapse, or a matrix.org federation sender-specific issue.
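To make the behaviour described above concrete, here is a toy model of how a backlog might be split into transactions. It is only a sketch, not Synapse's actual federation sender: `Transaction`, `plan_transactions` and the `edus_during_catchup` flag are hypothetical names invented for illustration. The limits of 50 PDUs and 100 EDUs per transaction are taken from the Matrix federation spec. With the flag off, to_device EDUs are held back until the last PDUs go out (the behaviour reported here); with it on, they ride along with the first catchup transaction.

```python
from dataclasses import dataclass, field
from typing import List

# Per-transaction limits from the Matrix federation spec.
MAX_PDUS_PER_TXN = 50
MAX_EDUS_PER_TXN = 100


@dataclass
class Transaction:
    pdus: List[str] = field(default_factory=list)
    edus: List[str] = field(default_factory=list)


def plan_transactions(pdus, to_device_edus, edus_during_catchup):
    """Split a backlog into transactions.

    Hypothetical helper for illustration only; not Synapse code.
    """
    pdus = list(pdus)
    edus = list(to_device_edus)
    txns = []
    while pdus or edus:
        txn = Transaction()
        # Fill the PDU slots of this transaction first.
        txn.pdus, pdus = pdus[:MAX_PDUS_PER_TXN], pdus[MAX_PDUS_PER_TXN:]
        # We count as "caught up" once no PDUs remain in the backlog.
        caught_up = not pdus
        # EDUs go out either straight away, or only once caught up.
        if edus_during_catchup or caught_up:
            txn.edus, edus = edus[:MAX_EDUS_PER_TXN], edus[MAX_EDUS_PER_TXN:]
        txns.append(txn)
    return txns


backlog = [f"$event{i}" for i in range(120)]
keys = ["m.room_key#1", "m.room_key#2", "m.room_key#3"]

deferred = plan_transactions(backlog, keys, edus_during_catchup=False)
eager = plan_transactions(backlog, keys, edus_during_catchup=True)

print([len(t.edus) for t in deferred])  # [0, 0, 3]
print([len(t.edus) for t in eager])     # [3, 0, 0]
```

In the deferred case, a receiving server that keeps failing or retrying the early transactions never reaches the one carrying the keys, which matches the undecryptable messages described in this issue.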

Version information

  • Homeserver: t2bot.io

If not matrix.org:

  • Version: 1.22.0 (with minor, unrelated, patches)

  • Install method: pip

  • Platform: Ubuntu 20.04, bare metal

matrixbot, Dec 18 '23 07:12

(From https://github.com/matrix-org/synapse/issues/8691#issuecomment-735861357)

This is a particular problem, because if your server spends a lot of time lagging behind, you can end up receiving room events but never the e2e keys for those events.

This is still a real problem, causing real UTDs. IMHO to-device events should be prioritised ahead of PDUs.

richvdh, Sep 23 '24 18:09

This exacerbates https://github.com/matrix-org/matrix-spec/issues/1123

richvdh, Sep 23 '24 18:09

We run into federation issues between matrix.org and our homeserver (matrix.systemli.org) more and more often, and we suspect it's this very problem.

doobry-systemli, Feb 04 '25 11:02

Maybe it's time to focus on this issue. More and more homeservers are facing federation issues with matrix.org (my homeserver had this issue last year).

MomentQYC, Apr 10 '25 01:04

I appear to have suffered this issue after failing to join Matrix HQ - my server doesn't think I'm in the room, so some Synapse servers (particularly matrix.org) in that room are failing to send me to_device messages because I'm "behind".

tcpipuk, Jun 12 '25 12:06

Is there anything I can do, or can I only wait until my server catches up?

ewof, Oct 02 '25 19:10