synapse
replication timeouts due to message retention purge jobs
Description
Presumably due to https://github.com/matrix-org/synapse/pull/13632, the master process is unable to handle replication requests from workers because of the load from purge jobs. It happily logs updates on the purge job state while clients can no longer connect.
Steps to reproduce
- enable message retention
- possibly only reproducible on a large instance (unconfirmed)
- wait for the scheduled job to execute
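For reference, enabling message retention means something like the following in homeserver.yaml (a minimal sketch based on the Synapse message retention docs; the values are illustrative, not this server's actual config):

```yaml
# homeserver.yaml -- illustrative values only
retention:
  enabled: true
  default_policy:
    min_lifetime: 1d
    max_lifetime: 1y
  # purge_jobs controls how often purges run; the scheduled run
  # configured here is what triggers the overload described above
  purge_jobs:
    - interval: 12h
```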
Homeserver
tchncs.de
Synapse Version
1.94.0
Installation Method
pip (from PyPI)
Database
PostgreSQL
Workers
Multiple workers
Platform
Debian GNU/Linux 12 (bookworm), dedicated
Configuration
draupnir module, presence, retention
Relevant log output
synapse.replication.tcp.client - 352 - INFO - _process_incoming_pdus_in_room_inner-124023-$fbrT_6mck678v_gNV527V0f5Jp4kvbDiQVSeHOmiN2E - Finished waiting for repl stream 'events' to reach 361593234 (event_persister1)
synapse.http.client - 923 - INFO - PUT-890470 - Received response to POST synapse-replication://master/_synapse/replication/fed_send_edu/m.receipt/IjFSBKBxIa: 200
synapse.replication.tcp.client - 332 - INFO - PUT-890470 - Waiting for repl stream 'caches' to reach 416737455 (master); currently at: 416710210
synapse.replication.tcp.client - 342 - WARNING - PUT-890464 - Timed out waiting for repl stream 'caches' to reach 416737417 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.http._base - 300 - WARNING - GET-2559861 - presence_set_state request timed out; retrying
synapse.replication.http._base - 312 - WARNING - PUT-899550 - fed_send_edu request connection failed; retrying in 1s: ConnectError(<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>)
synapse.http.client - 932 - INFO - PUT-901284 - Error sending request to POST synapse-replication://master/_synapse/replication/fed_send_edu/m.presence/WCoECfmCdH: ConnectError [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
I think this issue is mostly that "message retention is expensive and must be done on the main process"?
It seems like there are a few issues here though:
- Message retention purging being expensive.
- Message retention being handled by the main process.
- Replication falling behind due to message retention using all CPU.
I'm unsure of the best approach to solve this, or if all 3 are somewhat needed.
We want to give our users a message retention promise, but it's currently not possible on large servers with databases in the hundreds of gigabytes.
Message retention purging being expensive.
I'm not sure this is solvable. Purging involves touching potentially lots of rows.
It seems to me there are a couple of possibilities:
- Leave purge jobs on the primary process, but cap their run time or do some sort of coroutine-style yielding to allow replication traffic to get CPU time. (There is prior art in limiting background jobs to a certain duration.)
- Move the purge job to a worker. I know this was tried before but was switched back to the primary process for what seems like lack of dev time, rather than a technical reason (though obviously communication happens outside GitHub issues).
  - This would require moving the `purge_history` function mentioned in the above issue to a place where workers can access it, thereby moving the purge job off the main process.
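The first option (time-boxed, yielding purges) could be sketched roughly as follows. This is a hypothetical illustration, not Synapse's actual API: the names, the time budget, and the batch size are all made up, and `delete_batch` stands in for the real DELETE transactions.

```python
import time

MAX_RUN_SECONDS = 0.05   # cap on a single purge run (illustrative value)
BATCH_SIZE = 100         # rows deleted per transaction (illustrative value)

def delete_batch(batch):
    """Placeholder for the actual database delete transaction."""
    pass

def purge_expired_events(expired_event_ids):
    """Delete expired events in batches until the time budget runs out.

    Returns the ids that were NOT purged, so the next scheduled run can
    resume where this one left off instead of hogging the process.
    """
    deadline = time.monotonic() + MAX_RUN_SECONDS
    remaining = list(expired_event_ids)
    while remaining and time.monotonic() < deadline:
        batch, remaining = remaining[:BATCH_SIZE], remaining[BATCH_SIZE:]
        delete_batch(batch)
        # In Twisted, this is where the job would yield control back to
        # the reactor (e.g. await a zero-delay deferred) so replication
        # requests get CPU time between batches.
    return remaining

leftover = purge_expired_events(range(10))
print(len(leftover))  # everything fits in the budget here, so prints 0
```

The key property is that a purge run can end early and resume later, rather than running to completion while replication streams time out.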
But since I'm not as familiar with Synapse's internals as the devs, I can't say which is the most attractive. This issue is very painful for us, though, and it's difficult to explain to stakeholders that despite this feature being well documented, it in fact breaks the server, and has been like that for quite a while.
This recent commit removing the experimental warning from retention is not a good idea IMO. While it's fabulous that there is no longer a risk of corruption bugs, as this issue establishes, there is a serious bug in the retention feature as it exists: your server will essentially stop working if you have a large number of events that need purging.
fb664 Remove warnings from the docs about using message retention.