Sudden CPU spike to 100% and disk space decreasing fast
Description
I have "btop" always running on my second monitor.. Unpinned a message in a room (from Element Web) and all of the sudden (not sure if related) server's CPU went crazy to 100% for a minute or so, until when I restarted the ansible playbook.
In this minute my DB has grown by 1GB more or less but I lost more than 2GB of space and I don't know where exactly. After that I ran the rust-compress utility and did a FULL VACUUM and DB is back to normal now.
Steps to reproduce
I don't have any steps to reproduce; it happened out of the blue.
The only thing I was doing at that time was removing a pinned message from a room.
Homeserver
woodpeckersnest.space
Synapse Version
1.63.1
Installation Method
Other (please mention below)
Platform
I'm running the matrix-docker-ansible playbook on a Debian 11 VPS with 2GB RAM and 1 vCPU.
Relevant log output
Jul 21 21:42:11 matrix-synapse[1303919]: 2022-07-21 19:42:11,764 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655840531742 => stream_ordering 155865)
Anything else that would be useful to know?
No response
It happened again a few minutes ago, but it didn't last very long.
Jul 23 13:30:51 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:51,176 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851151 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,165 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851188 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,192 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852167 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,220 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852193 => stream_ordering 161950)
I can see you have Synapse's experimental support for message retention policies enabled, can you see other log lines including [purge] in your logs?
Also, what level are you logging at? From your report it sounds like WARNING, which isn't really enough to figure out what Synapse is doing at a specific point in time (you usually want INFO).
I would also recommend setting up monitoring with Prometheus and Grafana, as that would give better insight into which part of Synapse is causing this issue, and would help us assist you in case INFO logs aren't enough to go on.
I just enabled INFO logging but I can't see any new "purge" messages; I guess I'll have to wait for message retention policies to kick in.
Also, I can't run Prometheus and Grafana because of the very limited server specs.
I will update this post when I see new "purge" messages.
Thanks
@babolivier https://gist.githubusercontent.com/roughnecks/d8e78fd29382f4a7cb7031f310b9bdeb/raw/f9581b9046b0faaf072a75aaa192e15bb0051653/2022-07-27_22-40-44.txt
Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?
It looks like the way this is set up is that every N hours (or however long you've set the interval to) we look at every room with a retention policy and run a purge on them. That would probably explain the CPU usage if you have lots of such rooms.
I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?
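For reference, the interval mentioned above comes from the retention.purge_jobs section of homeserver.yaml (see Synapse's message retention docs). Below is a minimal Python sketch that prints how often those jobs are scheduled; the config path and the use of PyYAML are assumptions, and the path will differ on a playbook-managed install:

# Minimal sketch: print how often Synapse's retention purge jobs run, based
# on the retention.purge_jobs section of homeserver.yaml.
# The config path below is an assumption -- the ansible playbook keeps its
# config elsewhere, so adjust it for your install.
import yaml  # PyYAML

CONFIG_PATH = "/etc/matrix-synapse/homeserver.yaml"  # assumed location

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

retention = config.get("retention") or {}
if not retention.get("enabled"):
    print("message retention is not enabled")
else:
    # If purge_jobs is omitted, Synapse sets up a single daily job.
    jobs = retention.get("purge_jobs") or [{"interval": "1d"}]
    for job in jobs:
        print(
            "purge job every", job.get("interval", "1d"),
            "for max_lifetime between",
            job.get("shortest_max_lifetime", "(unbounded)"),
            "and",
            job.get("longest_max_lifetime", "(unbounded)"),
        )

Larger interval values mean this CPU hit happens less often, at the cost of expired events lingering for longer between runs.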
Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?
Yes.
I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?
Yeah, it keeps growing, but most of the growth happened on the first problematic purge; subsequent ones were faster to process and took less disk space. I have disabled message retention now.
To me it looks like everything is about the same size except ./postgres, and I wouldn't know what other dirs to check:
root@pandora:/matrix# du -h --max-depth=1
24K ./mautrix-telegram
20K ./static-files
23M ./nginx-proxy
8.0K ./mailer
1.2M ./ssl
62M ./mautrix-whatsapp
8.0K ./coturn
36K ./client-element
2.1G ./synapse
667M ./postgres-backup
4.0K ./rust-synapse-compress-state
8.0K ./heisenbridge
292K ./matrix-reminder-bot
2.3G ./postgres
5.1G .
I think I've seen the database growing a bit in the past when running purge jobs but didn't take the time to investigate. I think it could be caused by the fact that we queue up events we might want to delete in a temporary table and don't vacuum after dropping it. In general purge jobs aren't expected to free up disk space on your machine unless you manually vacuum, but it's going to free up space Synapse will be able to use (as in, it will not give the space back to the system but it's going to mark the space as somewhere Synapse can write new data onto). See https://matrix-org.github.io/synapse/latest/message_retention_policies.html#note-on-reclaiming-disk-space for more information.
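To make the "manually vacuum" part concrete, here is a minimal sketch using psycopg2; the connection details (database name, role, host, credentials) are assumptions and will differ depending on how the playbook set up Postgres:

# Minimal sketch: list the largest tables in the Synapse database, then
# reclaim disk space with VACUUM FULL. The connection details below are
# assumptions -- adjust them (and add a password if needed) for your setup.
import psycopg2

conn = psycopg2.connect(
    dbname="synapse",   # assumed database name
    user="synapse",     # assumed role
    host="localhost",
    port=5432,
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    # Show the ten largest tables, to see where the space actually went.
    cur.execute(
        """
        SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
        FROM pg_catalog.pg_statio_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
        LIMIT 10
        """
    )
    for name, size in cur.fetchall():
        print(name, size)

    # VACUUM FULL rewrites the tables and returns the freed space to the
    # operating system; it takes exclusive locks, so stop Synapse first.
    cur.execute("VACUUM FULL")

conn.close()

A plain VACUUM (without FULL) only marks the space as reusable inside Postgres, which is exactly the behaviour described above.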
On the CPU side of things, as Erik implied it's kind of expected to see an increase when purge jobs run (which is why we recommend they run at most once every several hours). I think there are a few quick improvements we can make to the purge code though: for example, we check here how many events we need to delete, but we could also just skip all the way down to dropping the temporary table if that number is 0 (looking at the logs you shared this could alleviate a lot of the load, given a fair number of jobs are reporting 0 events to delete). And/or not insert events we shouldn't delete into that table in the first place.
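To illustrate the first idea, here is a hypothetical sketch (not Synapse's actual code; the table names are made up for illustration, and txn can be any DB-API cursor): count the queued rows first and skip straight to dropping the temporary table when there is nothing to delete.

# Hypothetical sketch of the "skip when there is nothing to delete" idea.
# Table names are illustrative, not Synapse's real schema.
def purge_queued_events(txn) -> int:
    """Delete the events queued in a temporary table, skipping the heavy
    per-table DELETEs entirely when the queue is empty."""
    txn.execute("SELECT count(*) FROM tmp_events_to_purge")
    (to_delete,) = txn.fetchone()

    if to_delete == 0:
        # Nothing to purge: drop the temporary table and return early.
        txn.execute("DROP TABLE tmp_events_to_purge")
        return 0

    # Otherwise run the usual deletes against the referencing tables.
    for table in ("table_a", "table_b", "table_c"):
        txn.execute(
            f"DELETE FROM {table} WHERE event_id IN "
            "(SELECT event_id FROM tmp_events_to_purge)"
        )
    txn.execute("DROP TABLE tmp_events_to_purge")
    return to_delete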
Thanks for explaining, I did a VACUUM FULL with the tools provided by the ansible playbook.
Hope you can improve the process 👍
For the paper trail, I've opened issues for the two improvement ideas I've mentioned in my previous comment: https://github.com/matrix-org/synapse/issues/13417 and https://github.com/matrix-org/synapse/issues/13419
I've seen history purges cause a big explosion in state groups in the past: https://github.com/matrix-org/synapse/issues/3364#issuecomment-395263702.