Sudden CPU spike to 100% and disk space decreasing fast

Open · roughnecks opened this issue 2 years ago • 10 comments

Description

I have "btop" always running on my second monitor.. Unpinned a message in a room (from Element Web) and all of the sudden (not sure if related) server's CPU went crazy to 100% for a minute or so, until when I restarted the ansible playbook.

In that minute the DB grew by roughly 1 GB, but I lost more than 2 GB of disk space and I don't know exactly where. Afterwards I ran the rust-synapse-compress-state utility and did a VACUUM FULL, and the DB is back to normal now.

Steps to reproduce

I don't have any steps to reproduce; it happened out of the blue.

The only thing I was doing at that time was removing a pinned message from a room.

Homeserver

woodpeckersnest.space

Synapse Version

1.63.1

Installation Method

Other (please mention below)

Platform

I'm running the matrix-docker-ansible playbook on a Debian 11 VPS with 2GB RAM and 1 vCPU.

Relevant log output

Jul 21 21:42:11 matrix-synapse[1303919]: 2022-07-21 19:42:11,764 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655840531742 => stream_ordering 155865)

Anything else that would be useful to know?

No response

roughnecks · Jul 22 '22 11:07

It happened again a few minutes ago, but it didn't last very long.

Jul 23 13:30:51 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:51,176 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851151 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,165 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983851188 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,192 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852167 => stream_ordering 161950)
Jul 23 13:30:52 pandora.woodpeckersnest.space matrix-synapse[90282]: 2022-07-23 11:30:52,220 - synapse.handlers.pagination - 274 - WARNING - purge_history_for_rooms_in_range-0 - [purge] purging events not possible: No event found (ts 1655983852193 => stream_ordering 161950)

roughnecks · Jul 23 '22 11:07

I can see you have Synapse's experimental support for message retention policies enabled. Can you see other log lines including [purge] in your logs?
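
For context, that's the feature driven by the retention block in homeserver.yaml; an illustrative configuration (not necessarily what you have set) looks roughly like this:

retention:
  enabled: true
  # optional default policy applied to rooms that don't set their own
  default_policy:
    min_lifetime: 1d
    max_lifetime: 1y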

Also, what level are you logging at? From your report it sounds like WARNING, which isn't really enough to figure out what Synapse is doing at a specific point in time (you usually want INFO).
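
Roughly speaking, and assuming the playbook lets you edit Synapse's log config (the exact handler names depend on your setup), raising the level means something like this in the logging YAML:

loggers:
  synapse:
    level: INFO
root:
  level: INFO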

I would also recommend setting up monitoring with Prometheus and Grafana, as that would give better insight into which part of Synapse is causing this issue, and would help us assist you in case the INFO logs aren't enough to go on.
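
For reference, Synapse's built-in Prometheus endpoint is switched on with something along these lines in homeserver.yaml (port and bind address are illustrative, and the playbook normally exposes this through its own variables rather than hand-editing the file):

enable_metrics: true
listeners:
  - port: 9000
    type: metrics
    bind_addresses: ['127.0.0.1']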

babolivier · Jul 25 '22 15:07

I just enabled INFO logging but I can't see any new "purge" messages; I guess I'll have to wait for the message retention policies to kick in.

Also, I can't run Prometheus and Grafana because of the very limited server specs.

I will update this post when I see new "purge" messages.

Thanks

roughnecks · Jul 25 '22 16:07

@babolivier https://gist.githubusercontent.com/roughnecks/d8e78fd29382f4a7cb7031f310b9bdeb/raw/f9581b9046b0faaf072a75aaa192e15bb0051653/2022-07-27_22-40-44.txt

roughnecks · Jul 27 '22 20:07

Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?

It looks like the way this is set up is that every N hours (or however long you've set the interval to) we look at every room with a retention policy and run a purge on them. That would probably explain the CPU usage if you have lots of such rooms.
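
For illustration only (the real values depend on your config), that schedule comes from the purge_jobs part of the retention section, along these lines:

retention:
  purge_jobs:
    # rooms whose max_lifetime is 3d or less, purged every 12 hours
    - longest_max_lifetime: 3d
      interval: 12h
    # rooms with longer lifetimes, purged once a day
    - shortest_max_lifetime: 3d
      interval: 1d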

I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?

erikjohnston · Jul 29 '22 10:07

> Are you seeing the increased CPU and disk space issues during the period of time the logs are showing [purge] messages?

Yes.

> I'm a bit more surprised by the disk space. Is it the DB that's growing? Does it keep growing on subsequent runs of the purge jobs?

Yeah, it keeps growing, but most of the growth happened on the first problematic purge; subsequent ones were faster to process and took less disk space. I have disabled message retention now.

To me it looks like everything is about the same size except ./postgres, and I wouldn't know what other dirs to check:

root@pandora:/matrix# du -h --max-depth=1
24K     ./mautrix-telegram
20K     ./static-files
23M     ./nginx-proxy
8.0K    ./mailer
1.2M    ./ssl
62M     ./mautrix-whatsapp
8.0K    ./coturn
36K     ./client-element
2.1G    ./synapse
667M    ./postgres-backup
4.0K    ./rust-synapse-compress-state
8.0K    ./heisenbridge
292K    ./matrix-reminder-bot
2.3G    ./postgres
5.1G    .
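
If it helps, a per-table breakdown of that ./postgres usage could be pulled with a query along these lines (assuming direct psql access to the synapse database):

-- list the 20 largest tables (including their indexes), biggest first
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;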

roughnecks · Jul 29 '22 10:07

I think I've seen the database grow a bit in the past when running purge jobs but didn't take the time to investigate. I think it could be caused by the fact that we queue up the events we might want to delete in a temporary table and don't vacuum after dropping it. In general, purge jobs aren't expected to free up disk space on your machine unless you manually vacuum, but they do free up space Synapse will be able to reuse (as in, the space is not given back to the system, but it is marked as space new data can be written into). See https://matrix-org.github.io/synapse/latest/message_retention_policies.html#note-on-reclaiming-disk-space for more information.
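
To make that distinction concrete, in psql against the synapse database the two variants behave quite differently:

-- a plain VACUUM only marks dead rows as space Postgres can reuse;
-- nothing is handed back to the operating system
VACUUM;

-- VACUUM FULL rewrites the tables and does return the space to the OS,
-- but it takes exclusive locks and temporarily needs enough free disk
-- to hold a copy of the tables being rewritten
VACUUM FULL;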

On the CPU side of things, as Erik implied, it's kind of expected to see an increase when purge jobs run (which is the reason why we recommend they run at most once every several hours). I think there are a few quick improvements we could make to the purge code, though: for example, we check here how many events we need to delete, but we could just skip all the way down to dropping the temporary table if that number is 0 (looking at the logs you shared, this could alleviate a lot of the load, given a fair amount of jobs are reporting 0 events to delete). And/or not insert events we should not delete into that table in the first place.

babolivier · Jul 29 '22 12:07

Thanks for explaining. I did a VACUUM FULL with the tools provided by the Ansible playbook. Hope you can improve the process 👍

roughnecks · Jul 29 '22 12:07

For the paper trail, I've opened issues for the two improvement ideas I've mentioned in my previous comment: https://github.com/matrix-org/synapse/issues/13417 and https://github.com/matrix-org/synapse/issues/13419

babolivier · Jul 29 '22 12:07

I've seen history purges cause a big explosion in state groups in the past: https://github.com/matrix-org/synapse/issues/3364#issuecomment-395263702.

richvdh · Jul 29 '22 12:07