self-hosted
Very high disk activity caused by Redis since upgrading from 22.06.0 to 22.10.0
Self-Hosted Version
22.10.0
CPU Architecture
x86_64
Docker Version
20.10.17, build 100c701
Docker Compose Version
1.29.2, build 5becea4c
Steps to Reproduce
We have not been able to reproduce the issue on demand. It happens on a semi-regular basis, which suggests it may be related to a cron job or something similar.
Expected Result
We have an AWS EC2 instance running the self hosted version of Sentry.
Since upgrading to 22.10.0 we have seen periodic, long spikes of intense disk read activity. During these periods the EC2 instance becomes unusable and crashes. We must reboot the EC2 instance to recover.
This is a graph of Read Ops and Write Ops combined:
And just of Read Ops:
The instance is unreachable via SSH, although Amazon says the instance is still up. ping is failing, and our AWS ALB health check reports unhealthy too.
We have upgraded the EBS volume from gp2 to gp3 to see if that will help, but so far it has made no difference.
We have fed all our Sentry logs into CloudWatch but have not been able to spot an obvious candidate for the cause of this problem. We would appreciate some guidance on where to look.
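One way to narrow down which process is doing the reads is to rank processes by the read_bytes counter in /proc/<pid>/io. The following is a minimal sketch, assuming a Linux host and enough privileges (typically root) to read other processes' io files; processes running inside Docker containers are visible in the host's /proc. The helper names parse_proc_io and top_readers are hypothetical.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def top_readers(proc_root="/proc", limit=10):
    """Return (read_bytes, pid, command) tuples, largest readers first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                io_counters = parse_proc_io(f.read())
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited, or we lack permission to read it
        rows.append((io_counters.get("read_bytes", 0), int(entry), name))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for read_bytes, pid, name in top_readers():
        print(f"{read_bytes:>16} {pid:>8} {name}")
```

The counters are cumulative since each process started, so running this on a schedule and comparing consecutive samples would show which process's read_bytes jumps as a spike begins.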
So far, some patterns and thoughts we have identified:
- We use Sentry Relay, and have observed a significant number of requests around the time the issue occurs. However, this spike appears to be a consequence of the problem rather than its cause.
- We do occasionally see errors in the logs about Redis, like:
redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
However, these appear to happen after the instance has restarted and do not appear to be the cause.
- We believe errors like this one are a result of Sentry restarting/recovering, and that they are red herrings.
- Our disk is a 500 GB EBS gp3 volume, with approximately 50% free disk space available.
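To check whether the MISCONF error reflects an ongoing snapshot problem or only a post-restart state, the Redis INFO persistence section can be inspected. Below is a minimal sketch using the redis-py client. The host and port are assumptions (in a Docker Compose setup the Redis container may not publish its port to the host), and persistence_findings is a hypothetical helper; the dictionary keys are standard fields of INFO persistence.

```python
def persistence_findings(info):
    """Summarize notable states from a Redis INFO persistence dict."""
    findings = []
    # 'ok' or 'err'; 'err' is the state behind the MISCONF write rejection
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        findings.append("last RDB background save failed")
    # 1 while a forked child is writing the snapshot to disk
    if info.get("rdb_bgsave_in_progress"):
        findings.append("RDB background save in progress")
    if info.get("aof_enabled"):
        findings.append("AOF persistence enabled")
    return findings


if __name__ == "__main__":
    import redis  # third-party client: pip install redis

    # Assumed connection details; adjust to wherever Redis is reachable.
    client = redis.Redis(host="localhost", port=6379)
    info = client.info("persistence")
    print("changes since last save:", info.get("rdb_changes_since_last_save"))
    print("last bgsave duration (s):", info.get("rdb_last_bgsave_time_sec"))
    for finding in persistence_findings(info):
        print("NOTE:", finding)
```

Capturing this output during a spike, if the instance is still reachable, would show whether a background save is running at that moment.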
Actual Result
We would expect Sentry (or dependent tools) not to hit disk as intensively, and not to crash :)
Event ID
No response