
Very high disk activity caused by Redis since upgrading from 22.06.0 to 22.10.0

Status: Open. edhgoose opened this issue 2 years ago • 24 comments

Self-Hosted Version

22.10.0

CPU Architecture

x86_64

Docker Version

20.10.17, build 100c701

Docker Compose Version

1.29.2, build 5becea4c

Steps to Reproduce

We haven't been able to reproduce the issue on demand. It happens on a semi-regular basis, which suggests it may be related to a cron job or some other scheduled task.
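One way to test the cron-job hypothesis is to check whether the spikes start at roughly evenly spaced times. A minimal sketch, assuming you have noted down the spike start times (for example from the disk-metrics graphs); the function name `looks_periodic` and the 5-minute tolerance are illustrative choices, not part of Sentry:

```python
from datetime import datetime, timedelta
from statistics import median


def looks_periodic(starts: list[datetime],
                   tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if spike start times are spaced roughly evenly.

    Hypothetical helper: `starts` are spike start times, and `tolerance` is
    how far each gap may deviate from the median gap and still count as regular.
    """
    if len(starts) < 3:
        return False  # fewer than two gaps: nothing to compare
    ordered = sorted(starts)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    typical = median(gaps)  # median of timedelta values is itself a timedelta
    return all(abs(gap - typical) <= tolerance for gap in gaps)


# Usage with made-up times, purely for illustration:
base = datetime(2022, 11, 1, 3, 0)
print(looks_periodic([base, base + timedelta(hours=24),
                      base + timedelta(hours=48, minutes=2)]))  # True
```

A consistent gap (for example about 24 hours) would point towards a scheduled job; irregular gaps would make that explanation less likely.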

Expected Result

We have an AWS EC2 instance running the self-hosted version of Sentry.

Since upgrading to 22.10.0, we have seen periodic, long spikes of intense disk read activity. During these periods the EC2 instance becomes unusable and crashes, and we have to reboot it to recover.

This is a graph of Read Ops and Write Ops combined: [screenshot from 2022-11-02, not included]

And of Read Ops alone: [screenshot from 2022-11-02, not included]
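To pin down spike times from the raw metrics rather than by eye, the samples can be grouped into windows where Read Ops stay above a threshold. A minimal sketch, assuming the metric has been exported as `(timestamp, read_ops)` pairs; the helper `find_spikes` and the threshold value are hypothetical:

```python
from datetime import datetime, timedelta
from typing import Iterable


def find_spikes(samples: Iterable[tuple[datetime, float]],
                threshold: float) -> list[tuple[datetime, datetime]]:
    """Group consecutive samples at or above `threshold` into (start, end) windows."""
    windows = []
    start = end = None
    for ts, value in sorted(samples):  # process in time order
        if value >= threshold:
            if start is None:
                start = ts  # a new spike begins
            end = ts        # extend the current spike
        elif start is not None:
            windows.append((start, end))  # spike ended on the previous sample
            start = end = None
    if start is not None:
        windows.append((start, end))  # spike still running at the last sample
    return windows


# Made-up 5-minute samples, purely for illustration:
t0 = datetime(2022, 11, 2, 20, 0)
step = timedelta(minutes=5)
values = [10, 12, 900, 950, 11, 10, 1200, 9]
samples = [(t0 + i * step, v) for i, v in enumerate(values)]
for start, end in find_spikes(samples, threshold=500):
    print(start, "->", end)
```

The start time of each window is what you would line up against log timestamps or scheduled jobs.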

The instance is unreachable via SSH, although AWS reports that it is still up. Ping fails, and our AWS ALB health check also reports it as unhealthy.

We have upgraded the EBS volume from gp2 to gp3 to see if that helps, but so far without success.

We have fed all our Sentry logs into CloudWatch but have not been able to spot an obvious cause for this problem. We would appreciate some guidance on where to look.
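When searching the collected logs, one place to start is Redis's own messages about RDB snapshots (background saves), since a snapshot writes the whole dataset to disk, together with the `MISCONF` error that clients receive when snapshotting fails. A minimal sketch that keeps matching lines in order so their timestamps can be compared with the disk spikes; the marker strings are assumptions based on typical Redis log wording and may differ between Redis versions, and `find_rdb_events` is a hypothetical helper:

```python
from typing import Iterable

# Substrings that typically appear in Redis logs around RDB snapshots, plus the
# MISCONF error seen by clients. Assumed wording; check it against your Redis version.
RDB_MARKERS = {
    "bgsave_started": "Background saving started",
    "bgsave_ok": "Background saving terminated with success",
    "bgsave_error": "Background saving error",
    "fork_failed": "Can't save in background",
    "misconf": "MISCONF",
}


def find_rdb_events(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Return (marker_name, line) for each line matching an RDB marker, in order."""
    events = []
    for line in lines:
        for name, needle in RDB_MARKERS.items():
            if needle in line:
                events.append((name, line.rstrip()))
                break  # record each line once, under its first matching marker
    return events


# Usage with made-up log lines, purely for illustration:
sample = [
    "1:M 02 Nov 2022 20:21:35.123 * Background saving started by pid 42",
    "42:C 02 Nov 2022 20:22:10.000 * DB saved on disk",
    "1:M 02 Nov 2022 20:22:11.000 # Background saving error",
    "redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots",
]
for name, line in find_rdb_events(sample):
    print(name, "|", line)
```

The input lines could come from, for example, `docker-compose logs redis` (assuming the Compose service is named `redis`); the service-name prefix Compose adds to each line does not affect the substring matching.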

So far, some patterns and thoughts we have identified:

  • We use Sentry Relay, and have observed a significant number of requests around the time the issue occurs. However, this spike appears to be a consequence of the problem rather than its cause.
  • We occasionally see Redis errors in the logs, such as:

redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

However, these appear to happen after the instance has restarted and do not appear to be the cause.

  • We believe errors like this one are a side effect of Sentry restarting/recovering, and are therefore red herrings.
  • Our disk is a 500 GB EBS gp3 volume with approximately 50% free space.
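Since the `MISCONF` error means Redis could not complete its last background save, it may also help to capture Redis's persistence status during or right after a spike. A minimal sketch, assuming the text output of `redis-cli INFO persistence` is available; `rdb_bgsave_in_progress` and `rdb_last_bgsave_status` are standard Redis INFO fields, while the helpers `parse_info` and `persistence_warnings` are hypothetical:

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse INFO output ("key:value" lines, "#" section headers) into a dict."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Persistence"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def persistence_warnings(text: str) -> list[str]:
    """Flag RDB-related conditions in INFO persistence output."""
    info = parse_info(text)
    warnings = []
    status = info.get("rdb_last_bgsave_status")
    if status is not None and status != "ok":
        warnings.append(f"last background save failed (rdb_last_bgsave_status={status})")
    if info.get("rdb_bgsave_in_progress") == "1":
        warnings.append("a background save (RDB snapshot) is in progress right now")
    return warnings


# Made-up output, purely for illustration (INFO replies separate lines with CRLF):
sample = (
    "# Persistence\r\n"
    "rdb_changes_since_last_save:1523\r\n"
    "rdb_bgsave_in_progress:1\r\n"
    "rdb_last_bgsave_status:err\r\n"
)
for warning in persistence_warnings(sample):
    print(warning)
```

The text can be obtained with something like `docker-compose exec redis redis-cli INFO persistence`, again assuming the Compose service is named `redis`. A snapshot in progress at the moment of a spike would make Redis persistence a stronger suspect.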

Actual Result

We would expect Sentry (and its dependencies) not to hit the disk this intensively, and not to crash :)

Event ID

No response

edhgoose, Nov 02 '22 20:11