grafana icon indicating copy to clipboard operation
grafana copied to clipboard

Alerting: Add mutex to Redis HA subs

Open cinkster opened this issue 1 year ago • 3 comments

What is this feature? Adds a mutex and the related locking around access to the redis_peer's sub map.

Why do we need this feature?

Fixes a crash fatal error: concurrent map read and map write that can occur during startup when using redis for HA alerting.

Who is this feature for?

Anyone using redis for HA alerting, especially anyone with a large number of orgs/alerts.

Which issue(s) does this PR fix?:

None I was able to find, but this is most likely due to the edge case nature. I was able to replicate the crash fairly regularly (about 30% of attempted startups) on a test instance with approximately 1000 dummy organisations.

Special notes for your reviewer:

Example crash log prior to this fix, cropped for brevity:

| fatal error: concurrent map read and map write
| 
| goroutine 187 [running]:
| github.com/grafana/grafana/pkg/services/ngalert/notifier.(*redisPeer).fullStateReqReceiveLoop(0xc000706690)
| 	/grafana/pkg/services/ngalert/notifier/redis_peer.go:448 +0x49
| created by github.com/grafana/grafana/pkg/services/ngalert/notifier.newRedisPeer
| 	/grafana/pkg/services/ngalert/notifier/redis_peer.go:197 +0x14bf
| 
| goroutine 1 [runnable]:
| github.com/grafana/grafana/pkg/services/ngalert/notifier.(*redisPeer).AddState(0xc000706690, {0xc002ffeeb8, 0x12}, {0x41afbe8?, 0xc00328ae00}, {0x0?, 0x3a2d051?})
| 	/grafana/pkg/services/ngalert/notifier/redis_peer.go:402 +0x245
| github.com/grafana/grafana/pkg/services/ngalert/notifier.newAlertmanager({0x41bd300?, 0xc000130020}, 0x5a, 0xc000ba6400, {0x41d0af8?, 0xc000c72ea0}, {0x41c30c0, 0xc0007558a0}, {0x41b7cb0, 0xc000706690}, ...)
| 	/grafana/pkg/services/ngalert/notifier/alertmanager.go:179 +0x817
| github.com/grafana/grafana/pkg/services/ngalert/notifier.(*MultiOrgAlertmanager).SyncAlertmanagersForOrgs(0xc0008d9d40, {0x41bd300, 0xc000130020}, {0xc0014aa000, 0xaaf, 0xc00093a070?})
| 	/grafana/pkg/services/ngalert/notifier/multiorg_alertmanager.go:215 +0x3b6
| github.com/grafana/grafana/pkg/services/ngalert/notifier.(*MultiOrgAlertmanager).LoadAndSyncAlertmanagersForOrgs(0xc0008d9d40, {0x41bd300, 0xc000130020})
| 	/grafana/pkg/services/ngalert/notifier/multiorg_alertmanager.go:170 +0xd8
| github.com/grafana/grafana/pkg/services/ngalert.(*AlertNG).init(0xc00036f380)
| 	/grafana/pkg/services/ngalert/ngalert.go:163 +0x425

Please check that:

  • [ ] It works as expected from a user's perspective.
  • [ ] If this is a pre-GA feature, it is behind a feature toggle.
  • [ ] The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

cinkster avatar Jun 28 '24 05:06 cinkster

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update or ping for review. Thank you for your contributions!

github-actions[bot] avatar Aug 02 '24 01:08 github-actions[bot]

This pull request has been automatically closed because it has not had any further activity in the last 2 weeks. Thank you for your contributions!

github-actions[bot] avatar Aug 16 '24 01:08 github-actions[bot]

Ping @fayzal-g . Sorry I missed the stale reminder, but this is a process crashing bug that shouldn't be auto-closed.

cinkster avatar Aug 16 '24 02:08 cinkster