A single corrupted rule group in S3 causes all of the ruler's other tenants to fail to update rule groups

Open qinxx108 opened this issue 4 years ago • 7 comments

Describe the bug: The rule group loading step loads the rule group files for all tenants that a ruler is responsible for. This means that if one rule group file is corrupted, all rule groups for all tenants on that ruler fail to update. See: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L483

Expected behavior: Tenants should be isolated from each other. A corrupted rule group file for one tenant shouldn't stop other tenants from updating.

Environment:

  • Infrastructure: AWS EKS
  • Deployment tool: helm

Storage Engine

  • [x] Blocks
  • [ ] Chunks

qinxx108 avatar May 13 '21 00:05 qinxx108

Thank you for reporting this @qinxx108. Looking at the code, I'm assuming that corruption, in this case, means we've failed to list/download the rule group.

In that case, it is true that it'll bail out of the sync process instead of returning the configs we were able to list and download. Out of curiosity, could you share a bit more detail on what that failure looks like? An example of a "corrupted file" would work wonders.

https://github.com/cortexproject/cortex/blob/0c091e68c33dd11f6bdc2fdfbd33c9077db3b70a/pkg/ruler/ruler.go#L471-L489

@pracucci I think we should change the semantics of listRules and loadRuleGroups to skip a tenant if it fails. WDYT?

gotjosh avatar May 13 '21 11:05 gotjosh

> I'm assuming that corruption, in this case, means we've failed to list/download the rule group.

The listing is triggered by r.listRules(). Rule group content is not downloaded or decoded by r.listRules(). Listing failures could be temporary and, in my opinion, we should treat listing as atomic: either the full listing succeeds or the sync is skipped entirely.

However, the issue could be in store.LoadRuleGroups(): this function returns an error if at least one rule group fails to download or decode and, when it returns an error, the whole Ruler.syncRules() just exits. What we could do is improve both LoadRuleGroups() and syncRules() so that a single failure doesn't prevent the whole ruler from syncing.
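
For illustration only, here is a minimal sketch of that kind of per-tenant isolation. It is not code from the Cortex repository: `RuleGroup`, `loadFunc`, `loadAllTenants` and `fakeLoad` are hypothetical names, and the real LoadRuleGroups() and syncRules() have different signatures. The idea is simply to record a tenant's load error and carry on with the remaining tenants instead of returning on the first failure.

```go
package main

import (
	"errors"
	"fmt"
)

// RuleGroup is a hypothetical stand-in for a decoded rule group.
type RuleGroup struct {
	Name string
}

// loadFunc is a hypothetical loader that downloads and decodes
// every rule group belonging to a single tenant.
type loadFunc func(tenant string) ([]RuleGroup, error)

// loadAllTenants loads each tenant independently. A failure for one
// tenant is recorded in the second return value and that tenant is
// skipped, so the other tenants are still returned.
func loadAllTenants(tenants []string, load loadFunc) (map[string][]RuleGroup, map[string]error) {
	loaded := make(map[string][]RuleGroup, len(tenants))
	failed := make(map[string]error)
	for _, tenant := range tenants {
		groups, err := load(tenant)
		if err != nil {
			failed[tenant] = err
			continue
		}
		loaded[tenant] = groups
	}
	return loaded, failed
}

// fakeLoad simulates a store in which tenant-b has a corrupted file.
func fakeLoad(tenant string) ([]RuleGroup, error) {
	if tenant == "tenant-b" {
		return nil, errors.New("failed to decode rule group")
	}
	return []RuleGroup{{Name: tenant + "-alerts"}}, nil
}

func main() {
	tenants := []string{"tenant-a", "tenant-b", "tenant-c"}
	loaded, failed := loadAllTenants(tenants, fakeLoad)
	fmt.Println("synced tenants:", len(loaded))  // synced tenants: 2
	fmt.Println("skipped tenants:", len(failed)) // skipped tenants: 1
}
```

In a sketch like this, the caller would still need to decide what to do with the skipped tenants, for example keeping their previously loaded rule groups and logging the error; that choice is an assumption here and is not settled in this thread.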

pracucci avatar May 13 '21 13:05 pracucci

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 12 '21 01:08 stale[bot]

Still valid

pracucci avatar Aug 12 '21 06:08 pracucci

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 11 '21 10:11 stale[bot]

I don't think this is fixed, reopening.

alvinlin123 avatar Apr 26 '22 21:04 alvinlin123

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 12 '22 01:08 stale[bot]