cortex
A single corrupted rule group in S3 causes all other tenants on the ruler to fail to update their rule groups
Describe the bug The load rule groups step loads the rule group files for all tenants that one ruler is responsible for. This means that if one rule group file is corrupted, all rule groups for all tenants on that ruler fail to update. See: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L483
Expected behavior Tenants should be isolated from each other: a corrupted rule group file for one tenant shouldn't stop other tenants from updating.
Environment:
- Infrastructure: AWS EKS
- Deployment tool: helm
Storage Engine
- [x] Blocks
- [ ] Chunks
Thank you for reporting this @qinxx108. Looking at the code, I'm assuming that corruption, in this case, means we've failed to list/download the rule group.
In that case, it is true that the sync process bails out instead of returning the configs we were able to list and download. Out of curiosity, could you share a bit more detail on what that failure looks like? An example of a "corrupted file" would work wonders.
https://github.com/cortexproject/cortex/blob/0c091e68c33dd11f6bdc2fdfbd33c9077db3b70a/pkg/ruler/ruler.go#L471-L489
@pracucci I think we should change the semantics of listRules and loadRuleGroups to skip a tenant if it fails. WDYT?
> I'm assuming that corruption, in this case, means we've failed to list/download the rule group.
The listing is triggered by r.listRules(). Rule group content is not downloaded or decoded by r.listRules(). Listing failures could be temporary and, in my opinion, we should treat listing as atomic: either the full listing succeeds or the sync is skipped entirely.
However, the issue could be in store.LoadRuleGroups(): this function returns an error if at least one rule group fails to download or decode and, when it returns an error, the whole Ruler.syncRules() just exits. What we could do is improve both LoadRuleGroups() and syncRules() so that a single failure doesn't prevent the whole ruler from syncing.
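To make the proposal concrete, here is a minimal sketch of per-tenant isolation during loading. It is not the Cortex implementation: `RuleGroup`, `loadFunc`, `loadAllTenants` and `fakeLoad` are hypothetical names invented for illustration, and the real store and ruler interfaces differ. The only idea shown is that a failure for one tenant is recorded and skipped rather than aborting the whole sync.

```go
package main

import (
	"errors"
	"fmt"
)

// RuleGroup is a hypothetical, simplified stand-in for a decoded rule group.
type RuleGroup struct {
	Name string
}

// loadFunc is a hypothetical per-tenant loader: it downloads and decodes all
// rule groups of one tenant, or returns an error if any of them fails.
type loadFunc func(tenant string) ([]RuleGroup, error)

// loadAllTenants loads each tenant independently. A failure for one tenant is
// recorded in the second return value and that tenant is skipped, so the
// remaining tenants are still loaded.
func loadAllTenants(tenants []string, load loadFunc) (map[string][]RuleGroup, map[string]error) {
	loaded := make(map[string][]RuleGroup, len(tenants))
	failed := make(map[string]error)
	for _, tenant := range tenants {
		groups, err := load(tenant)
		if err != nil {
			failed[tenant] = fmt.Errorf("load rule groups for tenant %q: %w", tenant, err)
			continue
		}
		loaded[tenant] = groups
	}
	return loaded, failed
}

// fakeLoad simulates a store in which tenant-b has a corrupted rule group file.
func fakeLoad(tenant string) ([]RuleGroup, error) {
	if tenant == "tenant-b" {
		return nil, errors.New("corrupted rule group file")
	}
	return []RuleGroup{{Name: tenant + "-alerts"}}, nil
}
```

Calling `loadAllTenants([]string{"tenant-a", "tenant-b", "tenant-c"}, fakeLoad)` returns rule groups for tenant-a and tenant-c and an error entry for tenant-b. In a sync loop built on this, one option (an assumption, not something decided in this thread) would be to log the failed tenants and leave their previously loaded rule groups untouched until the next sync.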
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
Still valid
I don't think this is fixed, reopening.