[Feature] Prometheus alert for failing backups
It would be great if there were a Prometheus alert for failing backups.
I recently had backups that did not complete successfully. In such cases, it would be helpful to be alerted automatically without having to set up an alert yourself.
I just added this rule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alert-backup
spec:
  groups:
    - name: cnpg
      rules:
        - alert: BackupFailed
          annotations:
            description: |
              Last available backup for the CNPG cluster in {{ $labels.namespace }} is older than 2 hours: {{ $value }} hours
            summary: The last available backup is older than 2 hours.
          expr: |
            (time() - sum by (namespace) (cnpg_collector_last_available_backup_timestamp{pod=~"cluster-([1-9][0-9]*)$"})) / 3600 > 2
          for: 0s
          labels:
            severity: warning
This groups by namespace and assumes that there is only one cluster per namespace. That works for my requirements, but it would be better if the chart included a rule that raises one alert per cluster.
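For anyone who needs one alert per cluster in the meantime, a per-cluster variant might look roughly like the sketch below. It is not what the chart ships. It assumes pods are named `<cluster>-<number>` and derives a hypothetical `cluster` label from the pod name with `label_replace`. The rule name `cnpg-alert-backup-per-cluster` is made up for illustration. It also uses `max` instead of `sum`, on the assumption that several pods of a cluster may each report the timestamp, in which case a sum would add the timestamps together.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alert-backup-per-cluster  # hypothetical name
spec:
  groups:
    - name: cnpg
      rules:
        - alert: BackupFailed
          annotations:
            description: |
              Last available backup for the CNPG cluster {{ $labels.cluster }} in {{ $labels.namespace }} is older than 2 hours: {{ $value }} hours
            summary: The last available backup is older than 2 hours.
          # Assumption: pod names follow "<cluster>-<number>", so the cluster
          # name is everything before the final "-<number>" suffix.
          expr: |
            (
              time()
              - max by (namespace, cluster) (
                  label_replace(
                    cnpg_collector_last_available_backup_timestamp,
                    "cluster", "$1", "pod", "(.+)-[0-9]+"
                  )
                )
            ) / 3600 > 2
          for: 0s
          labels:
            severity: warning
```

The 2-hour threshold is carried over unchanged from the rule above and should be adjusted to match the actual backup schedule.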
I'll implement this rule, but I am going to infer the value from the scheduled backups section, with a much more reasonable default of something like 2*schedule, i.e. 48h by default.
P.S. Base backups every 2h seem a bit excessive, but I imagine you generate a lot of WALs and want to decrease your RTO.
That sounds great, thank you very much.
2 * schedule sounds reasonable.
If I can help in any way, please let me know.
It's off-topic here, but briefly about the backup times: you are right. I have looked at the documentation on backups again, and I can significantly reduce the frequency of backups. I wanted to keep the RPO as low as possible, but that is already achieved by the WAL backup. RTO is not so important.
Hi, @mrclrchtr. I'm Dosu, and I'm helping the charts team manage their backlog. I'm marking this issue as stale.
Issue Summary:
- You requested a Prometheus alert feature for failing backups.
- You shared a custom rule for alerts when backups are older than two hours.
- @itay-grudev agreed to implement a similar rule with a default of 48 hours.
- You expressed agreement with this approach and offered further assistance.
Next Steps:
- Please confirm if this issue is still relevant to the latest version of the charts repository. If so, you can keep the discussion open by commenting here.
- Otherwise, this issue will be automatically closed in 7 days.
Thank you for your understanding and contribution!
I implemented a solution on my own, but I still think this would be a very useful alert for others, too.
@itay-grudev, the user @mrclrchtr has implemented their own solution but believes that the Prometheus alert feature for failing backups would still be beneficial for others. Could you please assist with this issue?