
[Feature] Prometheus alert for failing backups

Open mrclrchtr opened this issue 1 year ago • 3 comments

It would be great if there was a Prometheus alert for failing backups.

I recently had backups that did not complete successfully. In such cases, it would be great to be alerted automatically without having to set up an alert yourself.

mrclrchtr avatar Jul 16 '24 15:07 mrclrchtr

I just added this rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alert-backup
spec:
  groups:
    - name: cnpg
      rules:
        - alert: BackupFailed
          annotations:
            description: |
              Last available backup for the CNPG cluster in {{ $labels.namespace }} is older than 2 hours: {{ $value }} hours
            summary: The last available backup is older than 2 hours.
          expr: |
            (time() - sum by (namespace) (cnpg_collector_last_available_backup_timestamp{pod=~"cluster-([1-9][0-9]*)$"})) / 3600 > 2
          for: 0s
          labels:
            severity: warning

This groups by namespace and assumes that there is only one cluster per namespace, which works for my requirements. It would certainly be better, though, if the chart included a rule that raises one alert per cluster.
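As a rough idea of what a per-cluster variant could look like, here is a minimal sketch. It assumes that instance pods are named `<cluster>-<serial>` (as the `pod` regex in the rule above implies) and derives a `cluster` label from the pod name with `label_replace`. The rule name, the alert name, and the derived `cluster` label are my own placeholders and not something the chart provides. It also takes `max` instead of `sum`, on the assumption that several instances of the same cluster may report the same timestamp.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alert-backup-per-cluster  # hypothetical name
spec:
  groups:
    - name: cnpg
      rules:
        - alert: BackupTooOld  # hypothetical alert name
          annotations:
            description: |
              Last available backup for the CNPG cluster {{ $labels.cluster }} in {{ $labels.namespace }} is older than 2 hours: {{ $value }} hours
            summary: The last available backup is older than 2 hours.
          expr: |
            (
              time()
              - max by (namespace, cluster) (
                  # Assumption: pods are named "<cluster>-<serial>", so the
                  # part before the trailing serial is used as the cluster name.
                  label_replace(
                    cnpg_collector_last_available_backup_timestamp,
                    "cluster", "$1", "pod", "(.+)-[1-9][0-9]*"
                  )
                )
            ) / 3600 > 2
          for: 0s
          labels:
            severity: warning
```

Note that `label_replace` overwrites an existing `cluster` label on series whose `pod` label matches the pattern. If your scrape configuration already attaches such a label, the `label_replace` call can probably be dropped and the existing label used directly in the `by` clause.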

mrclrchtr avatar Jul 18 '24 10:07 mrclrchtr

I'll implement this rule, but I am going to infer the threshold from the scheduled backups section, with a much more reasonable default of something like 2 * schedule, which comes to 48h by default.

P.S. Base backups every 2h seem a bit excessive, but I imagine you generate a lot of WALs and want to decrease your RTO.

itay-grudev avatar Jul 24 '24 16:07 itay-grudev

That sounds great, thank you very much.

2 * schedule sounds reasonable.

If I can help in any way, please let me know.

This is off-topic here, but briefly about the backup times: you are right. I looked at the backup documentation again, and I can significantly reduce the frequency of base backups. I wanted to keep the RPO as low as possible, but that is already achieved by the WAL backup. RTO is not so important.

mrclrchtr avatar Jul 25 '24 09:07 mrclrchtr

Hi, @mrclrchtr. I'm Dosu, and I'm helping the charts team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You requested a Prometheus alert feature for failing backups.
  • You shared a custom rule for alerts when backups are older than two hours.
  • @itay-grudev agreed to implement a similar rule with a default of 48 hours.
  • You expressed agreement with this approach and offered further assistance.

Next Steps:

  • Please confirm if this issue is still relevant to the latest version of the charts repository. If so, you can keep the discussion open by commenting here.
  • Otherwise, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Apr 09 '25 16:04 dosubot[bot]

I implemented a solution on my own, but I still think this would be a very useful alert for others, too.

mrclrchtr avatar Apr 09 '25 17:04 mrclrchtr

@itay-grudev, the user @mrclrchtr has implemented their own solution but believes that the Prometheus alert feature for failing backups would still be beneficial for others. Could you please assist with this issue?

dosubot[bot] avatar Apr 09 '25 17:04 dosubot[bot]