etcd-operator

Prometheus metrics for backups

Open jescarri opened this issue 5 years ago • 6 comments

Currently there's no exporter for the etcd-backup-operator.

Creating this issue to link it to a PR.

jescarri avatar Jun 17 '19 20:06 jescarri

As a side note, I took your branch and my branch and built my own backup operator image. It works great.

jurgenweber avatar Jul 09 '19 00:07 jurgenweber

@jurgenweber yes, it's been running in our clusters for a few weeks w/o problems :)

Thanks for testing it!

jescarri avatar Jul 09 '19 19:07 jescarri

Do you have any Prometheus alerts or Grafana dashboards you wouldn't mind sharing?

Also, I am finding that if the pod gets restarted, the metrics disappear until a new backup runs. The metrics endpoint no longer returns the etcd_operator_backup.* metrics, although the others are still there. I think they need to be returned all the time, even when they have no value yet. Thoughts?
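Until the exporter always exposes these series, one alert-side workaround might be a rule built on PromQL's `absent()`. This is only a sketch: the metric name `etcd_operator_backup_last_success` is the one used in alerts elsewhere in this thread, while the alert name, the `for` duration (which assumes roughly hourly backups), and the severity are assumptions.

```yaml
# Sketch only: alert name, "for" duration, and severity are assumptions.
- alert: etcdBackupMetricMissing
  annotations:
    summary: etcd_operator_backup_last_success is not being exported
  # absent() returns 1 when no series matches the selector, which is the
  # state described above after a pod restart and before the next backup.
  expr: absent(etcd_operator_backup_last_success)
  # Assumes roughly hourly backups; shorter values would fire on every restart.
  for: 90m
  labels:
    severity: warning
```

Without a rule like this, an expression such as `time() - etcd_operator_backup_last_success > N` returns no result while the series is missing, so a staleness alert stays silent during that gap.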

jurgenweber avatar Jul 10 '19 02:07 jurgenweber

@rjtsdl sure, I can do that.

I was planning to add readiness / liveness probes later, but you are right, simple handlers can do the trick.
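For illustration, simple handlers like these would typically be wired into the operator's Deployment as probes. The fragment below is a rough sketch: the `/healthz` and `/readyz` paths, port 8080, the container name, and the timings are all hypothetical and are not confirmed by this thread or the PR.

```yaml
# Hypothetical container spec fragment; paths, port, and timings are assumptions.
containers:
  - name: etcd-backup-operator
    livenessProbe:
      httpGet:
        path: /healthz   # assumed handler path
        port: 8080       # assumed health/metrics port
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /readyz    # assumed handler path
        port: 8080
      periodSeconds: 10
```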

jescarri avatar Jul 10 '19 06:07 jescarri

@jurgenweber this is what we have right now:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-backup
spec:
  groups:
  - name: etcd-backup
    rules:
    - alert: etcdBackupControllerDown
      annotations:
        summary: etcd-backup pod {{ $labels.kubernetes_pod_name }} has
          been down for 5 minutes
      expr: absent(up{app="etcd-backup-operator"}) == 1
      for: 5m
      labels:
        class: availability
        severity: p1
    - alert: etcdBackupsNOTAttempted
      annotations:
        summary: No etcd backups have been attempted in the past 30 min
      expr: rate(etcd_operator_backups_attempt_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
    - alert: etcdBackupsNOTSucceeding
      annotations:
        summary: No etcd backups have succeeded in the past 30 min
      expr: rate(etcd_operator_backups_success_total[30m]) * 1800 < 2
      for: 5m
      labels:
        class: availability
        severity: p2
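
A note on the `rate(...[30m]) * 1800` idiom in the two backup rules above: `rate()` yields a per-second average, so multiplying by 1800 (the number of seconds in 30 minutes) turns it back into an approximate count for the window. A possibly more readable equivalent, sketched here and not taken from the PR, uses `increase()`:

```yaml
# increase(v[30m]) is rate(v[30m]) multiplied by the window length in
# seconds, so this should behave like the rate-based rule above.
- alert: etcdBackupsNOTAttempted
  annotations:
    summary: Fewer than 2 etcd backups were attempted in the past 30 min
  expr: increase(etcd_operator_backups_attempt_total[30m]) < 2
  for: 5m
  labels:
    class: availability
    severity: p2
```

Either form returns no result, and so never fires, while the counter series does not exist at all.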

jescarri avatar Jul 11 '19 04:07 jescarri

Yeah, my schedule is once an hour:

        - alert: VaultEtcdLastBackup
          annotations:
            summary: The last backup was more than 1 hour ago, please check it
            description: "vault etcd {{ $labels.instance }} backup too old"
          expr: time() - etcd_operator_backup_last_success{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"} > 3700
          for: 10m
          labels:
            severity: critical
        - alert: VaultEtcdBackupFailed
          annotations:
            summary: The backup has failed. This checks for 3 successful backups in the last 3 hours; check that it is working.
            description: "vault etcd {{ $labels.instance }} backup has failed"
          expr: increase(etcd_operator_backups_success_total{name="vault-etcd-cluster-backup",namespace="devops",release="amazing-dog"}[3h]) < 3
          for: 10m
          labels:
            severity: critical

jurgenweber avatar Jul 11 '19 05:07 jurgenweber