
velero_backup_success_total metric keeps being reported for deleted schedules


What steps did you take and what happened:

Create a schedule, let it perform a few backups, then delete the schedule.

The /metrics endpoint keeps reporting the velero_backup_success_total{schedule=""} metric, even after all the backups have expired, until velero itself is restarted.

What did you expect to happen:

The metric should vanish when the schedule is deleted or after all its backups have expired, or else it should stay around indefinitely (which I wouldn't prefer). In any case, the fact that it goes away when the velero pod is restarted suggests that this is an artifact of the implementation rather than desired behaviour, and it makes it harder to implement certain alerting rules.

multi-io avatar Mar 29 '19 19:03 multi-io
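The retention the reporter describes matches how Prometheus client libraries behave: a metric vector keeps a child series for every label combination it has ever seen until that series is explicitly deleted or the process restarts. A minimal self-contained Go sketch of that behavior (a hypothetical stand-in type, not Velero's or client_golang's actual code):

```go
package main

import "fmt"

// counterVec mimics the retention behavior of a Prometheus CounterVec:
// once a series exists for a label value, it stays on /metrics until it
// is explicitly deleted or the process restarts.
type counterVec struct {
	series map[string]float64 // keyed by the schedule label value
}

func newCounterVec() *counterVec { return &counterVec{series: map[string]float64{}} }

func (c *counterVec) Inc(schedule string)    { c.series[schedule]++ }
func (c *counterVec) Delete(schedule string) { delete(c.series, schedule) }

// Collect returns the label values a scrape would currently see.
func (c *counterVec) Collect() []string {
	out := []string{}
	for k := range c.series {
		out = append(out, k)
	}
	return out
}

func main() {
	backupSuccess := newCounterVec()
	backupSuccess.Inc("nightly") // the schedule runs a backup

	// The schedule is deleted in Kubernetes, but nothing removes the
	// series, so it is still scraped:
	fmt.Println(len(backupSuccess.Collect())) // 1

	// What this issue asks for, in effect:
	backupSuccess.Delete("nightly")
	fmt.Println(len(backupSuccess.Collect())) // 0
}
```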

Same thing happens with the ark_backup_failure_total metric.

maberny avatar Apr 04 '19 20:04 maberny

Hi, I really love velero. It does a great job at its principal purpose: backups and restores.

But exposing nothing at all is better than incorrect metrics. Could you either:

  • advertise that your metrics are not entirely OK, or
  • correct or remove the ones that are not?

Thanks.

h4wkmoon avatar Sep 18 '20 14:09 h4wkmoon

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 08 '21 15:07 stale[bot]

Closing the stale issue.

stale[bot] avatar Jul 22 '21 18:07 stale[bot]

Velero still has this issue. I think it should be reopened.

h4wkmoon avatar Oct 18 '21 14:10 h4wkmoon

The issue is still valid for the latest version!

@eleanor-millman can you re-open?

sbkg0002 avatar Jan 26 '22 08:01 sbkg0002

Sure!

eleanor-millman avatar Feb 11 '22 21:02 eleanor-millman

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 13 '22 12:04 stale[bot]

Bump since this is still an issue.

sbkg0002 avatar Apr 14 '22 08:04 sbkg0002

Really interested in a fix for this issue as well, for years now. I've considered writing a new exporter, but that would be counterproductive, and I lack the Golang skills to properly improve Velero's own with confidence.

guillaumefenollar avatar May 02 '22 04:05 guillaumefenollar

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 06 '22 23:07 stale[bot]

Is there anyone with a possible fix? This keeps spamming our monitoring.

sbkg0002 avatar Jul 07 '22 08:07 sbkg0002

The workaround is still to delete the velero pods. Be sure to do it when no backups are running.

h4wkmoon avatar Jul 07 '22 08:07 h4wkmoon

up

LuckySB avatar Aug 05 '22 16:08 LuckySB

Still a problem :(

KlavsKlavsen avatar Sep 26 '22 09:09 KlavsKlavsen

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 26 '22 17:11 stale[bot]

Closing the stale issue.

stale[bot] avatar Dec 11 '22 16:12 stale[bot]

@eleanor-millman this issue is still an open bug - can you re-open and ask stale-bot to not close it, just because no one has solved it yet?

KlavsKlavsen avatar Dec 12 '22 09:12 KlavsKlavsen

We have the same problem, please reopen

cwrau avatar May 26 '23 13:05 cwrau

Just faced the same issue. Deleting the pod works but it's a workaround, can it be re opened?

lorenzomorandini avatar Jun 21 '23 07:06 lorenzomorandini

Currently, it is impossible to effectively monitor the status of backups. This is an important thing that should be fixed.

neurobion avatar Jun 28 '23 06:06 neurobion

@Neurobion, since the original problem was reported quite a while ago, would you mind describing it again after testing with the latest Velero? We at CloudCasa will see if we can provide a fix if the problem still exists.

draghuram avatar Jun 28 '23 19:06 draghuram

I use tag v1.11.0 + velero-plugin-for-microsoft-azure:v1.6.0, deployed through Helm. Everything works as it should, but after redeploying or deleting a specific schedule, this metric still reports it, which is not correct for monitoring purposes. I need backup information only for existing schedules.

Jiris-MacBook-Pro:velero xx$ velero get backup
NAME                                  STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
man-backup                            Completed   0        0          2023-06-21 15:51:17 +0200 CEST   22d       default            <none>
velero-1687355039-XX-20230627103052   Completed   0        0          2023-06-27 12:30:52 +0200 CEST   28d       default            <none>
velero-1687355039-XX-20230626103051   Completed   0        0          2023-06-26 12:30:51 +0200 CEST   27d       default            <none>
velero-XX-pv-20230628103002           Completed   0        0          2023-06-28 12:30:02 +0200 CEST   29d       default            <none>

Jiris-MacBook-Pro:velero xx$ velero get schedules
NAME           STATUS    CREATED                          SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
velero-XX-pv   Enabled   2023-06-28 09:32:51 +0200 CEST   30 10 * * *   720h0m0s     22h ago       <none>     false

velero_backup_last_successful_timestamp return:

velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-1687355039-XX", service="velero"} | 1687861964
velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-XX-pv", service="velero"} | 1687948312

In my opinion velero-1687355039-XX has no business being there, because that schedule no longer exists.

And with alerting set up as ((time() - velero_backup_last_successful_timestamp) / 60 / 60 > 24), it doesn't work as it should.

PS: If that is not the goal of this metric, then another metric would be useful, one that only takes into account backups belonging to the current list of schedules.

neurobion avatar Jun 29 '23 08:06 neurobion

Agree with @Neurobion, I set up alerts recently for Velero backups, just to find out that they don't work as expected. We need a fix, as these metrics are not reliable for monitoring purposes.

jmuleiro avatar Jul 04 '23 12:07 jmuleiro

I've been looking for a workaround to this issue for hours. I checked the code (be aware I can barely understand Golang) and AFAIK this problem can be attributed to the backup finalizer controller, these lines in particular:

backupScheduleName := backupRequest.GetLabels()[velerov1api.ScheduleNameLabel]
switch backup.Status.Phase {
case velerov1api.BackupPhaseFinalizing:
	backup.Status.Phase = velerov1api.BackupPhaseCompleted
	r.metrics.RegisterBackupSuccess(backupScheduleName)
	r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc)
case velerov1api.BackupPhaseFinalizingPartiallyFailed:
	backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed
	r.metrics.RegisterBackupPartialFailure(backupScheduleName)
	r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure)
}
backup.Status.CompletionTimestamp = &metav1.Time{Time: r.clock.Now()}
recordBackupMetrics(log, backup, outBackupFile, r.metrics, true)

It seems that the finalizer controller is the starting point when it comes to backup deletion. Backup CRs get marked with a finalizer and subsequently a deletionTimestamp in Kubernetes, and then Velero starts the backup deletion process. The backup deletion itself could still be unsuccessful or get stuck somehow - this happened to me just yesterday - but the controller will still count what is actually a deletion attempt as a successful backup. It doesn't make sense.

It may very well be that they did this instead of implementing new metrics and updating the Kubernetes CRDs to support these new backup status phases. Nonetheless, because this is the current behavior, the metrics being updated by the finalizer controller are rendered essentially useless.

In my opinion, it would be great if the team could release a hotfix, deleting the lines pointed out above as most of us desperately need this fixed to be able to set up alerts based on these metric sets. If they are accepting pull requests, some of us would be willing to contribute to get this fixed as soon as possible.

jmuleiro avatar Jul 04 '23 16:07 jmuleiro

Thanks @Neurobion. We (at CloudCasa) will try to fix the issue.

draghuram avatar Jul 06 '23 19:07 draghuram

Currently, Velero does not clear the Prometheus counter after the schedule gets deleted. In the schedule's reconciler, we can add the following code to delete the counters, but that's not sufficient, because these counters are updated from the backup reconcilers too.

if c, ok := m.metrics[backupAttemptTotal].(*prometheus.CounterVec); ok {
    c.DeleteLabelValues(scheduleName)
}

nilesh-akhade avatar Aug 03 '23 13:08 nilesh-akhade
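A complete fix along those lines would have to clear the schedule's label value from every per-schedule vector, not just backupAttemptTotal. A self-contained sketch of that cleanup pattern (serverMetrics and RemoveSchedule are hypothetical names; in Velero these would be the real client_golang vector types and their DeleteLabelValues method):

```go
package main

import (
	"fmt"
	"sync"
)

// vec mimics a prometheus CounterVec just enough to show the cleanup
// pattern across several metrics.
type vec struct {
	mu     sync.Mutex
	series map[string]float64 // keyed by the schedule label value
}

func newVec() *vec { return &vec{series: map[string]float64{}} }

func (v *vec) Inc(schedule string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.series[schedule]++
}

func (v *vec) DeleteLabelValues(schedule string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.series, schedule)
}

// serverMetrics stands in for Velero's metrics struct (hypothetical).
type serverMetrics struct {
	vecs map[string]*vec // keyed by metric name
}

// RemoveSchedule drops the schedule's series from every per-schedule
// metric, so deleted schedules stop appearing on /metrics, no matter
// which reconciler last updated them.
func (m *serverMetrics) RemoveSchedule(scheduleName string) {
	for _, v := range m.vecs {
		v.DeleteLabelValues(scheduleName)
	}
}

func main() {
	m := &serverMetrics{vecs: map[string]*vec{
		"backup_attempt_total": newVec(),
		"backup_success_total": newVec(),
	}}
	m.vecs["backup_attempt_total"].Inc("nightly")
	m.vecs["backup_success_total"].Inc("nightly")

	// Called from the schedule reconciler when the schedule is deleted:
	m.RemoveSchedule("nightly")
	fmt.Println(len(m.vecs["backup_success_total"].series)) // 0
}
```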

Hello @draghuram, any news on this issue?

jmuleiro avatar Aug 17 '23 15:08 jmuleiro

@nilesh-akhade created a PR that we are internally reviewing before submitting to Velero. Will post here when it is ready.

draghuram avatar Aug 17 '23 18:08 draghuram