
[Epic] support running multiple velero backups/restores concurrently

Open ncdc opened this issue 7 years ago • 17 comments

We need to account for a few things:

  1. If two server pods run simultaneously, they might both pick up the same Backup or Restore in the New state. Both would attempt to change the state to InProgress, and both would likely succeed, resulting in undesirable behavior.
  2. If a Backup or Restore is InProgress and the server terminates for any reason (scaled down, crashes, terminates normally), the replacement server process should ideally pick up whatever was in progress instead of letting it linger.
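For the first problem, one option (a minimal sketch, not an agreed implementation; it leans on Kubernetes optimistic concurrency, and whether Velero would update the status directly or via the status subresource is an open question) is to claim an item with an Update so that only one of two racing servers succeeds:

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// claimBackup tries to move a New backup to InProgress. Because the Update
// carries the resourceVersion the object was read at, only one of two racing
// servers succeeds; the loser gets a 409 Conflict and skips the item.
func claimBackup(ctx context.Context, c client.Client, key client.ObjectKey) (bool, error) {
	backup := &velerov1.Backup{}
	if err := c.Get(ctx, key, backup); err != nil {
		return false, err
	}
	if backup.Status.Phase != velerov1.BackupPhaseNew {
		return false, nil // already claimed by someone else
	}
	backup.Status.Phase = velerov1.BackupPhaseInProgress
	if err := c.Update(ctx, backup); err != nil {
		if apierrors.IsConflict(err) {
			return false, nil // lost the race to another server
		}
		return false, err
	}
	return true, nil
}
```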

ncdc avatar May 14 '18 18:05 ncdc

From a usability perspective, I think this is a particularly important issue for us to tackle. There are a number of reasons why two server pods might be running simultaneously, and we need to handle that gracefully.

In addition, there are lots of ways an InProgress backup can get stuck when a server exits. Again, we need to handle this gracefully. The metrics in #84 (a gauge showing the current number of InProgress or Failed backups) may help visualize these issues, but they won't fix them.
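As a side note, such a gauge is straightforward with client_golang; a minimal sketch (the metric name and label are made up for illustration):

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// backupsByPhase tracks how many backups are currently in each phase,
// e.g. InProgress or Failed. The metric name and label are illustrative.
var backupsByPhase = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "velero_backups_by_phase",
		Help: "Number of backups currently in each phase.",
	},
	[]string{"phase"},
)

func init() {
	prometheus.MustRegister(backupsByPhase)
}

// setPhaseCount would be called after each resync with the current counts.
func setPhaseCount(phase string, count int) {
	backupsByPhase.WithLabelValues(phase).Set(float64(count))
}
```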

rosskukulinski avatar Jun 14 '18 04:06 rosskukulinski

A backup/restore won't get stuck during a normal shutdown because the Ark server waits for all in-progress work to complete before terminating. If that work takes too long and exceeds the pod's deletion grace period, however, Kubernetes will forcefully kill the container, interrupting the in-progress work before it has a chance to finish.

There are, however, plenty of situations where the Ark server could exit while doing work:

  • exceeding the grace period on a normal shutdown
  • OOM killed
  • a bug of some sort that causes a crash

This is definitely something we need to handle.

ncdc avatar Jun 14 '18 11:06 ncdc

This needs a quick test (from code) to trigger it:

  • get a Backup that's New
  • patch it to InProgress
  • patch it to InProgress again (this should fail, since it's already in progress)
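A rough sketch of that test, assuming the phase transition is done with an Update so a second writer holding a stale copy gets a conflict; newTestClient is a hypothetical helper (e.g. built on envtest) that returns a client seeded with a Backup named "b1" in phase New:

```go
package sketch_test

import (
	"context"
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// TestDoubleClaim simulates two workers grabbing the same New backup. Both
// read the same copy; the first Update wins, the second must get a conflict.
func TestDoubleClaim(t *testing.T) {
	ctx := context.Background()
	c := newTestClient(t) // hypothetical helper
	key := client.ObjectKey{Namespace: "velero", Name: "b1"}

	first, second := &velerov1.Backup{}, &velerov1.Backup{}
	if err := c.Get(ctx, key, first); err != nil {
		t.Fatal(err)
	}
	if err := c.Get(ctx, key, second); err != nil {
		t.Fatal(err)
	}

	first.Status.Phase = velerov1.BackupPhaseInProgress
	if err := c.Update(ctx, first); err != nil {
		t.Fatalf("first claim should succeed: %v", err)
	}

	second.Status.Phase = velerov1.BackupPhaseInProgress
	if err := c.Update(ctx, second); !apierrors.IsConflict(err) {
		t.Fatalf("second claim should fail with a conflict, got: %v", err)
	}
}
```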

rosskukulinski avatar Aug 06 '18 20:08 rosskukulinski

I had a thought about how to implement this.

Each Ark server process is assigned a unique identifier: the name of its pod (we can get the value using the Downward API and pass it to the Ark server as a flag).
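For reference, injecting the pod name via the Downward API could look roughly like this, sketched with the corev1 types (the flag name and image are invented for illustration):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// serverContainer shows how the pod name could be exposed to the server
// process via the Downward API and passed as a flag. The --server-id flag
// and the image value are hypothetical.
func serverContainer() corev1.Container {
	return corev1.Container{
		Name:    "ark",
		Image:   "ark:latest", // illustrative
		Command: []string{"/ark", "server", "--server-id=$(POD_NAME)"},
		Env: []corev1.EnvVar{{
			Name: "POD_NAME",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
			},
		}},
	}
}
```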

Each controller worker is also assigned a unique identifier.

When a new item (backup, restore) is processed by a controller, the first thing the controller attempts to do is set status.arkServerID and status.workerID. Assuming that succeeds without a conflict, the worker can proceed to do its work.

When a worker sees an InProgress item, it checks status.arkServerID:

  • If no running pod matches that name, the worker resets the status back to New for reprocessing.
  • If a running pod matches that name and it is this Ark server, the worker resets the status to New only if no active worker matches status.workerID.

The controller would also need to add event handlers for pods. Upon a change, we'd want to reevaluate all InProgress items to see if they need to be taken over.
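A minimal sketch of the takeover check described above (illustrative only: the proposed status.arkServerID field doesn't exist, so the recorded pod name is passed in as a parameter, and the policy itself is not an agreed design):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// maybeRequeueOrphan resets an InProgress backup back to New when the server
// pod recorded for it (the proposed status.arkServerID) no longer exists.
func maybeRequeueOrphan(ctx context.Context, c client.Client, backup *velerov1.Backup, ns, claimingPod string) error {
	if backup.Status.Phase != velerov1.BackupPhaseInProgress {
		return nil
	}
	owner := &corev1.Pod{}
	err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: claimingPod}, owner)
	switch {
	case apierrors.IsNotFound(err):
		// The claiming server is gone: hand the item back for reprocessing.
		backup.Status.Phase = velerov1.BackupPhaseNew
		return c.Update(ctx, backup)
	case err != nil:
		return err
	default:
		return nil // the claiming server still exists; leave it to its worker
	}
}
```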

There's probably a lot more to flesh out here, but I wanted to write this down before I forgot it.

ncdc avatar Oct 11 '18 19:10 ncdc

It would be interesting to be able to either limit the number of concurrent tasks or keep the current behavior (a backup queue).

It would be better for me, because I would like to limit the load on my shared file servers (Ceph RBD / CephFS) while backups are being taken. This way, I can ensure that workloads are not impacted too much by the backup tasks.

xmath279 avatar Jul 17 '20 23:07 xmath279

@xmath279 Thanks for that feedback!

nrb avatar Aug 11 '20 18:08 nrb

This will be done after the design is finished: https://github.com/vmware-tanzu/velero/issues/2601

dsu-igeek avatar Feb 22 '21 22:02 dsu-igeek

This also limits our scaling.

An idea would be to split Velero servers into shards based on labels. All that should be needed is for Velero to reconcile backups matching a label selector and ignore the rest. Imagine:

deployment velero1: watches label velero-shard=1
deployment velero2: watches label velero-shard=2

backup1: label velero-shard=1
backup2: label velero-shard=2

There should be no further interaction at the level of the custom resource.
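With controller-runtime, this kind of sharding could be sketched as a label-selector predicate on the Backup controller (illustrative only; the velero-shard label and the reconciler wiring are assumptions, not an existing Velero option):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// addShardedBackupController wires a Backup reconciler that only sees objects
// labeled with its shard, e.g. velero-shard=1. The label key and the
// reconciler are illustrative.
func addShardedBackupController(mgr ctrl.Manager, shard string, r reconcile.Reconciler) error {
	shardOnly, err := predicate.LabelSelectorPredicate(metav1.LabelSelector{
		MatchLabels: map[string]string{"velero-shard": shard},
	})
	if err != nil {
		return err
	}
	return ctrl.NewControllerManagedBy(mgr).
		For(&velerov1.Backup{}, builder.WithPredicates(shardOnly)).
		Complete(r)
}
```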

WDYT? cc @nrb

Oblynx avatar Oct 27 '21 07:10 Oblynx

This is a major drawback of the tool, especially when using the restic integration. A DevOps approach where we offer Velero as a service to different teams so they can plan their own backup policies for their applications is chaos without parallelism.

This is not only bad for DevOps; it also very negatively affects any RTO and creates a lot of uncertainty around RPOs, because you know when you schedule a backup but not when its turn will come in a queue shared across all the different application teams in a cluster.

It seems that this feature request is not marked as P1 - Important; maybe this could be reconsidered? @eleanor-millman

pupseba avatar Feb 04 '22 10:02 pupseba

Thanks for the points. We will be reviewing this in a few weeks when we go through the items still open in the 1.8 project.

eleanor-millman avatar Feb 11 '22 21:02 eleanor-millman

Hello 🖖,

Any news on the feature for running multiple jobs at the same time?

qdupuy avatar May 02 '22 12:05 qdupuy

Hi @qdupuy, no immediate news. I can tell you that parallelization of Velero (which would probably include this work) is on our radar, but we are first focusing on other work, like adding a data mover to Velero and bringing the CSI plugin to GA.

eleanor-millman avatar May 03 '22 21:05 eleanor-millman

hi folks ✋ any updates here?

ugur99 avatar Feb 24 '25 16:02 ugur99

@ugur99 It's on the roadmap, but work has not yet begun on the feature. It might make it into Velero 1.17 or 1.18 (for concurrent backups). Concurrent restores are not on the immediate roadmap, since the need for that feature is quite a bit lower.

sseago avatar Feb 24 '25 17:02 sseago

I would argue that the need for concurrent restores is much higher than the need for concurrent backups. In a disaster recovery event, the need to recover/restore as quickly as possible is paramount, as it directly impacts business continuity. The need to back up concurrently is not as critical, since backups are typically scheduled on a daily or weekly basis, so it matters less how long they take.

Design-wise, the controller could perhaps assign each Restore to be executed by its own Job. As a user, I could then choose to create multiple smaller Backup objects (e.g. one per Namespace) and restore each of them concurrently, with multiple "restore jobs" running at the same time, each restoring a different Backup. See the sketch below.
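A rough sketch of that idea, assuming a controller that creates one Job per Restore (the image, the "run-restore" subcommand, and the wiring are all invented for illustration; a later comment notes the project went a different way for backups):

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// jobForRestore sketches a Job that would run a single restore to completion.
// The image and the hypothetical "run-restore" subcommand are illustrative.
func jobForRestore(restore *velerov1.Restore) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "restore-" + restore.Name,
			Namespace: restore.Namespace,
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "restore",
						Image:   "velero/velero:latest", // illustrative
						Command: []string{"/velero", "run-restore", "--restore-name", restore.Name},
					}},
				},
			},
		},
	}
}
```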

illrill avatar May 22 '25 12:05 illrill

Backups happen much more frequently and across much of the cluster; restores are a much rarer event.

Restores need to be synchronized so that restore1 doesn't create something that an overlapping restore2 would collide with.

kaovilai avatar May 22 '25 13:05 kaovilai

During the earlier design discussions, running separate jobs for parallel backups was rejected in favor of using the MaxConcurrentReconciles controller-runtime setting to allow more than one reconcile at the same time.

If/when we extend this to restore, we will most likely take the same approach there.

It is worth pointing out that for backups or restores where copying volume data is the bottleneck (i.e. backups or restores of large-ish PVs), that part is already done in parallel. Once a backup or restore is just waiting for DataUploads or DataDownloads to complete, the next one can begin, even with the current Velero release.
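For reference, MaxConcurrentReconciles is a standard controller-runtime option; wiring it up looks roughly like this (a sketch, not Velero's actual controller setup):

```go
package sketch

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// addBackupController registers a Backup reconciler that may run up to
// `workers` reconciles in parallel instead of the default of one.
func addBackupController(mgr ctrl.Manager, r reconcile.Reconciler, workers int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&velerov1.Backup{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: workers}).
		Complete(r)
}
```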

sseago avatar Jun 02 '25 16:06 sseago