
Integrated Support for RWO PVCs File Backups


Summary

As a user of K8up, I want to back up PVCs with RWO access mode, so that my precious files stored in RWO volumes are backed up as well.

Context

Backing up RWO volumes is currently not directly supported by K8up and is only possible via workarounds.

Out of Scope

Further links

  • https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity

Acceptance criteria

Given a PVC with RWO access mode in Bound phase
And the PVC has the `k8up.io/backup=false` annotation
When a backup is scheduled
Then the backup for this PVC is skipped

Given a PVC with RWO access mode _not_ in Bound phase
When a backup is scheduled
Then the backup for this PVC is skipped

Given a PVC with RWO access mode in Bound phase
And an application Pod has mounted that PVC
When a backup runs
Then the backup Pod runs on the same node as the app Pod.
And the files are backed up.

Given a PVC with RWO access mode in Bound phase
And an application Pod has mounted that PVC
And the app Pod has the `k8up.io/backupcommand` annotation
When a backup runs
Then the backup is done via the existing backup command

Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
When a backup runs
Then K8up fails the backup since it can't reliably determine the target node where the backup Pod should run.

Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
And the linked PV has a NodeAffinity rule defined
When a backup runs
Then the backup Pod runs on the same node as configured via NodeAffinity spec.
And the files are backed up.

Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
And the PVC has the `k8up.io/hostname` annotation
When a backup runs
Then the backup Pod runs on the node configured via the annotation.
And the files are backed up.

Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
And the PVC has the `k8up.io/hostname` annotation
And the linked PV has a NodeAffinity rule defined
When a backup runs
Then the backup Pod runs on the same node as configured via NodeAffinity spec, since that takes precedence in Kubernetes
And the files are backed up.
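
As a minimal illustration of the first scenario above, a PVC opted out of backups could look roughly like this; only the `k8up.io/backup=false` annotation comes from the criteria, all other names and sizes are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: skip-me                    # placeholder name
  annotations:
    k8up.io/backup: "false"        # K8up skips this PVC during scheduled backups
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi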

Implementation Ideas

During the implementation of #12 we found that, with the correct Pod affinity, an RWO PV can be mounted to a second Pod, given that it's on the same host.

With that information it should be possible to support file backups on RWO PVCs with the following process:

  • one backup job for all RWX PVCs and BackupCommands, as already implemented
  • for each RWO PVC marked for backup, we'll spawn another job with the correct Pod affinity, so it will be scheduled on the node where the RWO PVC is mounted

File backups are much more reliable and faster. Until now we have used BackupCommands to stream backups for RWO PVCs via stdin/stdout to wrestic.

Of course, any application-aware backups (DB dumps, etc.) should still be done via BackupCommands.

Unfortunately, unlike the backup commands, the Pod affinity needs to be set by the operator, which has its own set of problems:

  • the Pod affinity is fixed at creation time, so if something changes on the hosts (reboots, etc.) the affinity may no longer match correctly -> may be solvable via clever Pod affinity and retries
  • a lot more Pods may get created during backups (performance issues?)
  • but things can be run in parallel

tobru avatar Jan 21 '21 16:01 tobru

This would really make k8up the Swiss army knife of k8s backups :)

Troyhy avatar Sep 02 '21 18:09 Troyhy

With "Introducing Single Pod Access Mode for PersistentVolumes" (ReadWriteOncePod access mode) we'll have to absolutely find a proper way to back up data from Pods using volumes with these access mode(s).

EphemeralContainers come to mind for that. See also Ephemeral Containers.
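
A rough sketch of what such an ephemeral backup container could look like when added via the Pod's ephemeralcontainers subresource; the container name, image and command are purely illustrative and not an existing K8up feature:

ephemeralContainers:
  - name: k8up-backup                    # hypothetical container name
    image: ghcr.io/k8up-io/k8up:latest   # illustrative image
    command: ["restic", "backup", "/data"]
    volumeMounts:
      - name: data                       # must reference a volume already defined in the Pod
        mountPath: /data
        readOnly: true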

tobru avatar Sep 13 '21 11:09 tobru

If this has to be implemented, an idea might be to switch to ephemeral containers as the default way to do backups and restores. It may reduce the complexity, as there would be just one way to do backups and restores (except for the stdout backups), instead of the alternative of running ephemeral containers for RWOP, Jobs for regular backups and restores, etc.

cimnine avatar Sep 14 '21 11:09 cimnine

the Pod affinity is fixed at creation time, so if something changes on the hosts (reboots, etc.) the affinity may no longer match correctly -> may be solvable via clever Pod affinity and retries

a lot more Pods may get created during backups (performance issues?)

How about deploying a DaemonSet to back up volumes? The DaemonSet would just mount the host path /var/lib/kubelet/pods/, so its pods can access volumes directly. K8up jobs would just send the operation command to the appropriate DaemonSet pod.
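
A very rough sketch of such a node agent DaemonSet, assuming a hypothetical name and an illustrative image; this is not how K8up works today:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: k8up-node-agent                    # hypothetical name
spec:
  selector:
    matchLabels:
      app: k8up-node-agent
  template:
    metadata:
      labels:
        app: k8up-node-agent
    spec:
      containers:
        - name: agent
          image: ghcr.io/k8up-io/k8up:latest   # illustrative image
          volumeMounts:
            - name: kubelet-pods
              mountPath: /var/lib/kubelet/pods
              readOnly: true
      volumes:
        - name: kubelet-pods
          hostPath:
            path: /var/lib/kubelet/pods    # gives direct access to all volumes on the node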

However, this approach would break the current design architecture. If it doesn't make sense, just forget it.

jpsn123 avatar Sep 16 '21 01:09 jpsn123

I have no real knowledge about the k8up internals, but this is how I do it manually.

Launch an alpine pod on the same node as the workload. The pod has a volume claim to the workload's volume and the backup is run. The pod is deleted afterwards.
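
A minimal sketch of such a manually launched pod, with node name, claim name and backup command as placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: manual-backup
spec:
  nodeName: worker-1                 # placeholder: node where the workload pod is running
  restartPolicy: Never
  containers:
    - name: backup
      image: alpine:3
      command: ["sh", "-c", "tar czf /tmp/backup.tgz -C /data ."]   # whatever backup tooling you use
      volumeMounts:
        - name: data
          mountPath: /data
          readOnly: true
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: workload-data     # placeholder: the workload's RWO PVC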

So k8up could launch a batch/v1 Job with the correct affinity rules, monitor the execution of that job, and retry if needed. This backup job could have some settings or even a template that you could tailor: resource limits etc.

Of course this might be a really different approach and need even more changes to the architecture.

Troyhy avatar Sep 16 '21 05:09 Troyhy

@jpsn123 That's actually how Velero does their backups. It's a very simple solution, but I fear there might be some security concerns with this approach. Velero's design is more intended for single-tenant clusters, whereas k8up is intended for multi-tenant clusters where each tenant should only be able to back up their own PVCs.

One idea I had for how we could solve this is similar to @Troyhy's: create a job for each Pod that has a volume attached to it. Then we can use pod affinity to co-schedule them on the same node. We could then also add some concurrency settings so that X such backup jobs run in parallel. That could even speed up the whole process a bit. If there are performance concerns on a specific cluster, the setting can be set to only run one pod at a time, so the performance would be the same as now. The drawback is that only running pods will be backed up. But with the pre-backup pod templates it's possible to simply spin up pods that mount any PVCs.
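
A hedged sketch of how such a per-PVC job could be co-scheduled next to the app pod via pod affinity; the labels, image and claim name are assumptions, not the actual K8up implementation:

apiVersion: batch/v1
kind: Job
metadata:
  name: backup-my-app-data               # hypothetical, one Job per RWO PVC
spec:
  template:
    spec:
      restartPolicy: OnFailure
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: my-app            # labels of the pod that currently mounts the RWO PVC
      containers:
        - name: backup
          image: restic/restic           # illustrative; repository/credentials config omitted
          args: ["backup", "/data"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-app-data       # placeholder: the RWO PVC to back up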

Another idea I had: using CSI snapshots and then mounting the snapshots into the k8up backup pods. The problem is that this requires a CSI provider that supports snapshots.

Kidswiss avatar Sep 16 '21 06:09 Kidswiss

Another way of doing things is to have a sidecar container attached to each pod one wants to back up, which is always running and just listening for some signal. The k8up Job could then simply signal each running sidecar to perform backups. This would also work for the new ReadWriteOncePod volume mode, since the sidecar would be in the same pod.

HubbeKing avatar Sep 16 '21 06:09 HubbeKing

@Kidswiss CSI is not recommended here. As a Swiss army knife, K8up should minimize external dependencies. I actually use CSI snapshots as my primary backup and K8up as my secondary backup.

Considering that K8up is multi-tenant, this is a great design and what I want. I think the best solution is to use the Job approach you mentioned:

Create a job for each Pod that has a volume attached to it.

jpsn123 avatar Sep 17 '21 03:09 jpsn123

@jpsn123 While I find the idea of backing up CSI snapshots very elegant, I agree with you. Replacing "we can only back up RWX volumes" with "we can only back up CSI volumes" will probably lead to another ticket like this down the road.

@HubbeKing We decided against such an approach from the start. When we started with k8up there were other backup solutions that did it that way. That was actually one of the reasons why we started k8up: patching other pods and deployments outside the scope of the backup system feels like the wrong approach to this problem.

Kidswiss avatar Sep 17 '21 06:09 Kidswiss

Fair! I'm just personally not convinced that EphemeralContainers are the right choice, given their Alpha nature. They would be perfect for doing backups regardless of the volume type since they can attach to any pod, but since they remain Alpha you'd need to ensure that your Kubernetes distribution properly handles the EphemeralContainers Feature Gate.

And that could make deploying k8up more complicated - I have no idea if managed kubernetes solutions in the cloud even let you set Feature Gates, for instance. And given that they entered Alpha in 1.16 and are still in Alpha as of 1.22, it could be a while before the feature enters Beta and gets turned on by default.

HubbeKing avatar Sep 17 '21 06:09 HubbeKing

EphemeralContainers will move to beta in 1.23, see KEP-277.

We should find a way to have different ways for RWO backups, where the user can choose the best matching way for their infrastructure.

tobru avatar Sep 17 '21 08:09 tobru

EphemeralContainers - this only works for existing pods, but some applications may use cron jobs or their own flow to prepare a backup and leave data in an unmounted PV, which needs to be processed with k8up later.

R-omk avatar Sep 17 '21 08:09 R-omk

EphemeralContainers - this only works for existing pods, but some applications may use cron jobs or their own flow to prepare a backup and leave data in an unmounted PV, which needs to be processed with k8up later.

We have that covered in K8up with PreBackupPod (See https://k8up.io/k8up/1.2/how-tos/prebackuppod.html)

tobru avatar Sep 17 '21 09:09 tobru

@HubbeKing That's why I brought up the "job per PVC" idea. As @tobru said we don't have to implement just one way to do it, but multiple.

The candidates I see right now to solve the problem:

  • Job per PVC -> this can be used on any version of k8s and any storage backend
  • EphemeralContainers -> great if supported, so could be considered for when they are beta/GA
  • CSI snapshots -> could be an additional feature for people that have CSI providers capable of doing snapshots
  • Velero style -> mounting /var/lib/kubelet/pods/ and working through that (my least favorite, though)

To get the most compatibility right now I'd tend towards the "job per PVC" option first. It would also be the simplest one to implement given the current state of k8up. With configurable parallelism this could also greatly improve the speed for cases where there are a ton of PVCs.

@tobru @cimnine @ccremer what do you think?

Kidswiss avatar Sep 17 '21 09:09 Kidswiss

I'd start with a PoC for job per PVC. It would be interesting to see how it behaves in some failure scenarios (mainly node/pod reboots).

ccremer avatar Sep 17 '21 16:09 ccremer

We looked into this in more detail and proposed a workflow for how RWO backups could be done. It is detailed in the acceptance criteria of the issue description.

In short: a PVC backup with RWO is done by running the backup in a pod that runs on the same node. To do this, we determine the node with the following precedence:

  1. For a PVC, figure out if there's a Pod using this PVC, then use the same hostname.
  2. If the PVC is not mounted, see if the PV has a NodeAffinity rule configured and use this (relevant for local volumes or hostpaths and other storage backends).
  3. If the PVC has an explicit annotation with a node selector, then use this (see the sketch after this list).
  4. Fail/skip the backup. At this point we can't reliably determine on which node the volume is mountable, and here we'd rather fail the backup than back up wrong/empty data.
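
For item 3, a PVC carrying the proposed node annotation could look roughly like this; only the `k8up.io/hostname` annotation comes from the acceptance criteria, everything else is a placeholder:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: unmounted-data               # placeholder name
  annotations:
    k8up.io/hostname: worker-2       # proposed annotation: node where the backup Pod should run
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi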

If you have any feedback, please let us know.

ccremer avatar Dec 02 '21 14:12 ccremer

@ccremer

Maybe one note for the proposal:

a lot more Pods may get created during backups (performance issues?)

We really should limit the parallel backups somehow. Restic can use quite a bit of CPU and memory. It also generates quite some IO load on the PV backend if all backups run at the same time.

Kidswiss avatar Dec 02 '21 14:12 Kidswiss

Good point on parallelism, though I'm not sure about the impact. Is this concern more about having multiple Pods per restic repository, or did you really mean per PV? Should we lock RWO backups to 1 pod per PVC? Or 1 pod per namespace...? Should the other backups just skip or wait in line?

ccremer avatar Dec 02 '21 14:12 ccremer

@ccremer the repository should not be the issue, except if it's some self-hosted MinIO or so. Restic allows parallel backups to the same repo. My concern is the performance impact on the nodes and the storage.

Okay let me elaborate a bit more how this could impact a cluster:

Let's say we have 3 nodes. Also, let's assume the cluster is not very well balanced and one node has many more pods with mounted PVCs than the others.

Node 1: 5 pods with PVCs
Node 2: 2 pods with PVCs
Node 3: 1 pod with a PVC

Now if we schedule all the backup jobs at the same time for each PVC, then node 1 would get hammered with 5 pods running restic, while the other nodes would have a massively lower load. Depending on the amount of files to back up, this can easily require a pretty huge chunk of the node's CPU and memory. From the storage point of view, the backend (Rook, for example) gets hammered with 8 pods producing quite an amount of IO in a short time, potentially further harming the performance of the whole cluster.

To mitigate this we would only run 1 backup pod per namespace at a time. Each pod would be responsible for backing up exactly 1 volume. Once it's done, the next pod would start, until all PVCs have been backed up. This way it should have basically the same performance impact as we have today.

This has some nice pros: we could make the parallelism adjustable and allow X pods per namespace for larger clusters. It could also make it possible to handle some issues with the affinity. For example, if the affinity becomes invalid between the time the operator sets it and the time the pod actually runs, we could reschedule the stuck pods.

What do you think?

Kidswiss avatar Dec 02 '21 15:12 Kidswiss

I see. Thanks for the scenario.

I think this is primarily a problem because we decided on an opt-out, target-all-PVCs design years ago. If a schedule only targeted a specific PVC/pod as defined in some spec, then each backup schedule could avoid this spike simply by having slightly different schedules.

Question is, do we want to do something against this? Are we willing to introduce more magic and special behavior for performance concerns that may potentially depend on certain storage backends anyway?

Or is this something we can accept without introducing throttling/locking, revisiting the topic when it actually becomes a problem? I'm not sure if adding more config options to adjust parallelism simplifies K8up administration. Rescheduling Pods after a changed node topology doesn't sound like a problem we really want to deal with; it only makes K8up harder to understand and maintain.

What are the chances that this node imbalance is a problem where these pods are ALSO in the same schedule? I think this problem could be mitigated by documenting the behaviour and some caveats, and by using randomized schedules.

ccremer avatar Dec 02 '21 18:12 ccremer

@ccremer, as a quick solution it may be enough to use pod anti-affinity.
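
A rough sketch of what such an anti-affinity could look like, assuming the backup pods carry a common (hypothetical) label; preferred rather than required scheduling, so it only spreads the load without blocking the node-pinned RWO pods:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              k8up.io/backup-pod: "true"   # hypothetical label carried by all backup pods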

R-omk avatar Dec 02 '21 20:12 R-omk

@ccremer Sure we could start with the easier approach and get some actual real life experience on the impact.

Kidswiss avatar Dec 03 '21 07:12 Kidswiss

Would it make sense to focus this on CSI providers that implement the snapshot feature (e.g. rook-ceph) with the help of external-snapshotter?

It would be neat to see the following take place:

  1. Cluster admin deploys external-snapshotter and creates a VolumeSnapshotClass that has the annotation k8up.io/is-snapshot-class: "true":
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
  annotations:
    k8up.io/is-snapshot-class: "true"
driver: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
deletionPolicy: Delete
  2. Cluster admin labels a PVC k8up.io/backup-volume: "enabled":
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awesome-config-v1
  namespace: default
  labels:
    k8up.io/backup-volume: "enabled"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-block
  3. k8up creates a CSI snapshot based on some configured schedule (see the sketch after this list).
  4. When the snapshot completes, k8up turns it into a temporary PV/PVC, then creates a job (with a container for restic), mounts the temporary PVC to it and runs a restic copy to durable storage.
  5. When that job is done, k8up removes the temporary PVC.
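
For steps 3 and 4, the objects k8up would create could look roughly like this; the snapshot and temporary PVC names are placeholders:

---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: awesome-config-v1-snap           # placeholder snapshot name
  namespace: default
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: awesome-config-v1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awesome-config-v1-snap-restore   # temporary PVC created from the snapshot
  namespace: default
spec:
  storageClassName: ceph-block
  dataSource:
    name: awesome-config-v1-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi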

This is a pretty high-level view of what could be done and there's a lot left out, but I think there is a pretty good sweet spot in supporting CSI providers that have snapshotting functions.

onedr0p avatar Jan 05 '22 21:01 onedr0p

Hi @onedr0p, thanks for your suggestion. Our idea is to support RWO backups that aren't exclusive to volumes provisioned by a CSI provider. We might later optimize backups under certain conditions, like CSI snapshots, but I don't want to make promises here. The first goal is to support RWO on generic PVs. There are a lot of people that aren't using CSI-provisioned volumes, our clusters/customers included.

ccremer avatar Jan 06 '22 09:01 ccremer

Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
And the linked PV has a NodeAffinity rule defined
When a backup runs
Then the backup Pod runs on the same node as configured via NodeAffinity spec.
And the files are backed up.

If the specified node has taints defined, we'd need to fetch the node in order to derive the necessary tolerations. If a pod is bound we can take them from the pod, but with only a node we'd need permissions to fetch nodes. Not sure if this is desired?

mweibel avatar Jan 26 '23 10:01 mweibel

@mweibel not sure why taints are relevant. PVs can have a node affinity, see https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity. I believe that Kubernetes automatically schedules the pod on that node (tainted or not -> eventual consistency). It's possible that this requirement is a no-brainer.

ccremer avatar Jan 26 '23 11:01 ccremer

@mweibel I agree with @ccremer's assessment. For unmounted PVCs the pods should be scheduled on the right node without any logic from our side.

However, we might need to treat unmounted volumes separately from the others, otherwise it could mess up how we decided to schedule the pods.

Kidswiss avatar Jan 26 '23 12:01 Kidswiss

@mweibel not sure why taints are relevant. PVs can have a node affinity, see https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity. I believe that Kubernetes automatically schedules the pod on that node (tainted or not -> eventual consistency). It's possible that this requirement is a no-brainer.

Not sure if I wasn't clear enough, let me rephrase:

  • PV has nodeAffinity set to a node which is tainted
  • PVC bound to that PV
  • any pod that wants to use that PVC/PV combo needs to have a toleration for the taints on the node

So, given the above acceptance criteria: to create a pod which is able to back up the PVC, the pod needs to tolerate the taints of the node.

Does that make it more clear?

Example: Taint target node:

$ kubectl taint node k8up-v1.24.4-worker2 test=bar:NoSchedule

Apply PV/PVC:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0003
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: manual
  hostPath:
    path: /tmp
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - k8up-v1.24.4-worker2
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    # So it works in KIND
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: manual
---

You can't use this PVC without tolerating the taint test=bar:NoSchedule.
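
The backup pod spec would then need a matching toleration, something along these lines:

tolerations:
  - key: test
    operator: Equal
    value: bar
    effect: NoSchedule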

mweibel avatar Jan 26 '23 12:01 mweibel

FTR @Kidswiss and I discussed this and decided to not implement this particular case. I created a follow-up task #805 with necessary details.

mweibel avatar Jan 26 '23 13:01 mweibel

Ah, now I get it. Normally nodes are tainted temporarily, e.g. for disk or memory pressure, so I kind of assumed that it would eventually be resolved. However, there can be nodes that are purposefully tainted for some other reason.

Probably best tackled in another story though as you wrote :+1:

ccremer avatar Jan 26 '23 13:01 ccremer