Integrated Support for RWO PVCs File Backups
Summary
As a user of K8up, I want to back up PVCs with RWO access mode, so that my precious files stored in RWO volumes are backed up as well.
Context
Backing up RWO volumes is currently not directly supported by K8up and is only possible via workarounds.
Out of Scope
Further links
- https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity
Acceptance criteria
Given a PVC with RWO access mode in Bound phase
And the PVC has the `k8up.io/backup=false` annotation
When a backup is scheduled
Then the backup for this PVC is skipped
Given a PVC with RWO access mode _not_ in Bound phase
When a backup is scheduled
Then the backup for this PVC is skipped
Given a PVC with RWO access mode in Bound phase
And an application Pod has mounted that PVC
When a backup runs
Then the backup Pod runs on the same node as the app Pod.
And the files are backed up.
Given a PVC with RWO access mode in Bound phase
And an application Pod has mounted that PVC
And the app Pod has the `k8up.io/backupcommand` annotation
When a backup runs
Then the backup is done via the existing backup command
Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
When a backup runs
Then K8up fails the backup, since it can't reliably determine the target node where the backup Pod should run.
Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
And the linked PV has a NodeAffinity rule defined
When a backup runs
Then the backup Pod runs on the same node as configured via NodeAffinity spec.
And the files are backed up.
Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
And the PVC has the `k8up.io/hostname` annotation
When a backup runs
Then the backup Pod runs on the node configured via the annotation.
And the files are backed up.
Given a PVC with RWO access mode in Bound phase
And no application Pod has mounted that PVC
And the PVC has the `k8up.io/hostname` annotation
And the linked PV has a NodeAffinity rule defined
When a backup runs
Then the backup Pod runs on the same node as configured via the NodeAffinity spec, since that takes precedence in Kubernetes.
And the files are backed up.
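For illustration, a PVC carrying the annotations referenced in the criteria above might look like the sketch below. `k8up.io/backup` is an existing K8up annotation; `k8up.io/hostname` is only proposed in this issue and doesn't exist yet; all names and values are illustrative.

```yaml
# Sketch only: names and values are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  annotations:
    k8up.io/backup: "true"       # set to "false" to skip this PVC during backups
    k8up.io/hostname: worker-2   # proposed in this issue: pin the backup Pod to this node
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```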
Implementation Ideas
During the implementation of #12 we found that, with the correct Pod affinity, an RWO PV can be mounted into a second Pod, provided it's on the same host.
With that information it should be possible to support file backups on RWO PVCs with the following process:
- one backup job for all RWX PVCs and BackupCommands, as already implemented
- for each RWO PVC marked for backup, we spawn an additional job with the correct Pod affinity, so it gets scheduled on the node where the RWO PVC is mounted (see the sketch below)
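A minimal sketch of such a per-PVC backup Job, assuming the application Pod that mounts the RWO PVC carries the label `app: my-app` (label, image and names are illustrative and not what K8up actually generates):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backup-data-pvc                  # illustrative name
spec:
  template:
    spec:
      # Co-schedule the backup Pod on the same node as the app Pod that mounts the PVC.
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app            # assumption: labels of the Pod mounting the RWO PVC
              topologyKey: kubernetes.io/hostname
      containers:
        - name: backup
          image: ghcr.io/k8up-io/k8up:latest   # image and invocation are illustrative only
          volumeMounts:
            - name: data
              mountPath: /data
              readOnly: true
      restartPolicy: OnFailure
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-pvc          # the RWO PVC to back up
```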
File backups are much more reliable and faster. Until now we use the BackupCommands to stream backups via stdin/stdout to wrestic for RWO PVCs.
Of course, anything application-aware (DB dumps, etc.) should still be done via BackupCommands.
Unfortunately, unlike the backup commands, the Pod affinity needs to be set by the operator, which has its own set of problems:
- the Pod affinity is fixed at the time of creation, so if something changes on the hosts (reboots, etc.) it may not match correctly anymore -> may be solvable via clever pod affinity and retries
- a lot more Pods may get created during backups (performance issues?)
- but things can be run in parallel
This would really make k8up the Swiss army knife of k8s backups :)
With "Introducing Single Pod Access Mode for PersistentVolumes" (ReadWriteOncePod access mode) we'll have to absolutely find a proper way to back up data from Pods using volumes with these access mode(s).
EphemeralContainers come to my mind for that. See also Ephemeral Containers.
If this has to be implemented, an idea might be to switch to Ephemeral Containers as the default way to do backups and restores. It may reduce complexity, as there would be just one way to do backups and restores (except for the stdout backups), instead of the alternative of running Ephemeral Containers for RWOP, Jobs for regular backups and restores, etc.
> the Pod affinity is fixed at the time of creation, so if something changes on the hosts (reboots, etc.) it may not match correctly anymore -> may be solvable via clever pod affinity and retries
> a lot more Pods may get created during backups (performance issues?)
How about deploying a DaemonSet to back up volumes? The DaemonSet would just mount the hostPath /var/lib/kubelet/pods/, so its pods can access the volumes directly. k8up jobs would then just send the operation command to the appropriate DaemonSet pod.
However, this approach would break the current design architecture. If it doesn't make sense, just forget it.
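A rough sketch of that DaemonSet idea, for illustration only (this is not how K8up works today; names, image and command are made up):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: backup-agent                 # hypothetical node-local backup agent
spec:
  selector:
    matchLabels:
      app: backup-agent
  template:
    metadata:
      labels:
        app: backup-agent
    spec:
      containers:
        - name: agent
          image: alpine:3            # illustrative; would run some backup agent
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: kubelet-pods
              mountPath: /host/pods
              readOnly: true
      volumes:
        - name: kubelet-pods
          hostPath:
            path: /var/lib/kubelet/pods   # gives the agent access to all volumes mounted on the node
            type: Directory
```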
I have no real knowledge about the k8up internals, but this is how I do it manually.
Launch an Alpine pod on the same node as the workload. The pod has a volume claim for the workload's volume, the backup is run, and the pod is deleted afterwards.
So k8up could launch a batch/v1 Job with the correct affinity rules, monitor the execution of that job, and retry if needed. This backup job could have some settings or even a template that you could tailor: resource limits etc.
Of course this might be a really different approach and need even more changes to the architecture.
@jpsn123 That's actually how Velero does its backups. It's a very simple solution, but I fear there might be some security concerns with this approach. Velero's design is more intended for single-tenant clusters, whereas k8up is intended for multi-tenant clusters where each tenant should only be able to back up their own PVCs.
One idea I had how we could solve this is similar to @Troyhy's: Create a job for each Pod that has a volume attached to it. Then we can use pod affinity to co-schedule them on the same node. We could then also set some concurrency settings so that X such backup jobs are running in parallel. That could even speed up the whole process a bit. If there are performance concerns on a specific cluster the setting can be set to only run one pod at a time so the performance would be the same as now. The drawback is that only running pods will be backed up. But with the pre-backup pod templates it's possible to simply spin up pods that mount any PVCs.
Another idea I had: using CSI snapshots and then mount the snapshots into the k8up backup pods. The problem is that this requires a CSI provider that supports snapshots.
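As a rough sketch, assuming a snapshot-capable CSI driver and the external-snapshotter CRDs are installed (names are illustrative), such a snapshot request could look like:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-pvc-snap                                # illustrative name
  namespace: default
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass   # a snapshot-capable class
  source:
    persistentVolumeClaimName: data-pvc              # the RWO PVC to snapshot
```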
Another way of doing things is to have a sidecar container attached to each pod one wants to back up, which is always running and just listening for some signal. The k8up Job could then simply signal each running sidecar to perform backups. This would also work for the new ReadWriteOncePod volume mode, since the sidecar would be in the same pod.
@Kidswiss CSI is not recommended here. As a Swiss army knife, K8up should minimize external dependencies. I actually use CSI snapshots as my primary backup and K8up as my secondary backup.
Considering that K8up is multi-tenant, this is a great design and what I want. I think the best solution is to use the Job you mentioned:
> Create a job for each Pod that has a volume attached to it.
@jpsn123 While I find the idea of backing up CSI snapshots very elegant, I agree with you. If we swap "we can only back up RWX volumes" for "we can only back up CSI volumes", it will probably lead to another ticket like this down the road.
@HubbeKing We decided against such an approach from the start. When we started with k8up there were other backup solutions that did it that way, which was actually one of the reasons we started k8up: patching other pods and deployments outside the scope of the backup system feels like the wrong approach to this problem.
Fair! I'm just personally not convinced that EphemeralContainers are the right choice, given their Alpha nature. They would be perfect for doing backups regardless of the volume type since they can attach to any pod, but since they remain Alpha you'd need to ensure that your Kubernetes distribution properly handles the EphemeralContainers Feature Gate.
And that could make deploying k8up more complicated - I have no idea if managed kubernetes solutions in the cloud even let you set Feature Gates, for instance. And given that they entered Alpha in 1.16 and are still in Alpha as of 1.22, it could be a while before the feature enters Beta and gets turned on by default.
EphemeralContainers will move to beta in 1.23, see KEP-277.
We should offer different ways to do RWO backups, so the user can choose the one that best matches their infrastructure.
EphemeralContainers - these only work for existing pods, but some applications may use cron jobs or their own flow to prepare a backup and leave the data in an unmounted PV, which then needs to be processed by k8up later.
> EphemeralContainers - these only work for existing pods, but some applications may use cron jobs or their own flow to prepare a backup and leave the data in an unmounted PV, which then needs to be processed by k8up later.
We have that covered in K8up with PreBackupPod (See https://k8up.io/k8up/1.2/how-tos/prebackuppod.html)
@HubbeKing That's why I brought up the "job per PVC" idea. As @tobru said we don't have to implement just one way to do it, but multiple.
The candidates I see right now to solve the problem:
- Job per PVC -> this can be used on any version of k8s and any storage backend
- EphemeralContainers -> great if supported, so could be considered once they are beta/GA
- CSI Snapshots -> could be an additional feature for people that have CSI providers capable of doing snapshots
- Velero style -> mounting `/var/lib/kubelet/pods/` and working through that (my least favorite though)
To get the most compatibility right now, I'd lean toward the "job per PVC" option first. It would also be the simplest to implement given the current state of k8up. With configurable parallelism this could also greatly improve the speed for cases where there are a ton of PVCs.
@tobru @cimnine @ccremer what do you think?
I'd start with a PoC for job per PVC. It would be interesting to see how it behaves in some failure scenarios (mainly node/pod reboots).
We looked into this in more detail and proposed a workflow for how RWO backups could be done. It is detailed in the issue description under the acceptance criteria.
In short: a PVC backup with RWO is done by running the backup in a pod that runs on the same node. To do this, we determine the node with the following precedence:
- For a PVC, figure out if there's a Pod using this PVC, then use the same hostname.
- If the PVC is not mounted, see if the PV has a NodeAffinity rule configured and use that (relevant for local volumes, hostPaths and other storage backends).
- If the PVC has an explicit annotation with a node selector, use that.
- Otherwise fail/skip the backup. In that case we can't reliably determine on which node the volume is mountable, and we'd rather fail the backup than back up wrong/empty data.
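For the annotation case, a minimal sketch (assuming the proposed `k8up.io/hostname` annotation is set to `worker-2`) would be to pin the backup Pod with a plain nodeSelector:

```yaml
# Fragment of the backup Pod spec the operator could generate (sketch only).
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-2   # value taken from the PVC's k8up.io/hostname annotation
```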
If you have any feedback, please let us know.
@ccremer
Maybe one note for the proposal:
> a lot more Pods may get created during backups (performance issues?)
We really should limit the parallel backups somehow. Restic can use quite a bit of CPU and memory. It also generates quite some IO load on the PV backend if all run at the same time.
Good point on parallelism, though I'm not sure about the impact. Is this concern more about having multiple Pods per restic repository, or did you really mean per PV? Should we limit RWO backups to 1 pod per PVC? Or 1 pod per namespace...? Should the other backups just skip or wait in line?
@ccremer the repository should not be the issue. Except if it's some selfhosted minio or so. Restic allows parallel backups to the same repo. My concern is the performance impact on the nodes and the storage.
Okay, let me elaborate a bit more on how this could impact a cluster:
Let's say we have 3 nodes. Also, let's assume the cluster is not very well balanced and one node has many more pods with mounted PVCs than the others:
- Node 1: 5 pods with PVCs
- Node 2: 2 pods with PVCs
- Node 3: 1 pod with PVCs
Now if we schedule all the backup jobs at the same time, one for each PVC, then node 1 gets hammered with 5 pods running restic, while the other nodes have a massively lower load. Depending on the amount of files to back up, this can easily require a pretty huge chunk of the node's CPU and memory. From the storage side (Rook, for example), the backend gets hammered with 8 pods producing quite an amount of IO in a short time, potentially further harming the performance of the whole cluster.
To mitigate this, we only run 1 backup pod per namespace at a time. Each pod would be responsible for backing up exactly 1 volume. Once it's done, the next pod starts, until all PVCs have been backed up. This way the performance impact should basically be the same as today.
This has some nice pros, like we could make the parallelism adjustable and allow x pods per namespace for larger clusters. It could also be possible to handle some issues with the affinity. For example if the affinity becomes invalid between the time the operator sets it and the pod actually runs, we could reschedule the stuck pods.
What do you think?
I see. Thanks for the scenario.
I think this is primarily a problem because we decided on an opt-out, target-all-PVCs design years ago. If a schedule only targeted a specific PVC/pod as defined in some spec, then each backup schedule could avoid this spike by simply having slightly different schedules.
Question is, do we want to do something against this? Are we willing to introduce more magic and special behavior for performance concerns that may potentially depend on certain storage backends anyway?
Or is this something we can accept without introducing throttling/locking and revisit this topic when it actually is a problem? I'm not sure if adding more config options to adjust parallelism simplifies K8up administration. Rescheduling Pods after changed node topology doesn't sound like a problem we really want to deal with, it only makes K8up harder to understand and maintain.
What are the chances that this node imbalance is a problem where these pods are ALSO in the same schedule? I think this problem could be mitigated by documenting the behaviour, some caveats and using randomized schedules.
@ccremer, as a quick solution it may be enough to use pod anti-affinity.
@ccremer Sure we could start with the easier approach and get some actual real life experience on the impact.
Would it make sense to focus this on CSI providers that implement the snapshot feature (e.g. rook-ceph) with the help of external-snapshotter?
It would be neat to see the following take place:
- Cluster admin deploys external-snapshotter and creates a `VolumeSnapshotClass` that has the annotation `k8up.io/is-snapshot-class: "true"`:
```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
  annotations:
    k8up.io/is-snapshot-class: "true"
driver: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
deletionPolicy: Delete
```
- Cluster admin labels a PVC with `k8up.io/backup-volume: "enabled"`:
```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awesome-config-v1
  namespace: default
  labels:
    k8up.io/backup-volume: "enabled"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-block
```
- k8up creates a CSI snapshot based on some configured schedule.
- When the snapshot completes, k8up turns it into a temporary PV/PVC, then creates a job (with a container for restic), mounts the temporary PVC into it and runs `restic copy` to durable storage.
- When that job is done, k8up removes the temporary PVC.
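The temporary PVC in that step could be provisioned from the completed snapshot via a `dataSource`, roughly like this (names are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awesome-config-v1-restic-tmp    # temporary claim, illustrative name
  namespace: default
spec:
  storageClassName: ceph-block
  dataSource:
    name: awesome-config-v1-snap        # the completed VolumeSnapshot (assumption)
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```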
This is a pretty high level view of what could be done and there's a lot left out but I think there is a pretty good sweet spot to support CSI providers that have snapshotting functions.
Hi @onedr0p, thanks for your suggestion. Our idea is to support RWO backups that aren't exclusive to volumes provisioned by a CSI provider. We might later optimize backups in certain conditions, like CSI snapshots, but I don't want to make promises here. The first goal is to support RWO on generic PVs. There are a lot of people that aren't using CSI-provisioned volumes, our clusters/customers included.
> Given a PVC with RWO access mode in Bound phase
> And no application Pod has mounted that PVC
> And the linked PV has a NodeAffinity rule defined
> When a backup runs
> Then the backup Pod runs on the same node as configured via NodeAffinity spec.
> And the files are backed up.
If the specified node has taints defined, we'd need to fetch the node in order to retrieve the necessary tolerations. If a pod is bound we can fetch them from the pod, but with a node we'd need permissions to fetch nodes. Not sure if this is desired?
@mweibel not sure why taints are relevant. PVs can have a node affinity, see https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity. I believe that Kubernetes automatically schedules the pod on that node (tainted or not -> eventual consistency). It's possible that this requirement is a no-brainer.
@mweibel I agree with @ccremer's assessment. For unmounted PVCs the pods should be scheduled on the right node without any logic from our side.
However, we might need to treat unmounted volumes separately from the others, otherwise it could mess up how we decided to schedule the pods.
> @mweibel not sure why taints are relevant. PVs can have a node affinity, see https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity. I believe that Kubernetes automatically schedules the pod on that node (tainted or not -> eventual consistency). It's possible that this requirement is a no-brainer.
Not sure if I wasn't clear enough, let me rephrase:
- PV has nodeAffinity set to a node which is tainted
- PVC bound to that PV
- any pod that wants to use that PVC/PV combo needs to have a toleration for the taints on the node
So, given the above acceptance criteria: to create a pod which is able to back up the PVC, the pod needs to tolerate the taints of the node.
Does that make it more clear?
Example: Taint the target node:

```console
$ kubectl taint node k8up-v1.24.4-worker2 test=bar:NoSchedule
```
Apply PV/PVC:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0003
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: manual
  hostPath:
    path: /tmp
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - k8up-v1.24.4-worker2
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    # So it works in KIND
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: manual
---
```
You can't use this PVC without tolerating the taint `test=bar:NoSchedule`.
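For completeness, a backup Pod that should use this PVC would need a matching toleration, e.g.:

```yaml
# Fragment of the backup Pod spec (sketch only), tolerating the taint from the example above.
spec:
  tolerations:
    - key: test
      operator: Equal
      value: bar
      effect: NoSchedule
```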
FTR, @Kidswiss and I discussed this and decided not to implement this particular case. I created a follow-up task #805 with the necessary details.
Ah, now I get it. Normally nodes are tainted temporarily, e.g. due to disk or memory pressure, so I kind of assumed that it would eventually be resolved. However, there can be nodes that are purposefully tainted for some other reason.
Probably best tackled in another story though as you wrote :+1: