kubernetes-zfs-provisioner

Send/recv datasets between nodes

Open · infogulch opened this issue 2 years ago · 3 comments

Hi! You have a pretty neat project here.

Have you considered using zfs send/recv to migrate datasets between nodes to work around datasets being locked to a particular node?

I'm imagining a few potential designs:

  1. When a dataset is on one node and the pod that needs it is scheduled on a different node, zfs send it to the new node and drop it from the current node (see the sketch after this list).
  2. A main backup/restore pool that maintains the canonical copies of datasets
    • When a pod is scheduled on a node, the node acquires a lease on any datasets it needs and recv's the data from the main pool.
    • The pod can start once the dataset is received.
    • Data is saved directly to the node, and regular snapshots are streamed from the node back to the main pool.
      • If the node dies or is network partitioned, you can manually break the lease, losing any data since the last snapshot. With frequent snapshots this is still a pretty good outcome considering a node with the only copy of the leased dataset just vanished.
    • When the pod is unscheduled and the containers using the dataset stop, it takes a final snapshot, sends it to the main pool and releases the lease.
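
To make scenario 1 a bit more concrete, here is a rough sketch in Go (the language the provisioner is written in). Everything in it is hypothetical: `migrateDataset`, the host names, and the `@migrate` snapshot name are made up, and it assumes the provisioner can SSH into the source node and that the source node can in turn reach the target node over SSH.

```go
// Hypothetical sketch of scenario 1: the provisioner connects to the source
// node over SSH (as it already does for provisioning) and pipes a full
// zfs send directly into a zfs recv on the target node.
package main

import (
	"fmt"
	"os/exec"
)

func migrateDataset(sourceHost, targetHost, dataset string) error {
	snapshot := dataset + "@migrate"

	// Executed on the source node; assumes the source node can reach the
	// target node via SSH with key-based authentication.
	pipeline := fmt.Sprintf(
		"zfs snapshot %[1]s && zfs send -R %[1]s | ssh %[2]s zfs recv -F %[3]s",
		snapshot, targetHost, dataset,
	)

	if out, err := exec.Command("ssh", sourceHost, pipeline).CombinedOutput(); err != nil {
		return fmt.Errorf("send/recv failed: %w: %s", err, out)
	}

	// Only after a successful transfer is the dataset dropped on the old node.
	if out, err := exec.Command("ssh", sourceHost, "zfs destroy -r "+dataset).CombinedOutput(); err != nil {
		return fmt.Errorf("cleanup on source failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	if err := migrateDataset("node-a", "node-b", "tank/pv-1234"); err != nil {
		fmt.Println(err)
	}
}
```
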

Thoughts?

infogulch avatar Mar 31 '22 17:03 infogulch

Hi @infogulch, I'm so sorry for not responding until now. I must have missed a notification; I only saw this today for the first time...

No, I hadn't considered moving datasets between nodes yet, though I can imagine that this would be a fairly complicated feature to implement.

  1. Depending on the dataset size, this can lead to very long startup times for pods, since the data has to be moved first. It would also be a slow process: the nodes don't know each other, as the provisioner uses SSH to connect to individual nodes. We would have to engineer something so that the nodes can talk to each other directly for better throughput.
  2. This seems like a rather complicated and error-prone approach. Especially for stateful applications, storage is more important than the network, and I don't want people to blame me or the provisioner if they lose important data written since the last snapshot was sent.

Generally, this provisioner isn't designed for large-scale use cases. In fact, I only use it privately in my two-node home lab, so it's rather limited in features, but it's sufficient for my needs. Other projects like https://github.com/openebs/zfs-localpv offer more advanced features. Apparently this is also requested there: https://github.com/openebs/zfs-localpv/issues/291 and may be implemented at some point.

Regardless, why do you need to move data between nodes? Surely you're aware that ZFS isn't designed for clustered systems, and there may be better and more battle-tested alternatives for clustered storage, e.g. Rook, which uses Ceph.

ccremer avatar May 20 '22 21:05 ccremer

There are lots of scenarios where you might really need to migrate pods to different nodes: a node coming down for maintenance, for example, or adding a third node to your lab and wanting to rebalance pods so that some of them run on the new node. Even if it took 20 minutes to transfer the data to the new node, at least it would be possible. That's all I'm aiming at with scenario 1: converting impossible to possible, even if potentially slow.

Full-mesh network topologies are commonplace, so I think it's reasonable to expect direct SSH connections between nodes to be feasible, provided the ports are open. There is definitely some work to be done to manage SSH credentials for node-to-node connections.

I agree that scenario 2 is a bit underdeveloped. Perhaps I'll flesh it out in the future, but we can ignore it for now.

I wasn't aware of openebs/zfs-localpv, thanks for the pointer. It's interesting that the same feature is requested there as well.

infogulch avatar May 27 '22 06:05 infogulch

I understand the reasons why people want to move pods to different nodes. We are used to the self-healing capabilities of Kubernetes, after all :)

On a small scale, moving pods is already possible with this provisioner when doing it manually. There are projects out there that help you migrate local storage between nodes, for example https://github.com/utkuozdemir/pv-migrate, which is apparently backed by rsync. With that you could provision a new PVC on the target node and copy the data over (obviously without ZFS snapshots...). We could also think about integrating existing solutions rather than engineering something entirely new. Also, we'd need a MoveDatasetPolicy parameter or something similar that can prevent moving data, because in my case I wouldn't want to wait 2x20 minutes to move data to another node and back if a node restart takes 3 minutes ;)
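
Roughly, I'm thinking of something like the following sketch in Go. The parameter name `moveDatasetPolicy` and its values are purely illustrative and not part of the provisioner today:

```go
// Hypothetical sketch of how a moveDatasetPolicy parameter could be read from
// the StorageClass parameters before the provisioner ever attempts a
// migration. Names and values are made up for illustration.
package main

import "fmt"

const (
	// "never" keeps today's behaviour: the dataset stays on its node.
	MovePolicyNever = "never"
	// "on-demand" allows a send/recv migration when the pod lands elsewhere.
	MovePolicyOnDemand = "on-demand"
)

// shouldMigrate decides whether a dataset may be moved, based on the
// (hypothetical) moveDatasetPolicy StorageClass parameter.
func shouldMigrate(parameters map[string]string) bool {
	policy, ok := parameters["moveDatasetPolicy"]
	if !ok {
		return false // default to never moving data
	}
	return policy == MovePolicyOnDemand
}

func main() {
	params := map[string]string{"moveDatasetPolicy": "on-demand"}
	fmt.Println(shouldMigrate(params)) // true
}
```
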

I'm not against the feature in general; it's just that I don't have the time to work on something this big. If anything, I think the first approach would be easier and safer to implement, but I'd rely on contributions here.

ccremer avatar May 27 '22 07:05 ccremer