volsync
VolumePopulator scheduling issue - CSI democratic-csi local-hostpath using the volsync volumepopulator.
Taken from conversation in https://github.com/backube/volsync/issues/1019.
If I'm understanding this correctly, this is about a storage driver that creates volume snapshots that are not portable across nodes.
When the volumepopulator is used to create a PVC from a snapshot and the volume binding mode is WaitForFirstConsumer, the PVC may get assigned to a node that is not compatible with the volumesnapshot, and restoring the snapshot fails.
> > 👋 this issue also happens with CSI democratic-csi local-hostpath using the volsync volumepopulator.
democratic-csi/democratic-csi#329 seems to be a time-based race condition.
@danielsand I don't think this issue was specifically about the volumepopulator - would you be able to explain the scenario where you're hitting the issue?
The linked issue wasn't about the volumepopulator; democratic-csi local-hostpath + volume snapshots + volsync didn't work for some folks.
I just referenced it to show what is currently running on my end and what is working. (CSI and volume snapshots work as they should.)
Volumepopulator is failing at random currently on my setup. The wrong node gets picked by the volume populator and WaitForFirstConsumer is specified.
Will circle back when I push the topic again.
Originally posted by @danielsand in https://github.com/backube/volsync/issues/1019#issuecomment-2102539683
@danielsand please update if I've misunderstood, but I tried to put a summary above.
I'm guessing your issue is tied to the volumepopulator and that you have a volumesnapshot that is not portable across nodes.
What happens with the volumepopulator when you have a storageclass that uses WaitForFirstConsumer is that the volumepopulator will not do anything until the PVC gets assigned to a node. The volumepopulator itself doesn't get involved with node assignment.
Essentially, when a consumer wants to use the PVC, it will get scheduled to a node, and at that point the volume populator will try to provision a temp PVC with the snapshot contents on the same node that has been assigned to your original volumepopulator PVC. I think this can be an issue if the scheduler has chosen a node that is not compatible with your volume snapshot.
You may be able to work around this by using a storageclass with Immediate for your volumepopulator PVC, at least as a test. This way the volumepopulator would immediately attempt to provision a PVC from the volumesnapshot - at that point the storage driver may set the node assignment since the volumesnapshot requires a specific node.
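A minimal sketch of what that test setup might look like, assuming a democratic-csi local-hostpath driver and an existing VolSync ReplicationDestination. All object names, the storage size, and the provisioner string are placeholders or assumptions that do not come from this thread; check the provisioner against the driver name your deployment actually registers.

```yaml
# Sketch only: names, size, and provisioner are assumptions, not values from this issue.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-hostpath-immediate                  # hypothetical name
provisioner: org.democratic-csi.local-hostpath    # assumed driver name; verify in your cluster
volumeBindingMode: Immediate                      # provision at PVC creation, not at first consumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data                             # hypothetical name
spec:
  storageClassName: local-hostpath-immediate
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                                # placeholder size
  dataSourceRef:                                  # hands population over to the VolSync volume populator
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: my-replication-destination              # hypothetical name
```

With Immediate binding the scheduler has not picked a node when provisioning starts, so whether the volume ends up on the node that holds the snapshot depends on the storage driver, which is why this is suggested only as a test.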
@tesshuflower kudos for pushing this and please assign the ticket to me. I'm working on the topic again this week and will try your proposal. Will try to provide more solid input after some trial && error.
@tesshuflower spent 2 days on the issue.
The state I found in February (?), where it sometimes worked and sometimes didn't, is not reproducible anymore. In fact, currently the snapshots do not get created because the volumes do not get created and stay in a pending state, without much in the way of logs or errors on any of the involved CSI components.
(each of them is pointing to the next one in logs... with no visible error)
As of yesterday csi-snapshotter 8.0.0 has been released (again with breaking changes and a rework of internal hooks...)
The validating logic for VolumeSnapshots, VolumeSnapshotContents, VolumeGroupSnapshots, and VolumeGroupSnapshotContents has been replaced by CEL validation rules. The validating webhook is now only being used for VolumeSnapshotClasses and VolumeGroupSnapshotClasses to ensure that there's at most one class per CSI Driver. The validation webhook is deprecated and will be removed in the next release. (https://github.com/kubernetes-csi/external-snapshotter/pull/1091, @leonardoce)
I will stop right here and wait for what comes next... the haystack is just too big.
kudos to you @tesshuflower for your commitment and effort.
From my side the ticket is obsolete and can be closed.
cheers
thanks @danielsand and thanks for the info about the external-snapshotter, I do need to look into updating volsync tests to use this latest release. Will close this issue for now, but please re-open if you encounter this again going forward.