Creating templates from snapshots in parallel allocates the same NFS secondary storage on the agent
ISSUE TYPE
- Bug Report
COMPONENT NAME
Storage
CLOUDSTACK VERSION
4.19.0
I assume 4.18 is also affected.
CONFIGURATION
I could reproduce this with Linstor primary storage and the config option lin.backup.snapshots disabled.
I'm not sure yet which other storage combinations could be used to trigger this.
OS / ENVIRONMENT
Ubuntu 22.04, non-hyperconverged setup (2 storage nodes, 3 compute nodes with agents).
SUMMARY
If I create 2 templates from snapshots (from two different snapshots) at nearly the same time and both copy commands are sent to the same agent, both copy commands use the same storage pool object, and the first one to finish tries to unmount the storage pool that the second copy command is still using. In the worst case the first copy command finishes before the second has even started copying, so the second copy command ends up writing into the local mount directory.
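For illustration only, here is a minimal Java sketch of the pattern described above; the class and method names are made up, and this is not the actual LibvirtStorageAdaptor code. Two handler threads share one pool entry keyed by the same UUID, and whichever finishes first tears the mount down while the other may still be (or not yet be) using it.

```java
// Hypothetical sketch of the race (made-up names, not CloudStack code):
// both copy commands resolve the same secondary-storage UUID to one shared
// pool entry, and deletePool() unmounts with no idea who else is using it.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SharedPoolRace {
    // one entry per secondary-storage URI, shared by all handler threads
    static final Map<String, String> mountedPools = new ConcurrentHashMap<>();

    static String createPool(String uuid) {
        // both handlers compute the same UUID and reuse the same mount path
        return mountedPools.computeIfAbsent(uuid, u -> "/mnt/" + u);
    }

    static void deletePool(String uuid) {
        // no usage tracking: the first finisher "unmounts" for everyone
        mountedPools.remove(uuid);
    }

    public static void main(String[] args) throws InterruptedException {
        String uuid = "e9e15545-fceb-3d76-be34-37dfaf5af384";
        Thread copy1 = new Thread(() -> { createPool(uuid); deletePool(uuid); });
        Thread copy2 = new Thread(() -> {
            String path = createPool(uuid);
            // if copy1 already ran deletePool(), this write would land in the
            // bare local directory instead of on the NFS mount
            System.out.println("copying template into " + path);
        });
        copy1.start(); copy2.start();
        copy1.join(); copy2.join();
    }
}
```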
STEPS TO REPRODUCE
* Create 2 snapshots from the same volume.
* Create a template from the first snapshot
* Create a template from the second snapshot (almost immediately after the first)
If the copy commands run on the same agent, you will see the secondary storage pool being created twice, e.g.:
2024-04-10 12:03:13,081 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:1ec7c8ad) Attempting to create storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 (NetworkFilesystem) in libvirt
2024-04-10 12:03:13,089 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:1ec7c8ad) Attempting to create storage pool e9e15545-fceb-3d76-be34-37dfaf5af384
2024-04-10 12:03:23,319 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:1fab4dd4) Attempting to create storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 (NetworkFilesystem) in libvirt
2024-04-10 12:03:43,257 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:1ec7c8ad) Attempting to remove storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 from libvirt
2024-04-10 12:03:48,862 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:1fab4dd4) Attempting to remove storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 from libvirt
Attached are the management and agent logs: management.log, agent.log
EXPECTED RESULTS
Either run the copy commands in sequence, or use a ref-counted storage pool that does not remove itself while still in use.
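A rough sketch of the first option ("run in sequence"), with made-up names and no claim that this is how CloudStack would implement it: copy commands targeting the same secondary-storage pool take a per-UUID lock, so the second copy only starts after the first has finished and unmounted.

```java
// Hypothetical sketch of the "run in sequence" option (assumed names):
// copy commands that target the same secondary-storage pool are serialized
// with a per-UUID lock, so mount/copy/unmount never overlap for one pool.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class PerPoolSerializer {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    void runCopy(String poolUuid, Runnable copyCommand) {
        ReentrantLock lock = locks.computeIfAbsent(poolUuid, u -> new ReentrantLock());
        lock.lock();
        try {
            copyCommand.run();   // mount, copy, unmount run exclusively per pool
        } finally {
            lock.unlock();
        }
    }
}
```

The obvious downside, raised later in the thread, is that unrelated copies to the same secondary store would then wait on each other.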
ACTUAL RESULTS
2 storage pool objects are created, pointing to the same mount point.
The first one to finish unmounts the mount point out from under the other.
@rp- , I tried that scenario with a script:
cmk create template name=1st ostypeid=a3bf2482-06d8-11ef-881c-1e0048000312 snapshotid=639c5966-2c93-4c05-91c4-38922f84d981 &
cmk create template name=2nd ostypeid=a3bf2482-06d8-11ef-881c-1e0048000312 snapshotid=09f04474-c261-4ad2-9148-482b5ccc2c45 &
but didn't reproduce the issue. I think it is credible though. By the looks of your description it shouldn't even be storage specific, but maybe there is a timing constraint that prevents me from reproducing it.
Do you have a fix/idea/design to deal with this?
@DaanHoogland I think it is somewhat storage related, because of the storage motion strategy CloudStack takes. In this specific Linstor case, CloudStack first has to copy the snapshot to secondary storage as a template (since the snapshot here exists only on Linstor storage). At first I thought this was the only storage motion strategy that triggers the issue, but a customer reported that they still have problems with NFS mounts not being mounted/unmounted correctly. So either other storage motions are also affected, or the customer's setup has some extra flaw that I can't figure out.
Well, my idea would be the two points from the "Expected results": either it is done in sequence, or the storage pools are ref-counted.
@rp- ,
- I do not like the in-sequence idea: since this is storage being copied over the network, it will no doubt be a performance hog.
- keeping a mount count could work, meaning that the SSVM would have to keep a tally of how many requests are using a certain mount, and on umount just return success without action while the count is not zero, only really unmounting once it reaches zero. One catch is that the mount functionality must be centralised and may not be called by commands directly, but only requested. If we guarantee this, it will work as you describe (a minimal sketch of the idea follows below). €0,02
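A minimal sketch of that mount-count idea, assuming a single central manager per agent/SSVM; the names are invented and this is not the actual fix.

```java
// Hypothetical sketch of the mount-count idea (invented names): a single
// central manager keeps a tally per mount point; umount requests while the
// count is non-zero just succeed without action, and the real umount only
// happens when the last user releases the mount.
import java.util.HashMap;
import java.util.Map;

class MountRefCounter {
    private final Map<String, Integer> refCounts = new HashMap<>();

    synchronized String acquire(String uuid) {
        refCounts.merge(uuid, 1, Integer::sum);
        // the real NFS mount would only be performed when the count goes 0 -> 1
        return "/mnt/" + uuid;
    }

    synchronized void release(String uuid) {
        Integer count = refCounts.get(uuid);
        if (count == null) {
            return;                          // nothing mounted for this uuid
        }
        if (count > 1) {
            refCounts.put(uuid, count - 1);  // still in use: succeed, no umount
        } else {
            refCounts.remove(uuid);
            // only now would the real umount be issued
        }
    }
}
```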
Hi @DaanHoogland @rp- ,
Just curious, does this issue happen for Ceph users? I haven't heard of any so far.
So if it works with Ceph, does that mean Linstor is using a different method?
@rp- I am moving this to 4.19.2 (or later) as we do not have a way forward yet.
@DaanHoogland Here is a possible way to fix this, but there is no feedback yet on whether it helps: https://github.com/LINBIT/cloudstack/commit/e53ea85cc882bf758c7fdcc839e0b848c27e570e
@rp- , I imagine you propose to move that logic into a more generic place like LibvirtStorageAdaptor or LibvirtStoragePool ?
Yes, wherever it would fit best.
@rp- , I would not know off the top of my head. The more I think about it, the more locations seem appropriate :exploding_head: CloudStackImageStoreDriverImpl could also be a location to put this.
@harikrishna-patnala @slavkap any ideas?
@DaanHoogland, I think the current suggestion from @rp- (the LibvirtStorageAdaptor) is the right place. I haven't faced the issue myself, but it seems valid for all storage pools that need to make copies to secondary storage at some point. I haven't done any research, but Rene's fix will probably also handle some corner cases with unmounting NFS used as primary storage.