
Create template from snapshot in parallel allocates the same NFS secondary storage on the agent

Open rp- opened this issue 1 year ago • 10 comments

ISSUE TYPE
  • Bug Report
COMPONENT NAME
Storage
CLOUDSTACK VERSION
4.19.0

But I assume 4.18 is also affected.

CONFIGURATION

I could reproduce this with Linstor primary storage and the config option lin.backup.snapshots disabled. I'm not sure yet which other storage combinations could be used to trigger this.

OS / ENVIRONMENT

Ubuntu 22.04, non-hyperconverged setup (2 storage nodes, 3 compute nodes with agents).

SUMMARY

If I create 2 templates from snapshots (from different snapshots of the same volume) at nearly the same time, and both copy commands are sent to the same agent, both copy commands use the same storage pool object, and the first one to finish will try to unmount the storage pool still in use by the second copy command. In the worst case, the first copy command finishes before the second has even started copying, so the second copy command writes into the local mount directory instead.
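
For illustration, here is a minimal sketch of the racy lifecycle in Java, with hypothetical names (getOrCreatePool, deletePool, copyCommand); the actual agent code paths differ:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simplification of the race described above; not actual CloudStack code.
// Two agent request handlers run copyCommand() concurrently for the same pool UUID.
class RacyPoolLifecycle {
    private final Map<String, Object> pools = new ConcurrentHashMap<>();

    Object getOrCreatePool(String uuid) {
        // Both handlers end up sharing the same pool object / mount.
        return pools.computeIfAbsent(uuid, u -> new Object());
    }

    void deletePool(String uuid) {
        // Unconditional removal: no check whether another copy still uses the mount.
        pools.remove(uuid);
    }

    void copyCommand(String uuid, Runnable copy) {
        getOrCreatePool(uuid);
        copy.run();        // long-running template copy
        deletePool(uuid);  // the first handler to finish removes the pool, pulling
                           // the mount out from under the slower, still-running copy
    }
}
```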

STEPS TO REPRODUCE
* Create 2 snapshots from the same volume.
* Create a template from the first snapshot.
* Create a template from the second snapshot (almost immediately after the first).

If the copy commands run on the same agent, you will see the secondary storage pool created twice, e.g.:
2024-04-10 12:03:13,081 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:1ec7c8ad) Attempting to create storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 (NetworkFilesystem) in libvirt
2024-04-10 12:03:13,089 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:1ec7c8ad) Attempting to create storage pool e9e15545-fceb-3d76-be34-37dfaf5af384
2024-04-10 12:03:23,319 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:1fab4dd4) Attempting to create storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 (NetworkFilesystem) in libvirt
2024-04-10 12:03:43,257 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:1ec7c8ad) Attempting to remove storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 from libvirt
2024-04-10 12:03:48,862 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:1fab4dd4) Attempting to remove storage pool e9e15545-fceb-3d76-be34-37dfaf5af384 from libvirt

Attached are the management and agent logs: management.log, agent.log

EXPECTED RESULTS
Either run the copy commands in sequence, or use a ref-counted storage pool that doesn't remove itself
while still in use.

ACTUAL RESULTS
Two storage pool objects are created, pointing to the same mount point.
The first one to finish unmounts the mount point out from under the other.

rp- avatar Apr 10 '24 12:04 rp-

@rp- , I used the scenario and a script:

cmk create template name=1st ostypeid=a3bf2482-06d8-11ef-881c-1e0048000312 snapshotid=639c5966-2c93-4c05-91c4-38922f84d981 &
cmk create template name=2nd ostypeid=a3bf2482-06d8-11ef-881c-1e0048000312 snapshotid=09f04474-c261-4ad2-9148-482b5ccc2c45 &

but didn't reproduce the issue. I think it is credible, though. By the looks of your description it shouldn't even be storage specific, but maybe there is a timing constraint that keeps me from reproducing it.

Do you have a fix/idea/design to deal with this?

DaanHoogland avatar May 01 '24 12:05 DaanHoogland

@DaanHoogland I think it is somewhat storage related, because of the storage motion strategy CloudStack takes. In this specific Linstor case, CloudStack first has to copy the snapshot to secondary storage as a template (as the snapshot in this case exists only on Linstor storage). At first I thought this was the only storage motion strategy that could trigger it, but apparently a customer reported that they still have problems with NFS mounts not being correctly mounted/unmounted. So either other storage motions are also affected, or the customer setup has some extra flaw which I can't figure out.

Well, my idea would be the two points from the "Expected results": either it is done in sequence, OR the storage pools are ref-counted.

rp- avatar May 02 '24 08:05 rp-

Well, my idea would be the two points from the "Expected results": either it is done in sequence, OR the storage pools are ref-counted.

@rp- ,

  • I do not like the in-sequence idea: as this is storage being copied over the network, it will no doubt be a performance hog.
  • Keeping a mount-count could work, meaning that the SSVM would have to keep a tally of how many requests are using a certain mount, and on umount just return success without action when the count is not zero, only really unmounting once it is. One catch is that the mount functionality must be centralised and may not be called by commands directly, but only requested. If we guarantee this, it will work as you describe. €0,02
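
A minimal sketch of such a centralised, ref-counted mount registry, assuming hypothetical doMount/doUmount helpers that wrap the real mount calls (illustrative only, not the actual proposal):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a centralised, ref-counted mount registry (illustrative only).
// Commands never mount/umount directly; they acquire() and release() instead.
class MountRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();

    synchronized void acquire(String mountPoint) {
        int count = refCounts.merge(mountPoint, 1, Integer::sum);
        if (count == 1) {
            doMount(mountPoint);   // the first user actually mounts
        }
    }

    synchronized void release(String mountPoint) {
        Integer count = refCounts.computeIfPresent(mountPoint, (m, c) -> c - 1);
        if (count != null && count == 0) {
            refCounts.remove(mountPoint);
            doUmount(mountPoint);  // only the last user actually unmounts;
        }                          // earlier releases return without action
    }

    private void doMount(String mountPoint) { /* e.g. shell out to mount(8) */ }
    private void doUmount(String mountPoint) { /* e.g. shell out to umount(8) */ }
}
```

With such a registry, the umount request of the first finished copy command becomes a no-op until the second copy command has also released the mount.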

DaanHoogland avatar May 02 '24 10:05 DaanHoogland

Hi @DaanHoogland @rp- ,

Just curious, does this issue happen with Ceph users? I haven't heard of any so far.

So if it's working with Ceph, does that mean Linstor is using a different method?

btzq avatar May 17 '24 01:05 btzq

@rp- I am moving this to 4.19.2 (or later) as we do not have a way forward yet.

DaanHoogland avatar Jun 18 '24 09:06 DaanHoogland

@DaanHoogland Here would be some way to fix this, but no feedback yet if it helps: https://github.com/LINBIT/cloudstack/commit/e53ea85cc882bf758c7fdcc839e0b848c27e570e

rp- avatar Jun 20 '24 06:06 rp-

@DaanHoogland Here would be some way to fix this, but no feedback yet if it helps: LINBIT@e53ea85

@rp- , I imagine you propose to move that logic into a more generic place, like LibvirtStorageAdaptor or LibvirtStoragePool?

DaanHoogland avatar Jun 20 '24 07:06 DaanHoogland

@DaanHoogland Here would be some way to fix this, but no feedback yet if it helps: LINBIT@e53ea85

@rp- , I imagine you propose to move that logic into a more generic place, like LibvirtStorageAdaptor or LibvirtStoragePool?

Yes, wherever it would fit best.

rp- avatar Jul 01 '24 09:07 rp-

@rp- , I would not know off the top of my head. The more I think about it, the more locations seem appropriate :exploding_head: CloudStackImageStoreDriverImpl could also be a place to put this. @harikrishna-patnala @slavkap any ideas?

DaanHoogland avatar Jul 01 '24 10:07 DaanHoogland

@DaanHoogland, I think the current suggestion (the LibvirtStorageAdaptor) from @rp- is the right place. I haven't faced the issue myself, but it seems valid for all storage pools that need to make copies on secondary storage at some point. I didn't do any research, but the fix from Rene will probably handle some corner cases with unmounting NFS as primary storage as well.
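
For illustration, a hypothetical way such ref-counting could wrap the adaptor's pool create/delete path, reusing the MountRegistry sketched earlier (the names and signatures here are made up; the actual LibvirtStorageAdaptor methods differ):

```java
// Hypothetical wiring only; not the actual LibvirtStorageAdaptor code.
class RefCountedNfsAdaptor {
    private final MountRegistry registry = new MountRegistry();

    void createStoragePool(String uuid, String mountPoint) {
        registry.acquire(mountPoint);  // reuses the mount if another copy holds it
        // ... define/start the libvirt pool only on the first acquisition ...
    }

    void deleteStoragePool(String uuid, String mountPoint) {
        // ... destroy/undefine the libvirt pool only when no copy uses it ...
        registry.release(mountPoint);  // the real umount happens only at refcount zero
    }
}
```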

slavkap avatar Jul 01 '24 12:07 slavkap