cloudstack icon indicating copy to clipboard operation
cloudstack copied to clipboard

Re-deploy SystemVMs not working

Open UAnton opened this issue 1 year ago • 17 comments

ISSUE TYPE
  • Bug Report
COMPONENT NAME
SystemVM
CLOUDSTACK VERSION
14.19.0.1
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

If I remove System VMs, new VMs don't start

STEPS TO REPRODUCE
Trying to recreate System VMs and get error.

```
2024-07-01 12:41:14,293 WARN  [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-38:ctx-d0153ddf job-3/job-75 ctx-37b9fb37) (logid:1974eba6) Unable to orchestrate start VM instance {"id":10,"instanceName":"s-10-VM","type":"SecondaryStorageVm","uuid":"bb6ad9d5-0d11-4897-85d4-3434ac349362"} due to [Create volume from template (ID = 3) failed: Creation of resource definition 'cs-ead4c0e7-cfdf-4ea6-b289-95dcffcf3245' failed due to an unknown exception.].
com.cloud.utils.exception.CloudRuntimeException: Create volume from template (ID = 3) failed: Creation of resource definition 'cs-ead4c0e7-cfdf-4ea6-b289-95dcffcf3245' failed due to an unknown exception.
        at org.apache.cloudstack.storage.motion.StorageSystemDataMotionStrategy.handleCreateVolumeFromTemplateBothOnStorageSystem(StorageSystemDataMotionStrategy.java:1499)
        at org.apache.cloudstack.storage.motion.StorageSystemDataMotionStrategy.handleCopyAsyncForTemplateAndVolume(StorageSystemDataMotionStrategy.java:396)
        at org.apache.cloudstack.storage.motion.StorageSystemDataMotionStrategy.copyAsync(StorageSystemDataMotionStrategy.java:331)
        at org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:87)
        at org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:116)
        at org.apache.cloudstack.storage.volume.VolumeServiceImpl.createVolumeFromBaseImageAsync(VolumeServiceImpl.java:790)
        at org.apache.cloudstack.storage.volume.VolumeServiceImpl.createVolumeFromTemplateAsync(VolumeServiceImpl.java:1595)
        at org.apache.cloudstack.engine.orchestration.VolumeOrchestrator.recreateVolume(VolumeOrchestrator.java:1799)
        at org.apache.cloudstack.engine.orchestration.VolumeOrchestrator.prepare(VolumeOrchestrator.java:1923)
        at jdk.internal.reflect.GeneratedMethodAccessor328.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
        at com.sun.proxy.$Proxy227.prepare(Unknown Source)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1268)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:5457)
        at jdk.internal.reflect.GeneratedMethodAccessor318.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:107)
        at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5581)
        at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:102)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:654)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:602)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
```

EXPECTED RESULTS
New VMs are created and started
ACTUAL RESULTS
New VMs are created but without disks and don't start.

UAnton avatar Jul 01 '24 09:07 UAnton

@UAnton , what is your hypervisor type?? and can you find any errors in the host logs ?

DaanHoogland avatar Jul 01 '24 11:07 DaanHoogland

@DaanHoogland, I use KVM

Jul  1 11:35:28 node1 java[19561]: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: libvirt: LXC Driver error : Domain not found: No domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:) (logid:2c282009) Can not find a connection for Instance v-13-VM. Assuming the default connection.
Jul  1 11:35:28 node1 java[19561]: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: message repeated 4 times: [ libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM']
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-5:) (logid:2c282009) Groovy script '/etc/cloudstack/agent/hooks/libvirt-vm-state-change.groovy' is not available. Transformations will not be applied.
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-5:) (logid:2c282009) Groovy scripting engine is not initialized. Data transformation skipped.
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.950 [grizzly-http-server-46] INFO  LINSTOR/Controller - SYSTEM - REST/API RestClient(10.100.7.2; 'Swagger-Codegen/1.0.0/java')/CloneRscDfn
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.951 [grizzly-http-server-46] ERROR LINSTOR/Controller - SYSTEM - The creation of a new resource definition failed due to a name collision
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.962 [grizzly-http-server-47] INFO  LINSTOR/Controller - SYSTEM - REST/API RestClient(10.100.7.2; 'Swagger-Codegen/1.0.0/java')/DelRscDfn
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.962 [grizzly-http-server-47] WARN  LINSTOR/Controller - SYSTEM - Resource definition: cs-null not found.

UAnton avatar Jul 01 '24 11:07 UAnton

@DaanHoogland, I use KVM

Jul  1 11:35:28 node1 java[19561]: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: libvirt: LXC Driver error : Domain not found: No domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:) (logid:2c282009) Can not find a connection for Instance v-13-VM. Assuming the default connection.
Jul  1 11:35:28 node1 java[19561]: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: message repeated 4 times: [ libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM']
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-5:) (logid:2c282009) Groovy script '/etc/cloudstack/agent/hooks/libvirt-vm-state-change.groovy' is not available. Transformations will not be applied.
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-5:) (logid:2c282009) Groovy scripting engine is not initialized. Data transformation skipped.
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.950 [grizzly-http-server-46] INFO  LINSTOR/Controller - SYSTEM - REST/API RestClient(10.100.7.2; 'Swagger-Codegen/1.0.0/java')/CloneRscDfn
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.951 [grizzly-http-server-46] ERROR LINSTOR/Controller - SYSTEM - The creation of a new resource definition failed due to a name collision
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.962 [grizzly-http-server-47] INFO  LINSTOR/Controller - SYSTEM - REST/API RestClient(10.100.7.2; 'Swagger-Codegen/1.0.0/java')/DelRscDfn
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.962 [grizzly-http-server-47] WARN  LINSTOR/Controller - SYSTEM - Resource definition: cs-null not found.

@rp- , can you explain this log output?

DaanHoogland avatar Jul 01 '24 14:07 DaanHoogland

@DaanHoogland, I use KVM

Jul  1 11:35:28 node1 java[19561]: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: libvirt: LXC Driver error : Domain not found: No domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:) (logid:2c282009) Can not find a connection for Instance v-13-VM. Assuming the default connection.
Jul  1 11:35:28 node1 java[19561]: libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM'
Jul  1 11:35:28 node1 java[19561]: message repeated 4 times: [ libvirt: QEMU Driver error : Domain not found: no domain with matching name 'v-13-VM']
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-5:) (logid:2c282009) Groovy script '/etc/cloudstack/agent/hooks/libvirt-vm-state-change.groovy' is not available. Transformations will not be applied.
Jul  1 11:35:28 node1 java[19561]: WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-5:) (logid:2c282009) Groovy scripting engine is not initialized. Data transformation skipped.
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.950 [grizzly-http-server-46] INFO  LINSTOR/Controller - SYSTEM - REST/API RestClient(10.100.7.2; 'Swagger-Codegen/1.0.0/java')/CloneRscDfn
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.951 [grizzly-http-server-46] ERROR LINSTOR/Controller - SYSTEM - The creation of a new resource definition failed due to a name collision
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.962 [grizzly-http-server-47] INFO  LINSTOR/Controller - SYSTEM - REST/API RestClient(10.100.7.2; 'Swagger-Codegen/1.0.0/java')/DelRscDfn
Jul  1 11:35:40 node1 Controller[2227]: 11:35:40.962 [grizzly-http-server-47] WARN  LINSTOR/Controller - SYSTEM - Resource definition: cs-null not found.

@rp- , can you explain this log output?

"Funny" I just fixed the Jul 1 11:35:40 node1 Controller[2227]: 11:35:40.962 [grizzly-http-server-47] WARN LINSTOR/Controller - SYSTEM - Resource definition: cs-null not found. bug

This happens, while CloudStack tries to delete a volume that wasn't correctly created, but as it didn't get the volume.setPath() called, the deleteAsync tries to delete a resource with the name cs-null.

I'm right now testing the 3 loc fix and will open a PR soon.

The question that remains is, why was the first volume creation not successful.

A workaround meanwhile is to delete the Linstor resource that is trying to be created with the same name. The uuid should be somewhere next to the name collision errors:

linstor rd d cs-UUID

rp- avatar Jul 01 '24 15:07 rp-

@rp- root@node1:~# linstor rd l

╭────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                            ┊ Port ┊ ResourceGroup ┊ State ┊
╞════════════════════════════════════════════════════════════════════════╡
┊ cs-03ea9c1c-fdf7-41e3-9670-279b1086adbf ┊ 7005 ┊ vdiskrg       ┊ ok    ┊
┊ cs-9dcca1bc-f31d-4c24-ae2b-dba6d0e0d2ce ┊ 7009 ┊ vdiskrg       ┊ ok    ┊
┊ cs-94e79c39-01c7-418a-8a58-c38322b98903 ┊ 7004 ┊ vdiskrg       ┊ ok    ┊
┊ cs-589fab5f-25b1-48ad-8c68-990cbbecc6de ┊ 7002 ┊ vdiskrg       ┊ ok    ┊
┊ cs-bb428f3a-309e-11ef-8e9c-005056a8c932 ┊ 7000 ┊ vdiskrg       ┊ ok    ┊
┊ cs-c2ed74b7-371c-4d01-b934-10c464e35647 ┊ 7010 ┊ vdiskrg       ┊ ok    ┊
┊ cs-c937df7a-0dff-43e7-8635-a1e64eae6415 ┊ 7008 ┊ vdiskrg       ┊ ok    ┊
┊ cs-cd8eaf3e-f340-4c5e-a06b-163851723b98 ┊ 7001 ┊ vdiskrg       ┊ ok    ┊
┊ cs-d0f3747e-ab5f-4be6-939d-88214e70d228 ┊ 7006 ┊ vdiskrg       ┊ ok    ┊
┊ cs-dc078cfc-d07f-4993-a020-bbfcf7a40379 ┊ 7003 ┊ vdiskrg       ┊ ok    ┊
┊ cs-ead4c0e7-cfdf-4ea6-b289-95dcffcf3245 ┊ 7007 ┊ vdiskrg       ┊ ok    ┊

I try to delete but this doesn't help. New resources were created after deploying the new VM and got the same error.

UAnton avatar Jul 01 '24 15:07 UAnton

Then on the first attempt, there must be a different error than the name collision error. And we would need to fix this.

rp- avatar Jul 01 '24 15:07 rp-

I will delete all resources from storage and try to redeploy VMs.

UAnton avatar Jul 01 '24 15:07 UAnton

Have the SystemVM's worked already? or did they never boot correctly?

rp- avatar Jul 01 '24 15:07 rp-

@rp- They are not created and boot correctly. If I don't press Rebbot - any resources aren't created in Linstor. This is the log from agent.log after I deleted SystemVM. Cloudstack tries to deploy the new.

2024-07-01 15:57:14,051 INFO  [kvm.resource.LibvirtConnection] (agentRequest-Handler-1:null) (logid:1974eba6) No existing libvirtd connection found. Opening a new one
2024-07-01 15:57:14,053 WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-1:null) (logid:1974eba6) Can not find a connection for Instance s-20-VM. Assuming the default connection.
2024-07-01 15:57:14,161 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-1:null) (logid:1974eba6) Groovy script '/etc/cloudstack/agent/hooks/libvirt-vm-state-change.groovy' is not available. Transformations will not be applied.
2024-07-01 15:57:14,162 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-1:null) (logid:1974eba6) Groovy scripting engine is not initialized. Data transformation skipped.
2024-07-01 15:57:14,224 INFO  [kvm.resource.LibvirtConnection] (agentRequest-Handler-3:null) (logid:b148d17a) No existing libvirtd connection found. Opening a new one
2024-07-01 15:57:14,225 WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-3:null) (logid:b148d17a) Can not find a connection for Instance v-21-VM. Assuming the default connection.
2024-07-01 15:57:14,336 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-3:null) (logid:b148d17a) Groovy script '/etc/cloudstack/agent/hooks/libvirt-vm-state-change.groovy' is not available. Transformations will not be applied.
2024-07-01 15:57:14,336 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-3:null) (logid:b148d17a) Groovy scripting engine is not initialized. Data transformation skipped.

UAnton avatar Jul 01 '24 16:07 UAnton

@rp- They are not created and boot correctly. If I don't press Rebbot - any resources aren't created in Linstor. This is the log from agent.log after I deleted SystemVM. Cloudstack tries to deploy the new.

2024-07-01 15:57:14,051 INFO  [kvm.resource.LibvirtConnection] (agentRequest-Handler-1:null) (logid:1974eba6) No existing libvirtd connection found. Opening a new one
2024-07-01 15:57:14,053 WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-1:null) (logid:1974eba6) Can not find a connection for Instance s-20-VM. Assuming the default connection.
2024-07-01 15:57:14,161 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-1:null) (logid:1974eba6) Groovy script '/etc/cloudstack/agent/hooks/libvirt-vm-state-change.groovy' is not available. Transformations will not be applied.
2024-07-01 15:57:14,162 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-1:null) (logid:1974eba6) Groovy scripting engine is not initialized. Data transformation skipped.
2024-07-01 15:57:14,224 INFO  [kvm.resource.LibvirtConnection] (agentRequest-Handler-3:null) (logid:b148d17a) No existing libvirtd connection found. Opening a new one
2024-07-01 15:57:14,225 WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-3:null) (logid:b148d17a) Can not find a connection for Instance v-21-VM. Assuming the default connection.
2024-07-01 15:57:14,336 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-3:null) (logid:b148d17a) Groovy script '/etc/cloudstack/agent/hooks/libvirt-vm-state-change.groovy' is not available. Transformations will not be applied.
2024-07-01 15:57:14,336 WARN  [kvm.resource.LibvirtKvmAgentHook] (agentRequest-Handler-3:null) (logid:b148d17a) Groovy scripting engine is not initialized. Data transformation skipped.

I would be more interested in the management server log and also if the template volume contains a correct systemvm image. Maybe it wasn't copied correctly and is now corrupt.

linstor rd l -s Aux/cs-name Aux/cs-vm-name

helps to find the systemvm image resource name

rp- avatar Jul 01 '24 16:07 rp-

Screenshot 2024-07-01 at 19 17 11

UAnton avatar Jul 01 '24 16:07 UAnton

Screenshot 2024-07-01 at 19 19 51

UAnton avatar Jul 01 '24 16:07 UAnton

(v-21-VM) Unable to orchestrate start VM instance {"id":21,"instanceName":"v-21-VM","type":"ConsoleProxy","uuid":"2fd470d1-7a05-42eb-bcda-01d6bb5ba8d3"} due to [Create volume from template (ID = 3) failed: Resource definition 'cs-bb428f3a-309e-11ef-8e9c-005056a8c932' not found.].

UAnton avatar Jul 01 '24 17:07 UAnton

(v-21-VM) Unable to orchestrate start VM instance {"id":21,"instanceName":"v-21-VM","type":"ConsoleProxy","uuid":"2fd470d1-7a05-42eb-bcda-01d6bb5ba8d3"} due to [Create volume from template (ID = 3) failed: Resource definition 'cs-bb428f3a-309e-11ef-8e9c-005056a8c932' not found.].

I'm afraid this is now, because you deleted the template copy and CloudStack doesn't know that it isn't there anymore. I think the easier recover here is to delete the current Linstor primary storage and create a new one.

rp- avatar Jul 01 '24 17:07 rp-

How can I delete storage? I don't see a Delete button.

UAnton avatar Jul 01 '24 17:07 UAnton

How can I delete storage? I don't see a Delete button.

You have to put it into maintenance first.

rp- avatar Jul 01 '24 18:07 rp-

Thanks! Now all ok!

UAnton avatar Jul 01 '24 19:07 UAnton

fixed in #9325

DaanHoogland avatar Jul 11 '24 08:07 DaanHoogland