After CloudStack cancels a volume live migration, VMware keeps migrating the volume, so CloudStack retains the old (wrong) volume location.
ISSUE TYPE
- Bug Report
COMPONENT NAME
API, Volume
CLOUDSTACK VERSION
4.17.1.0
CONFIGURATION
VMware vSphere
OS / ENVIRONMENT
SUMMARY
I tried to live-migrate a 4 TB volume using the CloudMonkey command shown under STEPS TO REPRODUCE. After the configured threshold of 120 minutes (global setting "job.cancel.threshold.minutes"), CloudStack canceled the job:
"jobresult": { "errorcode": 530, "errortext": "Unable to serialize: Job is cancelled as it has been blocking others for too long"
VMware had the new (correct) location of the volume, but CloudStack kept the old (wrong) one.
STEPS TO REPRODUCE
* To simulate this, set the global setting "job.cancel.threshold.minutes" to a very small value, or adjust the volume size
* Migrate a volume to another storage so that the migration runs longer than the configured threshold and gets canceled (also possible via the GUI):
CMK> migrate volume livemigrate=true storageid=xxxxg volumeid=xxxxxx
EXPECTED RESULTS
CloudStack canceled the migration correctly, but the migration in VMware should also have been aborted.
ACTUAL RESULTS
VMware finished the volume live migration, but CloudStack does not know the new location.
If the corresponding virtual machine is shut down and started again, the VM is "orchestrated" with the wrong volume information and no longer boots.
CloudStack error:
Unable to orchestrate start VM instance {id: "xxxx", name: "i-xxx-xxxxx-VM", uuid: "xxxx", type="User"} due to [Unable to start instance 'xxxxx' (xxxx), see management server log for details].
It looks like the job is cancelled in CloudStack but not in vCenter, which leaves the two with inconsistent information.
It might be good to have a background thread that scans the storage pools (CloudStack might already have one).
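To make the background-scan idea concrete, here is a minimal Java sketch of one reconciliation pass. It is only an illustration under stated assumptions: `VolumeLocationReconciler`, `reconcile`, and both maps (volume ID to datastore name) are hypothetical and are not CloudStack classes or APIs. A real implementation would read the recorded location from the database and the actual one from the hypervisor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not CloudStack's API.
final class VolumeLocationReconciler {

    /**
     * Compares the datastore recorded by the management server with the one
     * reported by the hypervisor. Records that differ are overwritten with the
     * hypervisor's value, and the IDs of the corrected volumes are returned.
     * Volumes the hypervisor does not report are left untouched.
     */
    static List<String> reconcile(Map<String, String> recorded,
                                  Map<String, String> hypervisorView) {
        List<String> corrected = new ArrayList<>();
        for (Map.Entry<String, String> entry : recorded.entrySet()) {
            String actual = hypervisorView.get(entry.getKey());
            if (actual != null && !actual.equals(entry.getValue())) {
                entry.setValue(actual);          // adopt the hypervisor's location
                corrected.add(entry.getKey());
            }
        }
        return corrected;
    }
}
```

In the scenario from this issue, the migrated volume would show up in the returned list because vCenter reports the new datastore while the record still points at the old one. The `recorded` map must be mutable (for example a `HashMap`), since the sketch updates it in place.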
The VMware client in CloudStack supports cancelling a volume migration, or any other task, if it is in a cancelable state. It seems the jobs in the hypervisor are not cancelled when the parent job is cancelled because of the job cancel threshold. Either the related hypervisor jobs have to be cancelled where possible, or the resources (VM, volume) have to be synced to their latest state by a background thread some time after the job is cancelled. This probably needs a proper functional definition (and detailed investigation): which hypervisors to support, which jobs are cancellable, which resources to sync, whether cleanup is required, any other actions to take, etc.
https://github.com/apache/cloudstack/blob/1383625c93e300c6b8d62b52ddfd090d3291fc74/vmware-base/src/main/java/com/cloud/hypervisor/vmware/util/VmwareClient.java#L785
https://github.com/apache/cloudstack/blob/1383625c93e300c6b8d62b52ddfd090d3291fc74/vmware-base/src/main/java/com/cloud/hypervisor/vmware/util/VmwareClient.java#L807-L814
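The two options described above (cancel the hypervisor task if it is cancelable, otherwise mark the resource for a later sync) could be combined roughly as in the following Java sketch. This is not the code behind the links. `HypervisorTask` and `JobCancellationHandler` are hypothetical names introduced only for illustration, and `isCancelable()` stands in for whatever cancelable-state check the hypervisor client exposes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are illustrative and are neither
// CloudStack nor vSphere SDK classes.
interface HypervisorTask {
    String id();
    boolean isCancelable();   // stand-in for the hypervisor's cancelable-state check
    void cancel();
}

final class JobCancellationHandler {

    /**
     * Intended to run when the parent job is cancelled. Cancels every child
     * task that is cancelable and returns the IDs of the tasks that could not
     * be cancelled, so the caller can schedule a resource sync for them.
     */
    static List<String> cancelChildTasks(List<HypervisorTask> childTasks) {
        List<String> needsSync = new ArrayList<>();
        for (HypervisorTask task : childTasks) {
            if (task.isCancelable()) {
                task.cancel();
            } else {
                needsSync.add(task.id());   // will finish on its own; sync afterwards
            }
        }
        return needsSync;
    }
}
```

The sketch deliberately leaves open the questions raised above: which hypervisors and job types it would apply to, and what cleanup a cancelled task requires.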