
After CloudStack cancels a volume live migration, VMware keeps migrating the volume => CloudStack keeps the old/wrong location of the volume.

Open fabeulus opened this issue 2 years ago • 2 comments

ISSUE TYPE
  • Bug Report
COMPONENT NAME
API, Volume
CLOUDSTACK VERSION
4.17.1.0
CONFIGURATION

VMware vSphere

OS / ENVIRONMENT
SUMMARY

I tried to live-migrate a 4 TB volume with the cloudmonkey command shown under STEPS TO REPRODUCE. After the configured threshold of 120 minutes ("global settings" -> "job.cancel.threshold.minutes"), the job was canceled by CloudStack:

"jobresult": { "errorcode": 530, "errortext": "Unable to serialize: Job is cancelled as it has been blocking others for too long"

VMware had the new/correct location of the volume, but CloudStack kept the old/wrong location.
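The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `HypervisorTask`, `run_migration`, the datastore names, and the 10-minute tick are all hypothetical and do not correspond to CloudStack or vSphere code. It shows a management-side job that stops waiting at the threshold while the hypervisor-side task runs to completion, so the recorded location and the real location diverge.

```python
from dataclasses import dataclass


@dataclass
class HypervisorTask:
    """Hypothetical stand-in for a hypervisor-side relocate task."""
    target_datastore: str
    minutes_needed: int
    minutes_done: int = 0
    cancelled: bool = False

    def tick(self, minutes: int) -> None:
        # The task only makes progress while it has not been cancelled.
        if not self.cancelled:
            self.minutes_done = min(self.minutes_needed, self.minutes_done + minutes)

    @property
    def finished(self) -> bool:
        return not self.cancelled and self.minutes_done >= self.minutes_needed


def run_migration(db: dict, volume_id: str, task: HypervisorTask,
                  threshold_minutes: int) -> str:
    """Wait for the task, giving up once the job threshold is reached."""
    elapsed = 0
    while elapsed < threshold_minutes:
        task.tick(10)
        elapsed += 10
        if task.finished:
            db[volume_id] = task.target_datastore  # updated only on success
            return "succeeded"
    # Threshold reached: the job is cancelled, the task itself is left running.
    return "cancelled"


db = {"vol-1": "datastore-old"}
task = HypervisorTask(target_datastore="datastore-new", minutes_needed=300)
status = run_migration(db, "vol-1", task, threshold_minutes=120)

# The hypervisor keeps working after the management-side job is gone.
while not task.finished:
    task.tick(10)

actual_location = task.target_datastore if task.finished else "datastore-old"
print(status, db["vol-1"], actual_location)
```

In this model the job ends as `cancelled`, the database still says `datastore-old`, and the volume actually lives on `datastore-new`, which mirrors the state reported in this issue.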

STEPS TO REPRODUCE
* To simulate this, set the global setting "job.cancel.threshold.minutes" to a very small value, or adjust the volume size
* Migrate a volume to another storage so that the job is canceled within the specified time period (also possible via the GUI):
              CMK> migrate volume livemigrate=true storageid=xxxxg volumeid=xxxxxx
EXPECTED RESULTS
The migration in CloudStack was "canceled" correctly, but the migration in VMware should also have been aborted.
ACTUAL RESULTS
VMware finished the volume live migration, but CloudStack does not know the new location.
If the corresponding virtual machine is shut down and started again, the VM is "orchestrated" with the wrong volume information and no longer boots.

CloudStack error:

Unable to orchestrate start VM instance {id: "xxxx", name: "i-xxx-xxxxx-VM", uuid: "xxxx", type="User"} due to [Unable to start instance 'xxxxx' (xxxx), see management server log for details].

fabeulus avatar Feb 03 '23 11:02 fabeulus

Thanks for opening your first issue here! Be sure to follow the issue template!

boring-cyborg[bot] avatar Feb 03 '23 11:02 boring-cyborg[bot]

It looks like the job is cancelled in CloudStack but not in vCenter, which causes inconsistent information.

It might be good to have a background thread that scans the storage pools (CloudStack might already have one).
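As a rough illustration of what such a scan could do, here is a minimal sketch. The function `reconcile_volumes` and both dictionaries are hypothetical; a real implementation would read from the CloudStack database and from the hypervisor, and would probably need to skip volumes that still have a migration job in flight.

```python
def reconcile_volumes(db_locations: dict, hypervisor_locations: dict) -> list:
    """Correct recorded locations that differ from what the hypervisor reports.

    Returns a list of (volume_id, recorded, actual) for every corrected entry.
    Both arguments map a volume id to a storage pool name (assumed shape).
    """
    corrections = []
    for volume_id, recorded in list(db_locations.items()):
        actual = hypervisor_locations.get(volume_id)
        # Volumes unknown to the hypervisor are left alone in this sketch.
        if actual is not None and actual != recorded:
            corrections.append((volume_id, recorded, actual))
            db_locations[volume_id] = actual
    return corrections


recorded = {"vol-1": "datastore-old", "vol-2": "datastore-a"}
reported = {"vol-1": "datastore-new", "vol-2": "datastore-a"}
fixed = reconcile_volumes(recorded, reported)
print(fixed)
```

Only `vol-1` is corrected here, because it is the only volume whose recorded pool differs from the reported one.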

weizhouapache avatar Feb 11 '23 09:02 weizhouapache

The VMware client in CS supports cancelling the volume migration, or any other task, if it is in a cancelable state. It seems the jobs in the hypervisor are not cancelled when the parent job is cancelled due to the job cancel threshold time. Either the related hypervisor jobs have to be cancelled if possible, or the resources (VM, volume) have to be synced with their latest state some time after the job is cancelled, using a background thread. This may need a proper functional definition (requires detailed investigation): which hypervisors to support, which jobs are cancellable, which resources to sync, whether cleanup is required, any other actions to be taken, etc.

https://github.com/apache/cloudstack/blob/1383625c93e300c6b8d62b52ddfd090d3291fc74/vmware-base/src/main/java/com/cloud/hypervisor/vmware/util/VmwareClient.java#L785

https://github.com/apache/cloudstack/blob/1383625c93e300c6b8d62b52ddfd090d3291fc74/vmware-base/src/main/java/com/cloud/hypervisor/vmware/util/VmwareClient.java#L807-L814
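The two options described in this comment (cancel the hypervisor task if possible, otherwise sync the resource later) could be combined roughly as follows. This is a sketch only: `ChildTask`, `on_parent_job_cancelled`, and the `schedule_sync` callback are hypothetical names, not the API of the linked `VmwareClient` class or of CloudStack's job framework.

```python
from dataclasses import dataclass


@dataclass
class ChildTask:
    """Hypothetical hypervisor-side task spawned by a parent job."""
    name: str
    cancelable: bool
    state: str = "running"

    def cancel(self) -> bool:
        # Only running tasks in a cancelable state can be stopped.
        if self.cancelable and self.state == "running":
            self.state = "cancelled"
            return True
        return False


def on_parent_job_cancelled(tasks, schedule_sync) -> dict:
    """Cancel what can be cancelled; schedule a later resource sync for the rest."""
    outcome = {}
    for task in tasks:
        if task.cancel():
            outcome[task.name] = "cancelled"
        else:
            schedule_sync(task.name)  # resource state is reconciled later
            outcome[task.name] = "sync-scheduled"
    return outcome


pending_syncs = []
result = on_parent_job_cancelled(
    [ChildTask("relocate-vol-1", cancelable=True),
     ChildTask("relocate-vol-2", cancelable=False)],
    schedule_sync=pending_syncs.append,
)
print(result, pending_syncs)
```

The fallback branch matters because, as the comment notes, not every task is cancellable, so a sync path is needed even if cancellation is implemented.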

sureshanaparti avatar Jun 11 '24 08:06 sureshanaparti