Storage issues on VMware 8.0U1 (8.0.1.0)
To support VMware 8.0U1 (8.0.1.0), I made the following manual database changes:
INSERT IGNORE INTO `cloud`.`hypervisor_capabilities` (uuid, hypervisor_type, hypervisor_version, max_guests_limit, security_group_enabled, max_data_volumes_limit, max_hosts_per_cluster, storage_motion_supported, vm_snapshot_enabled) values (UUID(), 'VMware', '8.0.1.0', 1024, 0, 59, 64, 1, 1);
and
INSERT IGNORE INTO `cloud`.`guest_os_hypervisor` (uuid,hypervisor_type, hypervisor_version, guest_os_name, guest_os_id, created, is_user_defined) SELECT UUID(),'VMware', '8.0.1.0', guest_os_name, guest_os_id, utc_timestamp(), 0 FROM `cloud`.`guest_os_hypervisor` WHERE hypervisor_type='VMware' AND hypervisor_version='8.0.0.1';
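For anyone repeating this for another ESXi build, here is a minimal sketch that generates the same two statements for an arbitrary version pair. The function name `capability_sql` is hypothetical, and the limit values (1024, 0, 59, 64, 1, 1) are simply copied from the statement above, not verified for any particular release. Manual database changes like these are not an official upgrade path.

```python
from __future__ import annotations


def capability_sql(new_version: str, base_version: str) -> list[str]:
    """Build the two INSERT statements quoted above for a new VMware version.

    new_version is the version being added (for example '8.0.1.0') and
    base_version is the existing version whose guest OS mappings are copied.
    """
    for version in (new_version, base_version):
        # Versions are interpolated into SQL, so accept dotted digits only.
        if not all(part.isdigit() for part in version.split(".")):
            raise ValueError(f"unexpected version string: {version!r}")

    capabilities = (
        "INSERT IGNORE INTO `cloud`.`hypervisor_capabilities` "
        "(uuid, hypervisor_type, hypervisor_version, max_guests_limit, "
        "security_group_enabled, max_data_volumes_limit, max_hosts_per_cluster, "
        "storage_motion_supported, vm_snapshot_enabled) "
        f"VALUES (UUID(), 'VMware', '{new_version}', 1024, 0, 59, 64, 1, 1);"
    )
    guest_os = (
        "INSERT IGNORE INTO `cloud`.`guest_os_hypervisor` "
        "(uuid, hypervisor_type, hypervisor_version, guest_os_name, "
        "guest_os_id, created, is_user_defined) "
        f"SELECT UUID(), 'VMware', '{new_version}', guest_os_name, guest_os_id, "
        "utc_timestamp(), 0 FROM `cloud`.`guest_os_hypervisor` "
        f"WHERE hypervisor_type='VMware' AND hypervisor_version='{base_version}';"
    )
    return [capabilities, guest_os]


if __name__ == "__main__":
    for statement in capability_sql("8.0.1.0", "8.0.0.1"):
        print(statement)
```

The output can be reviewed and then pasted into a MySQL session against the `cloud` database.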
However, I ran into several issues, most of them related to storage:
- System VMs and VRs boot into a read-only file system, but work fine after a soft reboot (Ctrl+Alt+Delete) or a hard reboot.
- Sometimes a VM cannot be powered on. This mostly happens on the first VM deployment from a new template. This has been addressed by commit https://github.com/apache/cloudstack/pull/7380/commits/a2fcf0d66ad3962b61d5aa12a4b17b96a2cca840 in PR #7380.
- Marvin test failure in test_internal_lb.py: the test works inside some VMs, but in others it fails with the error below. This has been addressed by commit https://github.com/apache/cloudstack/pull/7380/commits/b1c08fddd6104fdd823411fbc1311fe2a136f307 in PR #7380.
  sshClient: DEBUG: {Cmd: /usr/bin/wget -T3 -qO- --user=admin --password=password http://10.1.2.12:8081/admin?stats via Host: 10.0.52.187} {returns: ["/usr/bin/wget: '/usr/lib/libpcre.so.1' is not an ELF file", "/usr/bin/wget: can't load library 'libpcre.so.1'"]}
- Kubernetes control/worker nodes have a read-only file system.
- Kubernetes cluster is stuck at Starting.
- Error cloning VM from template in primary storage:
2023-04-29 08:30:05,771 ERROR [c.c.s.r.VmwareStorageProcessor] (DirectAgent-285:ctx-a6342678 10.0.32.132, job-2661/job-2662, cmd: CopyCommand) (logid:1e91ee05) Error cloning VM from template in primary storage: %sUnable to access file /vmfs/volumes/e243b6f2-2c50ea8e/c86c7187-363a-4b41-baa1-267b78ccdc69/c86c7187-363a-4b41-baa1-267b78ccdc69-000001.vmdk since it is locked
java.lang.RuntimeException: Unable to access file /vmfs/volumes/e243b6f2-2c50ea8e/c86c7187-363a-4b41-baa1-267b78ccdc69/c86c7187-363a-4b41-baa1-267b78ccdc69-000001.vmdk since it is locked
at com.cloud.hypervisor.vmware.util.VmwareClient.waitForTask(VmwareClient.java:426)
at com.cloud.hypervisor.vmware.mo.VirtualMachineMO.createFullClone(VirtualMachineMO.java:856)
at com.cloud.storage.resource.VmwareStorageProcessor.createVMFullClone(VmwareStorageProcessor.java:772)
at com.cloud.storage.resource.VmwareStorageProcessor.cloneVMFromTemplate(VmwareStorageProcessor.java:3836)
ISSUE TYPE
- Bug Report
COMPONENT NAME
VMware
CLOUDSTACK VERSION
4.18 + manual DB changes
CONFIGURATION
OS / ENVIRONMENT
SUMMARY
STEPS TO REPRODUCE
EXPECTED RESULTS
ACTUAL RESULTS
Regarding CKS, it is worth mentioning that:
From ACS 4.16 onwards, if a CKS cluster is to be deployed on VMware, the 'vmware.create.full.clone' configuration parameter will need to be set to true, so as to allow resizing of root volumes of the cluster nodes.
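The parameter above is a global setting, so besides the UI it can be changed through the `updateConfiguration` API. The following is a rough sketch of how such a signed query string could be built, assuming the usual CloudStack signing scheme (query sorted by parameter name, lowercased, signed with HMAC-SHA1 and Base64-encoded). The helper name `signed_query` and the key values are placeholders, and nothing is sent over the network.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_query(params: dict, api_key: str, secret_key: str) -> str:
    """Return a signed CloudStack API query string for the given parameters."""
    all_params = dict(params, apiKey=api_key, response="json")
    # URL-encode each value, then sort the pairs by lowercased parameter name.
    pairs = sorted(
        ((key, quote(str(value), safe="")) for key, value in all_params.items()),
        key=lambda pair: pair[0].lower(),
    )
    query = "&".join(f"{key}={value}" for key, value in pairs)
    # The signature is computed over the lowercased, sorted query string.
    digest = hmac.new(
        secret_key.encode("utf-8"),
        query.lower().encode("utf-8"),
        hashlib.sha1,
    ).digest()
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return f"{query}&signature={signature}"


if __name__ == "__main__":
    # Placeholder credentials; real keys come from the user's account.
    print(signed_query(
        {"command": "updateConfiguration",
         "name": "vmware.create.full.clone",
         "value": "true"},
        api_key="YOUR_API_KEY",
        secret_key="YOUR_SECRET_KEY",
    ))
```

The resulting string would be appended to the management server's API endpoint URL. Some global settings only take effect after a management server restart, so check the setting's description before relying on it.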
by @rohityadavcloud
cc @weizhouapache @borisstoyanov @DaanHoogland @NuxRo I've started a discussion thread on VMware forum - https://communities.vmware.com/t5/ESXi-Discussions/VMware-disk-errors-when-booting-on-ESXi-8-0u1a/m-p/2980935#M289426
I tried to set up an mbx template and can consistently reproduce the issues with ESXi 8.0u1a (I don't think vCenter is the problem). I tried both NFS and local/datastore storage on ESXi 8.0u1 (using the latest build/ISO, VMware-VMvisor-Installer-8.0U1a-21813344.x86_64.iso).
by @weizhouapache
I did some more testing. Here are the results for two actions, (1) register template and (2) deploy VM:
- VCSA 8.0U1 and ESXi 8.0b: works
- Upgraded a host to 8.0c: works
- Upgraded a host to 8.0 U1:
  - If only the 8.0 U1 host is enabled, it does not work. If another 8.0c host is then enabled, deployvm still does not work (on the same primary storage).
  - If only the 8.0c host is enabled, it works. If another 8.0 U1 host is then enabled, deployvm also works (on the same primary storage).
- Thus, it looks like an issue introduced by the ESXi upgrade (between 8.0c and 8.0 U1). The only difference is which host handles the CopyCommand from secondary storage to primary storage.
2023-05-30 16:46:02,790 INFO [vmware.util.VmwareContext] (agentRequest-Handler-9:job-104/job-105, cmd: CopyCommand) Connected, conn: sun.net.www.protocol.https.DelegateHttpsURLConnection:https://10.0.32.197/nfc/52e7741f-89f7-ed9c-01df-eea4c3eb6911/disk-0.vmdk
I suspect it is caused by a change in ESXi 8.0 U1 that might lead to data loss during NFC (Network File Copy). See https://docs.vmware.com/en/VMware-vSphere/8.0/rn/vsphere-esxi-801-release-notes/index.html
New file type for OSDATA volumes on SSD devices: vSphere 8.0 Update 1 adds a new file system type, VMFSOS, specifically for the ESX-OSData system partition on local SSD devices, which allows you to continue using virtual flash resources on other devices. The new file type prevents cases when you format an ESX-OSData volume on a local SSD device, and fsType returns a file of type Virtual Flash File System (VFFS). As a result, the disk backing of the ESX-OSData volume is listed under the Virtual Flash resources in vCenter, but such a disk belongs to the ESX-OSData volume and is not a part of the Virtual Flash resource pool.
@weizhouapache there's a new comment from a community member on the vmware community thread:
The issue you are experiencing is likely due to a change in the way that vSphere 8.0u1 handles storage. In 8.0u1, vSphere uses a new format for VMDK files, which is not compatible with older versions of vSphere. This is why you are not seeing the issue when you use 8.0 or older versions of vSphere.
There are a few things you can do to work around this issue:
You can upgrade your Apache CloudStack to a version that is compatible with vSphere 8.0u1.
You can create a new VMDK file in the older format. To do this, you will need to use the qemu-img command. For example, to create a 10GB VMDK file in the older format, you would use the following command:
qemu-img create -f raw /tmp/vmdk.raw 10G
Once you have created the new VMDK file, you can attach it to your VM and boot it up.
There also seems to be a new 8.0u2 release https://core.vmware.com/resource/whats-new-vsphere-8-update-2 ?
I'm having the same problem. Has anyone found a solution?
@omurozlu no, we will revisit 8.0U1 and 8.0U2 support in 4.19.1/4.20.0.
This issue still exists in 4.19.0.0 RC4, so VMware 8.0.1 is still unsupported.
cc @vladimirpetrov @shwstppr
I have done some tests with CS 4.18.1, ESXi 8.0u2 (ESXi-8.0U2-22380479-standard) and vCenter 8.0u2 and it looks like it's somewhat working. My environment is hosted in KVM and for the VMs to not have issues with the underlying storage I had to use RAW disk format instead of QCOW2. There are still issues though when deploying templates but they are intermittent which is odd, sometimes it doesn't work for 8 attempts in a row then it works (disk locked issue).
After a while I started getting read-only filesystem errors on my routers though.
Is there anything that CloudStack can do to fix this issue? Where exactly is the problem?
@alexandru-bagu I investigated the issue for several days last year, but unfortunately I could not find the root cause or a fix. Early this year I tested 4.20.0.0-SNAPSHOT with the new Debian 12 systemvm template (see #8497) and, surprisingly, it worked. I suspect the issue was caused by some Linux kernel changes, but I cannot confirm it. If that is true, some user VMs might be impacted as well.
I suggest using VMware 8.0, not 8.0u1/u2, which are not officially supported by ACS 4.19 but might be supported in ACS 4.20.
I was able to work around this problem by forcing an fsck on the first boot of the image. To do this, you need to run "touch /forcefsck" inside the image.
It worked on 8.0u3 with vSAN ESA.
Detailed description below:
tar -xf systemvmtemplate-4.19.1-vmware.ova
qemu-img convert -O raw systemvmtemplate-4.19.1-vmware-disk1.vmdk your_file.raw
losetup -fP your_file.raw
mount /dev/loop0p6 /mnt
touch /mnt/forcefsck
umount /mnt
losetup -d /dev/loop0
qemu-img convert -O vmdk -o subformat=streamOptimized your_file.raw systemvmtemplate-4.19.1-vmware-disk1.vmdk
ls -la systemvmtemplate-4.19.1-vmware-disk1.vmdk
# Changed ovf:size with the new image size in bytes from "ls -la systemvmtemplate-4.19.1-vmware-disk1.vmdk" inside file systemvmtemplate-4.19.1-vmware.ovf
sha256sum --tag systemvmtemplate-4.19.1-vmware.ovf > systemvmtemplate-4.19.1-vmware.mf
sha256sum --tag systemvmtemplate-4.19.1-vmware-disk1.vmdk >> systemvmtemplate-4.19.1-vmware.mf
tar -cvf systemvmtemplate-4.19.1-vmware.ova systemvmtemplate-4.19.1-vmware.ovf systemvmtemplate-4.19.1-vmware-disk1.vmdk systemvmtemplate-4.19.1-vmware.mf
# Published a new template image inside UI
# Changed global settings router.template.vmware to new image name
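The manual `ovf:size` edit and the manifest regeneration in the steps above can also be scripted. Below is a minimal sketch, assuming the OVF lists the disk as a `<File>` element carrying `ovf:href` and `ovf:size` attributes; the function names are hypothetical and this has not been run against the actual systemvm template.

```python
import hashlib
import re
from pathlib import Path


def set_ovf_file_size(ovf_text: str, href: str, size: int) -> str:
    """Return ovf_text with ovf:size replaced on the <File> entry for href."""

    def fix(match: re.Match) -> str:
        element = match.group(0)
        if f'ovf:href="{href}"' not in element:
            return element  # a different file reference, leave untouched
        return re.sub(r'ovf:size="\d+"', f'ovf:size="{size}"', element)

    return re.sub(r"<File\b[^>]*>", fix, ovf_text)


def manifest_line(path: Path) -> str:
    """Format one manifest entry the way `sha256sum --tag` prints it."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so a large VMDK is not loaded into memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return f"SHA256 ({path.name}) = {digest.hexdigest()}"


def refresh_metadata(ovf: Path, vmdk: Path, manifest: Path) -> None:
    """Update ovf:size for the repacked disk, then rewrite the manifest."""
    new_text = set_ovf_file_size(ovf.read_text(), vmdk.name, vmdk.stat().st_size)
    ovf.write_text(new_text)
    manifest.write_text(manifest_line(ovf) + "\n" + manifest_line(vmdk) + "\n")
```

Calling `refresh_metadata(Path("systemvmtemplate-4.19.1-vmware.ovf"), Path("systemvmtemplate-4.19.1-vmware-disk1.vmdk"), Path("systemvmtemplate-4.19.1-vmware.mf"))` after the second `qemu-img convert` would stand in for the `ls -la` check, the hand edit of the OVF, and the two `sha256sum --tag` commands. The final `tar -cvf` step stays the same.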
Thanks a lot for the update, @leolns.
How long does fsck take in your environment?
It took only a few seconds to run and it only runs on the first boot.
This has been fixed by #9625. Tested OK on VMware 8.0.1/8.0.2/8.0.3 (#9591).
