kvm-guest-drivers-windows
viostor Reset to device, \Device\RaidPort3, was issued. VM is frozen.
Unfortunately we are experiencing the same issue as described in #623.
The error message "viostor Reset to device, \Device\RaidPort3, was issued." appears in the log and the VM is frozen/unresponsive. Only a cold reboot helps. This has already happened several times.
Environment:
- KVM / libvirt 3.0.0-4+deb9u5
- Storage: Ceph Nautilus 14.2.21, SSD pool
- Problem occurs on both Intel (Xeon E5-2643) and AMD (EPYC 7F72) compute nodes
Affected VMs:
- Windows Server 2016, virtio driver build 215
- Windows Server 2016, virtio driver 100.77.104.17100
- Windows Server 2012 R2, virtio driver 62.74.104.14100
It seems that primarily servers running MS SQL Server are affected.
The error message means that the storage did not respond within 30 seconds. In Ceph we see nothing close to that; latency of the affected volume is good. Disk performance and throughput on the Ceph SSD storage pool is excellent when measured with CrystalDiskMark or the SQL Server internal benchmark.
The issue seems to occur at arbitrary times, i.e. it is not related to the overall load on the storage cluster. The VM often keeps running for a few minutes after the first "viostor Reset to device" entries appear in the log; the first application-level problems can appear before the entire VM is frozen.
On a volume on which this error has occurred once, we can reproduce it as follows:
- Run CrystalDiskMark on that volume (or just copy files from that volume) while the volume is attached to the original server. The error appears in the log and the VM freezes within seconds.
When we cannot reproduce the issue:
- If the affected volume is attached to another VM, we cannot reproduce the problem: running the same operation (e.g. CrystalDiskMark or copying files) on the same volume while it is attached to another VM does not trigger the error.
- Running the same operation on a (newly created) volume on the initial VM does not reproduce the problem.
Apparently "only" this specific combination of volume and VM causes the problem.
We have been able to resolve the issue, at least for some time, by copying the data from the affected volume to a new volume (while attached to another VM) and then attaching the newly created volume to the initial VM.
This works for a while before the error reoccurs.
It also seems that once the problem has occurred on a volume for the first time, it then reoccurs more often (it is not always possible to immediately replace the affected volume, as these are production servers).
Do you have any idea what could be causing this problem and how we can work around it? Thank you very much.
@mherlitzius The problem is supposed to be fixed for both vioscsi and viostor with the https://github.com/virtio-win/kvm-guest-drivers-windows/pull/684 and https://github.com/virtio-win/kvm-guest-drivers-windows/pull/683 commits respectively.
It is quite strange that you can reproduce the problem on WS2016 with build 215, since it was tested and verified by our QE: https://bugzilla.redhat.com/show_bug.cgi?id=2013976 and https://bugzilla.redhat.com/show_bug.cgi?id=2028000. Would it be possible to share the system event log file from the WS2016 system?
Thanks, Vadim.
@vrozenfe Thank you for your timely reply. I will ask for the event log from the WS2016 machine running build 215. If this is fixed with build 215, is it enough to update the driver or should the drives or the whole VM be rebuilt?
@mherlitzius Update should be enough. However, you might be asked to reboot the VM after that.
@vrozenfe Please see attached the event log. The error just appeared this morning. srv-t-jdbib1.zip
@mherlitzius Thanks a lot. By any chance, did you change either of the two parameters "TimeoutValue" or "IoTimeoutValue" mentioned in https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623? I'm asking because the reset event is coming every 4 minutes instead of the default 30 seconds. Another interesting thing is the bunch of 153 events. Can you please export the system event log in .evtx format instead of .csv? That will make it possible to check and analyse the SRB status returned by Ceph.
Best, Vadim.
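For reference, exporting the System log in .evtx format can be done from an elevated prompt inside the guest, for example (the output path is just a placeholder):
wevtutil epl System C:\temp\system.evtx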
@vrozenfe There are no entries for the Storport miniport registry keys. There is an entry for the SCSI miniport.
Can I provide you the event log via another channel as it contains customer specific information? Thanks.
Best, Matthias
Did you get any resolution to this? I have the same issue on Windows Server 2019 / SQL Server 2019 using VirtIO 0.1.229.
I am also interested to know if there was any solution, or a planned new release of the virtio drivers. We have approx. 90 Windows VMs, and this issue only happens to 3 of them for some reason (two of them are Windows Server 2019, and the last one is Windows Server 2022).
Let me know if I can provide any logs or configuration that could be useful to narrow down the issue. The freeze and error happen sporadically: sometimes several times a week, and sometimes it can go weeks between occurrences. But only on these 3 VMs for some reason.
@melfacion @thinkingcap build 229 is completely up to date. If you didn't adjust "IoTimeoutValue" (https://learn.microsoft.com/en-us/windows-hardware/drivers/storage/registry-entries-for-storport-miniport-drivers), please do so, increasing it gradually (60/120/180) and checking whether it makes any difference. Please note that the VM needs to be restarted every time this parameter is changed. Those three VMs: are they running on the same or different hosts? Do they all use local or network storage? It would be nice if you could share the Windows event log files with me at vrozenfe_at_redhat_dot_com.
Thank you, Vadim.
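For illustration, such a timeout change could be applied from an elevated PowerShell prompt inside the guest roughly as follows (this assumes the vioscsi miniport; substitute viostor for virtio-blk disks, and the 90-second value is only an example):
# create the Parameters key if it does not already exist
New-Item -Path "HKLM:\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters" -Force | Out-Null
# set the per-miniport I/O timeout (in seconds); a VM restart is needed afterwards
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters" -Name IoTimeoutValue -PropertyType DWord -Value 90 -Force | Out-Null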
@vrozenfe: different hosts, different datacenters, different storage pools. They all use Ceph for network storage. The hypervisor is Proxmox (currently 7.3-6 in production and 7.4-3 in the test datacenter). Two of the VMs are in our production datacenter and are running customer workloads: one is a Qlik Sense application, the other is a WSUS update server. The third one, in a different datacenter, is one of our internal VMs used as a jump server / Windows Terminal Services host for accessing our customers. So the WSUS update server and our internal server can easily be restarted for troubleshooting whenever needed; the last one is a customer-facing solution and should be up and running at all times.
I exported the events from all sources for some time before the last crash and some time after. Will send it to you as requested.
When I get the error "Reset to device, \Device\RaidPort2, was issued." I see the following in the host Proxmox syslog:
QEMU[71107]: kvm: virtio: zero sized buffers are not allowed
Once I also got this, with the same impact on the guest:
QEMU[2198]: kvm: Desc next is 3
Hypervisor is Proxmox
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-13 (running version: 7.4-13/46c37d9c)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.1: 7.3-6
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-6.1.15-1-pve: 6.1.15-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-4
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve
qm config
agent: 1
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 12
cpu: Skylake-Server
description: #### Windows Server 2016%0A#### SQL Server 2019
efidisk0: SSD-R10:vm-201-disk-0,efitype=4m,format=raw,pre-enrolled-keys=1,size=528K
ide0: none,media=cdrom
machine: pc-q35-7.1
memory: 61440
meta: creation-qemu=7.1.0,ctime=1676181242
name: IRIS
net0: virtio=86:DB:0D:46:C1:58,bridge=vmbr0,firewall=1,tag=25
numa: 1
onboot: 1
ostype: win10
scsi0: SSD-R10:vm-201-disk-1,discard=on,format=raw,iothread=1,size=96G
scsi1: SSD-R10:vm-201-disk-2,discard=on,iothread=1,size=320G
scsi2: SSD-R10:vm-201-disk-3,discard=on,iothread=1,size=160G
scsi3: SSD-R10:vm-201-disk-4,discard=on,iothread=1,size=64G
scsi4: SSD-R10:vm-201-disk-5,discard=on,iothread=1,size=448G
scsihw: virtio-scsi-single
smbios1: uuid=973a474f-45aa-4caa-ab0e-bace1c0aa76e
sockets: 1
tags: ims;windows
vmgenid: 62178394-12fe-4df7-ae25-839471658f30
The workload is SQL Server / Analysis Services 2019 running ETL jobs and processing cubes. I'll take a look at IoTimeoutValue - currently the reset message comes through every 60 seconds once the problem starts.
I actually get this on 2 hosts, this one more than the other. Both hosts are Dell servers with PERC hardware RAID 10, local storage using Intel DC SATA SSDs.
Since it is SQL Server, it might be useful to try reducing the maximum transfer size by specifying the "PhysicalBreaks" key in the registry: https://access.redhat.com/solutions/6963779. This key works for the virtio-scsi device only. (If I'm not mistaken, WSUS can also use the SQL engine.)
Vadim.
It's protected content, unfortunately.
@thinkingcap No problem. There is some open information in the driver code itself: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/vioscsi/vioscsi.c#L485. The "PhysicalBreaks" registry key allows redefining the maximum transfer size supported by the vioscsi miniport driver. Without this key the default value is 512, which makes the maximum transfer size equal to 2MB: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/vioscsi/vioscsi.c#L501. IIRC on SQL Server the maximum transfer size can go up to 4MB, which means that SQL Server can generate 1MB/2MB loads. 2MB block transfers work really well for direct-LUN SSD backends but may not be optimal for some older or network-attached storage. I would suggest setting "PhysicalBreaks" to 0x1f or 0x3f to see if it helps to fix the problem.
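A hedged sketch of setting that key from an elevated PowerShell prompt inside the guest (the Parameters\Device path follows the Red Hat article linked above, and 0x3f is just one of the suggested values):
# create the Device subkey if it does not already exist
New-Item -Path "HKLM:\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" -Force | Out-Null
# limit the maximum transfer size used by the vioscsi miniport
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" -Name PhysicalBreaks -PropertyType DWord -Value 0x3f -Force | Out-Null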
@vrozenfe I assume it requires a reboot when changed? Storage is local SATA SSDs (Intel/Solidigm S4620) in RAID 10 (Dell PERC).
@thinkingcap The system disk requires a reboot; for a data disk a disable/enable cycle is enough.
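If you prefer scripting that disable/enable cycle, a rough sketch with the PnpDevice cmdlets (available on recent Windows versions; the instance id below is a placeholder you have to look up first):
# list disk devices to find the instance id of the data disk
Get-PnpDevice -Class DiskDrive | Format-Table FriendlyName, InstanceId
# cycle the chosen data disk (replace <InstanceId> with the value from above)
Disable-PnpDevice -InstanceId "<InstanceId>" -Confirm:$false
Enable-PnpDevice -InstanceId "<InstanceId>" -Confirm:$false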
Hello, I'm also haunted by this problem. Running Proxmox 7.2 in production, one Windows Server 2022 VM with MSSQL 2019 CU21 and virtio drivers 0.1.225 is affected. The storage backend for all VMs in the cluster is Ceph 16.2.7 (using 10Gbit SFP+, jumbo frames with 9000 MTU). Sometimes the VM runs for weeks without issues, sometimes there are multiple lockups in a single day which require a hard reset.
Same symptoms:
- Windows Event Viewer shows "Reset to device \Device\RaidPort2 was issued." multiple times, once each minute
- After a few of those entries, the Proxmox host shows: kvm: virtio: zero sized buffers are not allowed
- Then the VM is locked up and requires a hard reset
@vrozenfe According to your suggestions I should first try setting these registry values, right? As you can see in the screenshot it happens exactly every 60 seconds, which means my default IO timeout is 60 seconds, right? So I would adjust it to 90 seconds for the vioscsi driver first?
- HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters
  - IoTimeoutValue = 0x5a (90)
- HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device
  - PhysicalBreaks = 0x3f (63)
Does 0x3f make sense for CEPH with 9000 MTU?
@MaxXor I would try reducing the transfer size ("PhysicalBreaks") first to see if it helps to solve the problem. Btw, there is a different option for adjusting the transfer size without touching the Windows Registry: upstream and RHEL QEMU have a "max_sectors" parameter for that. "-device virtio-scsi-pci,id=scsi-vioscsi1,max_sectors=63" will have the same effect as "PhysicalBreaks = 0x3f (63)". Unfortunately, I don't know whether it works for Proxmox or not; you should probably clarify that with the Proxmox support team.
I don't recall ever seeing "kvm: virtio: zero sized buffers are not allowed" before. But since you and @thinkingcap both mentioned this message, could you both post the QEMU version and the QEMU command line?
0x3f makes the maximum transfer size equal to 256K. In my understanding, the 30-second default timeout should be enough for jumbo-frame network storage to complete such a transfer, even for write operations, I think. Unfortunately I have very limited knowledge of Ceph, even though we use it as one of the backends for our internal storage testing. Ideally, I would run some storage performance tests inside the VM to compare latency and standard deviation for different transfer sizes (4K/16K/64K/256K).
Best, Vadim.
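One way to run such a comparison inside the guest would be a series of diskspd runs, one per transfer size (a sketch only, assuming diskspd is available; the test file, duration and queue depth are arbitrary example values):
# 60-second random test per block size, 30% writes, with latency statistics
foreach ($bs in "4K","16K","64K","256K") {
    .\diskspd.exe "-b$bs" -d60 -o8 -t2 -r -w30 -L -c1G D:\diskspd-test.dat
}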
Thanks for your answer; unfortunately neither IoTimeoutValue (0x5a) nor PhysicalBreaks (0x3f) has helped yet. I really don't understand it. Large file transfers, e.g. copying a file, work fine, which should definitely put load on Ceph and cause higher IO latencies; it copies at 500 MB/s with no errors. But if I keep MSSQL running and monitor the disk IO, it just locks up again with low load (max. 30 MB/s) after 1-2 hours.
I'm using the Proxmox QEMU package at version 7.2.0-8, which relates to this git commit: https://github.com/proxmox/pve-qemu/tree/93d558c1eef8f3ec76983cbe6848b0dc606ea5f1 It's the latest available for Proxmox 7.x.
Another thing I noticed, which might be worth mentioning is that only one of the vioscsi disks locks up. During the reset events in windows I can still read/write fine to all other disks, except the one which is locked up. The VM I'm using has 4 vioscsi disks.
I will post the QEMU command line later.
Did you reboot after applying those registry changes?
Yes. It's not a critical machine. I can reboot it at any time if it helps us figure out the root cause. :slightly_smiling_face:
What does your VM config look like? Can you post a qm config <VMID>?
agent: 1
balloon: 0
boot: order=scsi0
cores: 6
cpu: host
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1657105335
net0: virtio=76:80:CE:B7:21:69,bridge=vmbr4
numa: 0
onboot: 1
ostype: win11
scsi0: ceph_storage:vm-104-disk-0,iothread=1,discard=on,size=100G
scsi1: ceph_storage:vm-104-disk-1,iothread=1,cache=writeback,discard=on,size=500G
scsi2: ceph_storage:vm-104-disk-2,iothread=1,cache=writeback,discard=on,size=500G
scsi3: ceph_storage:vm-104-disk-3,iothread=1,discard=on,size=500G
scsihw: virtio-scsi-single
smbios1: uuid=5839521f-0c23-431c-9806-c7ee8aab6104
sockets: 1
vmgenid: 1c6d5ca3-862b-49c5-848d-453cbe070164
Only scsi1 and scsi2 are affected (I already tried without cache=writeback, but it didn't help).
I would also change cache to Default (no cache). I'm also not familiar with Ceph, so I'm unsure of the performance impact.
Back to square one here, just got "kvm: virtio: zero sized buffers are not allowed" again.
To mitigate this problem, we switched to local storage. Unfortunately, we have not found any other solution either. Thanks for your help anyway, Vadim!
@thinkingcap I just noticed that the automatic trim operations for SSDs were running at the time of the last lockup. I disabled it for now to see if the problem disappears.
@mherlitzius I'm already using local SSD-backed storage (from the start). @MaxXor Our issues occur under load when running SQL / AS jobs.
Proxmox 8.0.3 ran without issue for 5 days, but I got the less frequent error today:
QEMU[4927]: kvm: Desc next is 3
The error in the VM is the same.
So a different error message on the host, but the result is the same.
@thinkingcap I just noticed that the automatic trim operations for SSDs were running at the time of the last lockup. I disabled it for now to see if the problem disappears.
That process throws an error on my Server 2016 VM:
The volume DATA (D:) was not optimized because an error was encountered: The slab consolidation / trim operation cannot be performed because the volume alignment is invalid. (0x89000029)
If I run it from PowerShell, it succeeds:
The storage optimizer successfully completed retrim on DATA (D:)
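For reference, that manual retrim maps to something like the following from an elevated PowerShell prompt, and the scheduled optimization can be disabled while testing (the task path below is the usual default and should be verified on the system):
# run a retrim on the data volume manually
Optimize-Volume -DriveLetter D -ReTrim -Verbose
# temporarily disable the scheduled storage optimization task
schtasks /Change /TN "\Microsoft\Windows\Defrag\ScheduledDefrag" /Disable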
Anecdotal evidence here, but I checked the change log for the two servers I have this issue on and noticed the more stable one had its PERC RAID controller cache mode changed to write-through and "no read ahead" on June 1. I haven't seen this error on it now for close to 6 weeks. I updated the PERC controller to the same settings on July 3rd for server 2 (the date the issue was last seen) and am at 8 days with no issues.
Update: 20-07-2023
QEMU[1935]: kvm: Desc next is 4
So 17 days without an issue.
From here
https://github.com/proxmox/mirror_qemu/blob/8844bb8d896595ee1d25d21c770e6e6f29803097/hw/virtio/virtio.c#L1055
Unsure if that's a VirtIO driver error or comes from QEMU itself?