kvm-guest-drivers-windows
[vioscsi] Fix SendSRB regression and refactor for optimum performance [viostor] Backport SendSRB improvements
Fixes regression in SendSRB of [vioscsi]
Related issues: #756 #623 #907 ... [likely others]
Regression was introduced in #684 between https://github.com/virtio-win/kvm-guest-drivers-windows/commit/7dc052d55057c2a4dc2f64639d2f5ad6de34c4af and https://github.com/virtio-win/kvm-guest-drivers-windows/commit/fdf56ddfb44028b5295d2bdf20a30f5d90b2f13e. Specifically due to commits https://github.com/virtio-win/kvm-guest-drivers-windows/commit/f1338bb7edd712bb9c3cf9c9b122e60a3f518408 and https://github.com/virtio-win/kvm-guest-drivers-windows/commit/fdf56ddfb44028b5295d2bdf20a30f5d90b2f13e (considered together).
PR includes:
- Send SRB fix
- Refactoring for optimum performance
- WPP instrumentation improvements
- SendSRB improvements backported into [viostor]
- Minor cleanup
The WPP improvements include:
- Macro optimisation
- Exit fn() insurance
- Prettify trace output
- New macros for FORCEDINLINE functions
Look out for 1x TODO + 2x FIXME... RFC please.
Tested on:
- QEMU emulator version 9.0.2 (pve-qemu-kvm_9.0.2-2) ONLY
- Linux 6.8.12-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) x86_64 GNU/Linux
- Hypervisor: Intel-based ONLY
- Guest: Windows Server 2019 (Win10 build) ONLY
- Directory backing ONLY
- Proxmox "VirtIO SCSI Single" with iothreads (multiple HBAs) was used mostly
- Limited testing done on single HBA with multiple LUNs (`iothread=0`)
- AIO = `threads` ONLY - UNTESTED on `io_uring` or `native` AIO (suspect this will fix issues here)
- Local storage backings (`raw` and `qcow`) ONLY, untested on networked backings
YMMV with other platforms and AIO implementations (but I suspect they will be fine).
Performance is significantly higher, even when compared to version 100.85.104.20800 in virtio-win package 0.1.208. I suspect this is mainly because this fix unlocks other improvements made when the regression was introduced.
Occasional throughput of 12+ GB/s was achieved in the guest (64KiB blocks, 32 outstanding queues and 16 threads).
Sequential reads performed at just below the backing storage max throughput at 7GB/s (1MiB blocks, 8 outstanding queues and 1 thread). Random reads came in at 1.1GB/s, 270K IOPS at 1850 µs latency (4KiB blocks, 32 outstanding queues and 16 threads). Write performance was approximately 93% of read performance.
I tested on a variety of local storage mediums but the above numbers are for a 7.3GB/s Read / 6GB/s Write 1M IOPS R/W NVMe SSD. So throughput was as expected but IOPS were only ~27%.... It will be interesting to see what numbers come up on a decently sized NVMe RAID0/10 JBOD...! Certainly beware of running on old crusty spindles..! 8^d
It is also worth mentioning that the ETW overhead when running an active trace with all flags set is about 8%.
Freedom for Windows guests once held captive...! 8^D
cc: @vrozenfe @JonKohler @sb-ntnx @frwbr @foxmox
Related external issues (at least confounded by this regression):
https://bugzilla.proxmox.com/show_bug.cgi?id=4295 https://bugzilla.kernel.org/show_bug.cgi?id=199727 ... [likely others]
@benyamin-codez
Thanks a lot.
Would it be possible to split this PR into two separate PRs - a WPP-related one and an SRB-related one?
Can you please share the qemu command line and the performance numbers?
If possible it will be nice to compare IOs/throughput numbers in relation to the number of queues, queue depth, IO size, CPU usage and IO latency for the old and new code. (However, if there is any problem with that I will try to test it by myself).
Best, Vadim.
@vrozenfe
Happy to help Vadim.
Can do the split. Gives me a chance to fix ProcessQueue() too (https://github.com/virtio-win/kvm-guest-drivers-windows/pull/1135/files#diff-b8f1732536a5020c949209736b628a0af83022fbbded42c0eaef7bc3149dd6c3L1475-R1493)...
I can probably run up some comparisons again too.
I'll use diskspd and post the results applicable to each split with the new PRs. How does that sound...?
I've used several different qemu command lines for various scenarios including with blkdebug to simulate failures.
Would you like me to post the one coincident with the performance stats?
It's likely to be all using threads with a HBA per disk. That ok? I can share my stat collector script if that helps.
Cheers, Ben
A few passing comments. First, fantastic work, I jumped out of bed seeing this this morning, very exciting stuff.
Some housekeeping, small stuff:
- Let's split this up as Vadim said. Smaller and more narrow the merrier. As you've seen, driver code is complex, it lives forever, and regressions/nuances can come up weeks/months/years later.
- On each commit, can we do a bit of long-hand in the commit msgs themselves? Code is code of course, but it's good to get a bit of explanation of the intentions, observations, impacts, etc.
- On each commit, include a signed-off-by (i.e. git commit -s)
- Avoid merge commits (IMHO), each PR should be just regular ole commits only.
Past that, RE sharing scripts/cmdlines/data - yes please! more the merrier, and it will only help the overall community here get better and stronger over time.
RE performance - very exciting stuff, this gets me going. When talking about it, it's great to talk about test X with IO pattern ABC, with both before and after numbers. Maybe even throw it in a simple table in markdown in the PR? Bonus points for exact incantations (like diskspd.exe -blah -here -stuff) so that others can reproduce on other configurations.
@JonKohler
Thanks for the contribution and review Jon.
I'll take the housekeeping on board and I'm happy to share my stat collector scripts and results too.
Give me a couple of days - I have some other priorities to deal with...
@frwbr
Thanks for dropping your post above, Friedrich, and for taking the time to build and test it out.
For those interested, check out my inline reply.
@vrozenfe @JonKohler @sb-ntnx @frwbr @foxmox
Please bear in mind that despite fixing the issue mentioned in #756 the performance is not reliable.
Given I've got limited time over the next few days, and I'm not aware of the timing for the next release, would you prefer I spend the time I do have on submitting the new SendSRB fix PR for review, or on hunting down the issue affecting performance first? Personally, I'm happy to continue working on it here until it's ready for prime time. One reason for this position is that I don't think it is as stable as the 100.85.104.20800 version. The extra performance the fix permits has affected stability and reliability.
Some pictures might help explain what I mean... Each of the following shows the end point of multiple runs of my stat collector. There should be 6 distinct peaks in I/O for each test run; the same for CPU... All are using `aio=threads` with `iothread` enabled. You will observe that it is very difficult to get stats that don't require caveats and interpretation.
This one is running just the SRB fix (this one doesn't show an example of CPU utilisation - my bad 8^d ):
This one is version 100.85.104.20800:
This one was a test running the code in this draft PR, but with the optimisations for ProcessQueue() removed:
This one is the same, but is running two HBAs with iothread enabled:
This last one might appear more reliable, but this is because it is only operating with throughput of 9 to 17% of the others which are all single HBA with iothread enabled, and you should notice it eventually becomes unreliable too. So this is why I am focusing on this issue at the moment.
Also, re the stat collector scripts, can someone suggest a suitable place in the tree, e.g. /vioscsi/stat-collector or /Tools/stat-collector or similar? I'll then do a PR for them too.
Let me know...
Fail on Disk Stress (LOGO). Not surprised. 8^(
FWIW, I have run my reproducer against a vioscsi build with this PR applied on top of 54442f070a44bcfa22602e0786759a0ee80222a6 a few more times, and still haven't seen the vioscsi "Reset to device \Device\RaidPortN was issued" messages + IO lockups from #756 so far. I have also applied only the hunk I suspect to be responsible (diff to 54442f07 is here) and have not seen the issue either.
Given I've got limited time over the next few days, and I'm not aware of the timing for the next release, would you prefer I spend the time I do have on submitting the new SendSRB fix PR for review, or on hunting down the issue affecting performance first?
IMHO, if this PR indeed contains a fix for the IO lockup issues from #756 (this would be the "SendSRB fix" you mention?), it would be amazing to get the fix in soon. So, if time permits, it would be great if you could open an isolated PR for the SendSRB fix. If the fix is indeed in this hunk, I'd expect the PR to be relatively small and thus easier to review than a combination of multiple things. Also, I'd hope the fix wouldn't have a huge performance impact (but this is just hope speaking, I haven't actually run any performance tests myself).
Though the other improvements to performance and tracing from this PR also sound promising, in comparison to the IO lockups they seem to be of lower priority. But this is just my opinion, YMMV :)
One reason for this position is that I don't think it is as stable as the 100.85.104.20800 version. The extra performance the fix permits has affected stability and reliability.
Can you elaborate what you mean by "stability and reliability" here? Do you mean the performance is not as uniform over time (more "bursty")?
Some pictures might help explain what I mean... Each of the following shows the end point of multiple runs of my stat collector. There should be 6 distinct peaks in I/O for each test run; the same for CPU... All are using `aio=threads` with `iothread` enabled. You will observe that it is very difficult to get stats that don't require caveats and interpretation.
Sorry, I'm not sure what I should be looking at in the screenshots and how to compare the different results. Could you explain in more detail what setup was tested in which scenario (maybe with the code git revision and QEMU command line), and share the diskspd outputs?
Fail on Disk Stress (LOGO). Not surprised. 8^(
Which version of the code fails, and in which sense does it fail?
Again, many thanks for your work!
Agreed - stability and lockup style bugs must come before optimization type work. If we can isolate that, let's peel that to a separate very skinny commit/PR and get that cooking first
@JonKohler @frwbr, I agree with you both. I think we are all on the same page.
I've got some answers to your questions Friedrich. I'm pretty sure I'm on a path now to getting it sorted. I've got to step away for a couple of hours, but I'll post here as soon as I can.
In the meantime, a picture to show what has to be compared. I'm sure you will agree this makes stats pretty useless unless you do relatively short stints over a very long period.
@JonKohler @frwbr, I agree with you both. I think we are all on the same page.
I've got some answers to your questions Friedrich. I'm pretty sure I'm on a path now to getting it sorted. I've got to step away for a couple of hours, but I'll post here as soon as I can.
Great! I myself will be away from computers for ~2 weeks now, but will be happy to review/test when I'm back. Again, thanks for your work!
For those watching:
I've spent a couple more hours tracing this through, and it looks like we have at least three issues.
First, the issue of notifications. That seems to be solved for the most part with fixes already in this PR.
Second, there are some not insignificant issues with spinlock management. This is what I am focused on at the moment.
If it wasn't obvious to anyone, we only issue StorPortReleaseSpinLock() for non-InterruptLock type spinlocks:
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/54442f070a44bcfa22602e0786759a0ee80222a6/vioscsi/helper.c#L527-L541
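For readers without the source open, here is a minimal sketch of the shape of the problem (hypothetical names, not the actual helper.c code): the unlock path releases the lock only when it is not an InterruptLock, so an InterruptLock taken via StorPortAcquireSpinLock() gets no matching StorPortReleaseSpinLock() on this path.

```c
#include <storport.h>

/* Hypothetical simplification of the unlock helper discussed above. */
VOID VQUnlockSketch(PVOID DeviceExtension, STOR_SPINLOCK LockType, PSTOR_LOCK_HANDLE LockHandle)
{
    if (LockType != InterruptLock) {
        /* DpcLock / StartIoLock paths get a matching release... */
        StorPortReleaseSpinLock(DeviceExtension, LockHandle);
    }
    /* ...but an InterruptLock acquired with StorPortAcquireSpinLock() falls
     * through here without being released - the asymmetry described above. */
}
```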
That's an issue, but interestingly, if we fix that, we can reliably reproduce kvm: virtio: zero sized buffers are not allowed when using DpcLock type locks in SendSRB(). This appears to be due to multiple asynchronous DPCs or higher-level interrupts being issued outside the bounds of sufficient queuing (I have seen both in traces). Using, or should I say reverting to, InterruptLock type spinlocks resolves this issue - likely because other types of spinlocks cannot override these. There is also another potential source in SendSRB() that might produce the same error coincident with any such spinlock contention:
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/54442f070a44bcfa22602e0786759a0ee80222a6/vioscsi/helper.c#L63-L64
I should mention that using MSI-X based spinlocks doesn't appear to cause any errors - noting I haven't properly stress tested them yet - but graphical representations of the I/O they produce look problematic and erratic. They also didn't immediately solve the irregularities in performance. We use InterruptSynchronizePerMessage, so the concurrent use of InterruptLock type spinlocks with StorPortAcquireSpinLock, or using StorPortSynchronizeAccess per, e.g., SendTMF(), are potential problems. Checking the impacts of different vector allocations is on my TODO list.
The third issue looks to be related to virtqueue assignment. It looks like there are several issues here which I am slowly unpacking... It appears in some places it is possible that a virtqueue might be replaced mid-process upon entering a new function. This could be another cause of kvm: virtio: zero sized buffers are not allowed.
The work continues in a limited fashion for the next few days...
For those watching:
I think I'm almost done and hope to provide a report here in the next 12 to 24 hours.
I'll then split up the commits into new PRs.
The performance reliability issue turned out to be a bit of a red herring. Proper preparation of the diskspd test files with a 2GiB size and random data contents using diskspd.exe -w100 -Zr -d30 <testfile> resolved that for the most part. A pic showing consistent I/O patterns:
A somewhat belated progress report:
So I've had trouble reproducing faults with diskspd.exe test files containing random data. The faults do occur with zeroed test files, which I find quite interesting...
It suggests there might be one or more race conditions at play when dealing with zeroed files / data. This could also potentially be a source for the kvm: virtio: zero sized buffers are not allowed errors, but note that here I am only speculating. In any case, we might be seeing errors during file allocation operations e.g. as zeroed-out extents are added, or where extents are verified to be zeroed. IIRC, this does occur both in SQL and ESE databases where errors have been reported.
In relation to my fixes: apart from the initial notification fix in SendSRB(), I have the fix for VioScsiVQUnlock() and some further improvements for DPC processing.
It's noteworthy that AFAICT, we really only do DPC over MSI-X now. It would seem that a call to use an InterruptLock type spinlock would now be a most atypical operation. Although I built a spinlock manager function capable of managing all types of spinlocks, including MSI/MSI-X spinlocks, the reality is that DpcLock type spinlocks are likely the only ones we would now explicitly call (StorPort automatically holds InterruptLock and StartIoLock spinlocks while executing some functions).
I also have a large amount of refactoring, much of which is perhaps of little benefit, but might be of some use. Some of the useful stuff is probably my tracing and instrumentation improvements. I will probably drop a branch on my fork with all my edits if anyone is interested. At the moment I am rebasing this work against master to produce smaller, targeted commits for new PRs. I'm pretty sure I also fixed a couple of orphaned sections from previous squashings, which this work will find (provided I didn't create the orphans from my own refactoring).
I did check virtqueue assignments and also MSI-X CPU affinity, and found them to be working correctly in a wide variety of operational scenarios. In my refactored solution, I did implement StorPortInitializePerfOpts() earlier in VioScsiHwInitialize(), i.e. first, but implementing the necessary instrumentation against master showed that master worked anyway with it the other way around.
I'm just making one last revisit to PhysicalBreaks, VIRTIO_MAX_SG and related relationships, but my environment includes some disks with .../queue/max_hw_sectors_kb:32767, which I presume results in scsi_config.max_sectors = 0xFFFF (65,535) via 32,767 * 1,024 / 512 = 0xFFFE ==> + 1 sector = 0xFFFF. For these devices I also get .../queue/max_segments:168, which was unexpected. I also have .../queue/max_hw_sectors_kb:128 for some other devices, but a mix of .../queue/max_segments:60 (SATA3 HDDs) and .../queue/max_segments:33 (NVMe SSDs) for these. I suspect QEMU is reporting the highest maximum of all attached devices + 1, i.e. scsi_config.max_sectors = 0xFFFF (65,535), and seg_max appears to simply be MAX_CPU - VIRTIO_SCSI_REQUEST_QUEUE_0... ¯\_(ツ)_/¯
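A quick worked check of that arithmetic, as a standalone snippet (assuming 512-byte sectors, as above):

```c
#include <stdio.h>

/* 32767 KiB of max_hw_sectors_kb converts to 65534 (0xFFFE) 512-byte sectors;
 * adding one more sector gives 0xFFFF, matching the scsi_config.max_sectors
 * value discussed above. */
int main(void)
{
    unsigned int max_hw_sectors_kb = 32767;
    unsigned int sectors = max_hw_sectors_kb * 1024u / 512u;   /* 65534 = 0xFFFE */

    printf("%u (0x%X) sectors; +1 sector = 0x%X\n", sectors, sectors, sectors + 1);
    return 0;
}
```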
To that end, can someone please confirm from a trace what they get during GetScsiConfig() (it follows right after PCI capability processing during driver initialisation, so near the top of the trace) for seg_max and max_sectors..?
The relevant results of the following host-side commands would also likely be of some assistance:
grep "" /sys/block/*/queue/max_segments
grep "" /sys/block/*/queue/max_hw_sectors_kb
@JonKohler and @frwbr (and anyone else so inclined), are you able to help with this?
@vrozenfe, is there any reason we don't enable the following?
STOR_PERF_DPC_REDIRECTION_CURRENT_CPU
STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO
I found these worked well provided VIRTIO_SCSI_CONTROL_QUEUE != MessageId=0...
Also, do we still need ConfigInfo->Dma32BitAddresses = TRUE for IA-32 support...?
Hope to wrap this up in short order... 8^d
@benyamin-codez
Technically, enabling STOR_PERF_DPC_REDIRECTION_CURRENT_CPU and STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO works quite well for storport miniport drivers and can improve performance a bit more. I tested it a bit on the viostor (virtio-blk) driver a long time ago but didn't add the relevant code to the virtio-scsi driver. In any case, to make the STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO code work, we need to have some per-CPU list to keep SRBs, extend the StartIo routine a bit, and make sure that we complete those SRBs in the per-CPU queues in case of bus/lun reset.
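For context, a rough sketch of how such flags are typically negotiated in a Storport miniport's HwInitialize path via StorPortInitializePerfOpts(). This is an illustration only, not the vioscsi code, and it omits the per-CPU SRB bookkeeping mentioned above as well as error handling.

```c
#include <storport.h>

/* Illustration only: query the perf options Storport offers, then request
 * the DPC-redirection and StartIo-completion flags discussed above. */
static VOID InitPerfOptsSketch(PVOID DeviceExtension)
{
    PERF_CONFIGURATION_DATA perfData = { 0 };

    perfData.Version = STOR_PERF_VERSION;
    perfData.Size = sizeof(PERF_CONFIGURATION_DATA);

    /* Query pass: Storport reports the flags it supports in perfData.Flags. */
    if (StorPortInitializePerfOpts(DeviceExtension, TRUE, &perfData) != STOR_STATUS_SUCCESS) {
        return;
    }

    /* Set pass: keep only the flags we want from those actually offered. */
    perfData.Flags &= (STOR_PERF_DPC_REDIRECTION |
                       STOR_PERF_DPC_REDIRECTION_CURRENT_CPU |
                       STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO);
    StorPortInitializePerfOpts(DeviceExtension, FALSE, &perfData);
}
```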
Regarding Dma32BitAddresses: on a real hardware HBA it should be a must to let the DMA engine work properly. I don't know if enabling or disabling this bit will have any implication for qemu virtio - I never tried to turn it off - but I know for sure that some HW vendors run this code on top of FPGA-implemented virtio adaptors, and this is the reason why we will keep it turned on :)
Hi @benyamin-codez!
Unfortunately I know very little about Windows internals, so can't comment on your observations w.r.t. the different types of spin locks.
So I've had trouble reproducing faults with diskspd.exe test files containing random data. The faults do occur with zeroed test files, which I find quite interesting...
This is quite interesting. I haven't run the reproducer with all-zero test files so far, only with random files. Do I understand correctly these faults are not fixed by the notifications patch?
To that end, can someone please confirm from a trace what they get during GetScsiConfig() (it follows right after PCI capability processing during driver initialisation, so near the top of the trace) for seg_max and max_sectors..?
The relevant results of the following host-side commands would also likely be of some assistance:
grep "" /sys/block/*/queue/max_segments
grep "" /sys/block/*/queue/max_hw_sectors_kb
@JonKohler and @frwbr (and anyone else so inclined), are you able to help with this?
Here is the trace:
--> VioScsiFindAdapter.
--> VioScsiWmiInitialize.
InitHW.
MessageControl.TableSize = 6
MessageControl.FunctionMask = 0
MessageControl.MSIXEnable = 1
MessageTable = 1
PBATable = 2049
CapabilityID = 9, Next CapOffset = 84
CapabilityID = 9, Next CapOffset = 70
CapabilityID = 9, Next CapOffset = 60
CapabilityID = 9, Next CapOffset = 50
CapabilityID = 9, Next CapOffset = 40
msix_enabled = 1
GetScsiConfig.
seg_max 254
num_queues 4
max_sectors 65535
cmd_per_lun 128
event_info_size 16
sense_size 96
cdb_size 32
max_channel 0
max_target 255
max_lun 16383
SetGuestFeatures.
The VM has only one disk attached via SCSI, it is a raw disk and on the host side it is an LVM Logical Volume (/dev/dm-13) on a Volume Group on /dev/nvme4n1. See here for the {max_hw_sectors_kb,max_segments}.
Regarding the max_sectors=65535 -- could this simply be the default value set by QEMU here?
Hope this helps. Happy to provide more data if needed.
Thanks @frwbr ! That's very helpful.
So I've had trouble reproducing faults with diskspd.exe test files containing random data. The faults do occur with zeroed test files, which I find quite interesting...
This is quite interesting. I haven't run the reproducer with all-zero test files so far, only with random files. Do I understand correctly these faults are not fixed by the notifications patch?
The patch works. When using an unpatched version, I found it more difficult to produce the fault with random data, so perhaps YMMV. Maybe try with test files prepared with diskspd.exe -w100 -Zr -d30 <testfile>. I used this with a 2GiB test file. If your file is larger you might need a longer run than 30s.
Regarding the max_sectors=65535 -- could this simply be the default value set by QEMU here?
It is, and seg_max is right above it (via seg_max_adjust), which you will see chooses between the legacy behaviour of 126 (128 - 2) and the more recent 254 (256 - 2), per QEMU virtio_scsi_get_config. The change notes point out that this relates to queue depth and thus the number of request queues, i.e. num_queues = num_cpus, and so MAX_CPU - VIRTIO_SCSI_REQUEST_QUEUE_0 (256 - 2).
This is mostly performance related: setting the correct NOPB (NumberOfPhysicalBreaks) is important to performance, and correcting the shortcomings here is done via custom QEMU mods or the PhysicalBreaks registry setting. This is because NOPB should always be ~~seg_max - 1~~ seg_max + 1, and MaximumTransferLength should be seg_max * PAGE_SIZE (EDIT: ~~or NOPB * PAGE_SIZE~~).
The following table should be insightful:
| Sectors | × 512 = Bytes | ÷ 2^10 = KiB | ÷ 2^10 = MiB | Bytes ÷ 4096 = Segments | Note |
|---|---|---|---|---|---|
| 65536 (0x10000h) | 33554432 | 32768 | 32 | 8192 | *** OUTSIDE BOUNDS *** |
| 65528 (0x0FFF8h) | 33550336 | 32764 | 31.996 | 8191 | qemu hint (scsi_config.max_sectors) limit (last segment before 0xFFFF) |
| 49152 (0x0C000h) | 25165824 | 24576 | 24 | 6144 | |
| 32768 (0x08000h) | 16777216 | 16384 | 16 | 4096 | |
| 16384 (0x04000h) | 8388608 | 8192 | 8 | 2048 | |
| 8192 (0x02000h) | 4194304 | 4096 | 4 | 1024 | |
| 4096 (0x01000h) | 2097152 | 2048 | 2 | 512 | *** MAX_PHYS_SEGMENTS LIMIT *** |
| 2048 (0x00800h) | 1048576 | 1024 | 1 | 256 | |
| 2032 (0x007F0h) | 1040384 | 1016 | | 254 | qemu scsi_config.seg_max limit (max_cpu - 2) |
| 1344 (0x00540h) | 688128 | 672 | | 168 | |
| 1024 (0x00400h) | 524288 | 512 | | 128 | |
| 512 (0x00200h) | 262144 | 256 | | 64 | |
| 480 (0x001E0h) | 245760 | 240 | | 60 | |
| 448 (0x001C0h) | 229376 | 224 | | 56 | |
| 264 (0x00108h) | 135168 | 132 | | 33 | |
| 256 (0x00100h) | 131072 | 128 | | 32 | |
In your NVMe scenario, you should have aligned values of seg_max = 32, NOPB = ~~31~~ ~~32~~ 33 and MaximumTransferLength = 128KiB.
EDIT: This cannot be seg_max = 33 or your MaximumTransferLength will be 132KiB and exceed your max_hw_sectors_kb of 128KiB.
So I've worked on this a bit and I have a new solution, in part it looks like this:
EDIT: The NOPB calculations in this are effectively off-by-one (-1). They should be one page higher, i.e. NOPB = seg_max+1 and not NOPB = seg_max or even NOPB = seg_max-1... I will repost below once corrected.
ULONG nopb_candidate[3] = { 0 };
ULONG max_segments_b4_alignment;
//...
if (!adaptExt->dump_mode) {
/* Allow user to override max_physical_breaks via reg key
* [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
* "PhysicalBreaks"={dword value here}
*
* *** This should be VIRTIO_MAX_SG - 1, approximated by the maximum number of memory pages (typ. 4KiB each) - 1 ***
*/
if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_MAX_PH_BREAKS, FIELD_OFFSET(ADAPTER_EXTENSION, max_physical_breaks))) {
/* We +1 to convert to segments from NOPB */
adaptExt->max_segments = adaptExt->max_physical_breaks + 1;
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate was specified in the registry : %lu | max_segments : %lu \n",
adaptExt->max_physical_breaks, adaptExt->max_segments);
#endif
} else {
/* Grab the VirtIO reported maximum SEGMENTS value from the HBA and put it somewhere mutable */
adaptExt->max_segments = adaptExt->scsi_config.seg_max;
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate was NOT specified in the registry. We will attempt to derive the value...\n");
#endif
}
/* Use our maximum SEGMENTS value OR use PHYS_SEGMENTS... */
nopb_candidate[1] = (adaptExt->indirect) ? (adaptExt->max_segments - 1) : (PHYS_SEGMENTS - 1);
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate derived from MAX SEGMENTS : %lu \n", nopb_candidate[1]);
#endif
/* Grab the VirtIO reported maximum SECTORS value from the HBA to start with */
nopb_candidate[2] = (adaptExt->scsi_config.max_sectors * SECTOR_SIZE / PAGE_SIZE) - 1;
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate derived from MAX SECTORS (QEMU/KVM hint) : %lu \n", nopb_candidate[2]);
#endif
/* Start with a comparison of equality */
if (nopb_candidate[1] == nopb_candidate[2]) {
nopb_candidate[0] = nopb_candidate[1];
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the candidates were the same value : %lu \n", nopb_candidate[0]);
#endif
} else if (nopb_candidate[2] > 0 && nopb_candidate[2] < (MAX_PHYS_SEGMENTS - 1)) {
nopb_candidate[0] = nopb_candidate[2];
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the QEMU/KVM hint method (scsi_config.max_sectors) was used to select the candidate : %lu \n", nopb_candidate[0]);
#endif
} else {
/* Take the smallest candidate */
nopb_candidate[0] = min((nopb_candidate[1]), (nopb_candidate[2]));
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the smallest candidate was selected : %lu \n", nopb_candidate[0]);
#endif
}
/* Check the value is within SG list bounds */
nopb_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, nopb_candidate[0]), (VIRTIO_MAX_SG - 1));
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : within SG list bounds : %lu\n", nopb_candidate[0]);
#endif
/* Check the value is within physical bounds */
nopb_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, nopb_candidate[0]), (MAX_PHYS_SEGMENTS - 1));
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : within physical bounds : %lu\n", nopb_candidate[0]);
#endif
/* Update max_segments for all cases */
adaptExt->max_segments = nopb_candidate[0] + 1;
max_segments_b4_alignment = adaptExt->max_segments;
/* Do byte alignment (using integer division) if necessary */
if (max_segments_b4_alignment > (PAGE_SIZE / SECTOR_SIZE)) {
adaptExt->max_physical_breaks = (((max_segments_b4_alignment / (PAGE_SIZE / SECTOR_SIZE)) * (PAGE_SIZE / SECTOR_SIZE)) - 1);
if (max_segments_b4_alignment != (adaptExt->max_physical_breaks + 1)) {
adaptExt->max_segments = adaptExt->max_physical_breaks + 1;
}
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE,
" Sector byte alignment : SECTOR_SIZE = %lu Bytes, PAGE_SIZE = %lu KiB, max_segments : original = %lu, aligned = %lu, max_physical_breaks : original = %lu, aligned = %lu \n",
SECTOR_SIZE, (PAGE_SIZE / 1024), max_segments_b4_alignment, adaptExt->max_segments, nopb_candidate[0], adaptExt->max_physical_breaks);
#endif
}
}
ConfigInfo->NumberOfPhysicalBreaks = adaptExt->max_physical_breaks;
/* MaximumTransferLength should be calculated from segments not breaks... */
ConfigInfo->MaximumTransferLength = adaptExt->max_segments * PAGE_SIZE;
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_INFORMATION, " NumberOfSegments : %lu | NumberOfPhysicalBreaks : %lu | MaximumTransferLength : %lu Bytes (%lu KiB) \n",
(ConfigInfo->NumberOfPhysicalBreaks + 1),
ConfigInfo->NumberOfPhysicalBreaks,
ConfigInfo->MaximumTransferLength,
(ConfigInfo->MaximumTransferLength / 1024));
#endif
This is only part of the solution, ~~as there is a cascading off-by-one error in VioScsiFindAdapter that required fixing~~, but this solution will automatically set seg_max, NOPB and MaximumTransferLength to best performance settings based on the PhysicalBreaks registry entry.
The problem then becomes that even when enabling per-HBA registry settings by modding VioScsiReadRegistryParameter to use a per-HBA scope, all HBAs will still look in the same registry key (Parameters\Device-1). This becomes a performance issue when you have a mix of physical disks with different requirements, e.g. look at my dev box:
- The .../queue/max_segments:168, .../queue/max_hw_sectors_kb:32767 devices are SATA3 HDDs
- The .../queue/max_segments:60, .../queue/max_hw_sectors_kb:128 devices are MegaRAID LUNs
- The .../queue/max_segments:33, .../queue/max_hw_sectors_kb:128 devices are NVMe SSDs
I can only set NOPB for the lowest max_hw_sectors. However, the MegaRAID LUNs perform best with seg_max = 60, NOPB = 59 and MaximumTransferLength = 240KiB, and so on... Note: To be clear, on execution this particular example is changed to align to seg_max = 56, NOPB = 55 and MaximumTransferLength = 224KiB.
EDIT: This alignment is unnecessary when NOPB is set correctly.
I am presently working on this but it might be a fundamental architectural barrier. I have a few suspects, like they all share the same hba_id and port, but it's early days...
@vrozenfe, can you provide a hint as to how to get each HBA instance to act separately, with its own memory allocation and NOPB settings...?
I also have fixes for memory allocation, including for max_queues to cater for CPU hotplug. I added a registry setting to enable this so the default is to retain the existing behaviour (only allocate for present CPUs). Otherwise, it's just the previously mentioned notification fix in SendSRB, some performance tuning, some DPC fixes, lots of DPC refactoring and tons of instrumentation.
FYI there is a RH Jira issue reported internally https://issues.redhat.com/browse/RHEL-56722 and which I'm trying to fix soon.
@vrozenfe
Per my comments in my last post above: could the fact that using a per-HBA scope in VioScsiReadRegistryParameter (StorPortRegistryRead) requires a reg key of Parameters\Device-1 be a hint as to what is perhaps going on here?
Normally these would be Parameters\Device0, Parameters\Device1, and so on... We seem to have -1 instead...
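For reference, my understanding of the two registry scopes involved, shown as a hypothetical wrapper (not the actual VioScsiReadRegistryParameter; the MINIPORT_REG_DWORD type and the Global semantics are per storport.h as I recall, and return/error handling is omitted):

```c
#include <storport.h>

/* Hypothetical helper: Global = 1 reads the driver-wide key
 *   HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device
 * while Global = 0 reads the per-adapter key
 *   ...\vioscsi\Parameters\Device<N>
 * where <N> is the device number assigned by StorPort - the number that
 * appears to resolve to -1 in the observations above. */
static VOID ReadDwordSketch(PVOID DeviceExtension, PUCHAR ValueName, BOOLEAN PerAdapter, PULONG Value)
{
    ULONG len = sizeof(ULONG);

    /* Success/error handling omitted in this sketch. */
    StorPortRegistryRead(DeviceExtension,
                         ValueName,
                         PerAdapter ? 0 : 1,   /* Global flag */
                         MINIPORT_REG_DWORD,
                         (PUCHAR)Value,
                         &len);
}
```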
Thanks for the hints in your last post here. My observations:
In any case, to make the STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO code work, we need to have some per-CPU list to keep SRBs, extend the StartIo routine a bit, and make sure that we complete those SRBs in the per-CPU queues in case of bus/lun reset.
From what I could tell, this seems to be already taken care of by STOR_PERF_DPC_REDIRECTION_CURRENT_CPU and CompletePendingRequests (which I renamed to CompletePendingRequestsOnReset). It appeared to work well in my tests...
Regarding Dma32BitAddresses: on a real hardware HBA it should be a must to let the DMA engine work properly. I don't know if enabling or disabling this bit will have any implication for qemu virtio - I never tried to turn it off - but I know for sure that some HW vendors run this code on top of FPGA-implemented virtio adaptors, and this is the reason why we will keep it turned on :)
I found enabling this manually (StorPort does it automagically) resulted in a loss of performance.
I came up with this:
/* NOTE: When unset we get -5k IOPS +30us latency (best case)...! */
ConfigInfo->Master = TRUE; // +7k IOPS -50us latency
ConfigInfo->ScatterGather = TRUE; // +12k IOPS -75us latency
/* Allow user to force use of Dma32BitAddresses via reg key
* [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
* "UseDma32BitAddresses"={any dword value here - the value is ignored}
*
* WARNING: Manually setting this increases latency and reduces IOPS
*
* NOTE: StorPort normally sets this to TRUE anyway.
* So let StorPort do it for maximum performance.
* Only provided in the event StorPort does not enable the feature and it is required.
*/
if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_USE_DMA32BITADDRESSES, FIELD_OFFSET(ADAPTER_EXTENSION, use_Dma32BitAddresses))) {
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma32BitAddresses to TRUE. \n");
#endif
ConfigInfo->Dma32BitAddresses = TRUE; // -15k IOPS +100us latency (worst case)
} else {
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was NOT FOUND in the registry. We will let StorPort manage the Dma32BitAddresses setting. \n");
#endif
}
/*
* WARNING: Do not set this.
* All of these options increase latency and reduce IOPS:
*
ConfigInfo->DmaWidth = 0; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth = Width32Bits; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth = Width64Bits; //<-- This should be zero as initialised by StorPort
*/
The #if !defined(RUN_UNCHECKED) conditionals strip all tracing... As should be expected, running without tracing produces slightly better performance than even the driver included in the v208 package.
I also implemented some tracing with #if !defined(RUN_UNCHECKED) || defined(RUN_MIN_CHECKED), mainly for non-verbose tracing in initialisation routines. This tracing has no observable effect on performance.
I'm not sure what this would mean for packaging, but the performance gains of the above are not negligible. I've seen improvements with 4KiB random I/O of up to 50k IOPS and latency reduced by up to 250us.
In any case, my plan is to sort out the per HBA stuff (one way or the other), then I'll drop the whole lot in a branch of my fork to split into new branches and make PRs from. In the interim, I hope to split the notification fix in SendSRB and some perhaps consequential DPC fixes into two separate PRs - probably by week's end. Would that work for you?
@benyamin-codez If I understand your question about "get each HBA instance to act separately" correctly, then adjusting the transfer length according to the block limits reported by the underlying scsi device for the scsi pass-through case, as mentioned in https://issues.redhat.com/browse/RHEL-56722, can probably fix some of the problems.
Best regards, Vadim.
@vrozenfe
FYI there is a RH Jira issue reported internally https://issues.redhat.com/browse/RHEL-56722 and which I'm trying to fix soon.
~~There's definitely an off-by-one cascading error in VioScsiFindAdapter.
The problem is likely because NOPB is set to 64 not 63...~~
EDIT: The problem is likely because NOPB is not set correctly.
Or by default set to 254 which is not aligned...
EDIT: Alignment is not an issue when NOPB is set correctly.
To not rely on the PhysicalBreaks registry entry will require QEMU changes. Methinks once vioscsi is corrected, a PR for QEMU would be the next logical step.
By the way, I found the Parameters\Device-1 by issuing a StorPortRegistryWrite(). I then spent several hours trying to get REG_SZ working to drop HBA location information to map to Device Manager, so the correct PhysicalBreaks registry entries could be made more easily. I tried ASCII, ANSI and Unicode. None of them worked. REG_NONE and REG_MULTI_SZ worked unreliably with some garbage thrown in from conversions. In the end I went back to REG_DWORD and just dropped adaptExt->slot_number in it after InitHW() had returned.
@vrozenfe
@benyamin-codez If I understand your question about "get each HBA instance to act separately" correctly, then adjusting the transfer length according to the block limits reported by the underlying scsi device for the scsi pass-through case, as mentioned in https://issues.redhat.com/browse/RHEL-56722, can probably fix some of the problems.
Apologies for the cross posting... 8^d
That's not really what I meant.
Per my last posts:
...how to get each HBA instance to act separately, with its own memory allocation and NOPB settings...?
Could the fact that using a per-HBA scope in VioScsiReadRegistryParameter (StorPortRegistryRead) requires a reg key of Parameters\Device-1 be a hint as to what is perhaps going on here?
Normally these would be Parameters\Device0, Parameters\Device1, and so on... We seem to have -1 instead...
I thought it might help if I posted a preview of the tracing highlighting the NOPB section.
I should note this is new behaviour from my current WIP build.
Here is the first page:
Here is a page showing the relevant NOPB section:
Here is the next page:
Here is a page showing what happens without setting the PhysicalBreaks registry setting, i.e. the default NOPB of 247 (starting at 253):
Here is another showing NOPB set to 255 via PhysicalBreaks:
Here is another showing NOPB set to 511 via PhysicalBreaks:
Here are some quick CrystalDiskMark stats:
I'll try to drop my WIP build in the next few hours, then some split PRs to follow.
EDIT: These NOPB values should all be ~~one page~~ two pages more. I will correct, retest and report below.
Here is the v208 driver for comparison. Subsequent versions have the same behaviour.
EDIT:
The NOPB figures to the right of the >> are in hexadecimal, so 513 = 0x201h and 33 = 0x21h, i.e. (correctly) off-by-one.
The queue depth here is 256 (0x100h), i.e. MAX_CPU, which is the same in most versions.
Had to do a few fixes for single CPU configuration and found another DPC fix in my travels... Will try to get my WIP build posted shortly... 8^d
I dropped my WIP build if anyone wants to have a look: https://github.com/benyamin-codez/kvm-guest-drivers-windows/commit/a095c2da78a0711c2c71bf83d70ef5a0593f335b
Splits to follow...
@vrozenfe
If possible it will be nice to compare IOs/throughput numbers in relation to the number of queues, queue depth, IO size, CPU usage and IO latency for the old and new code. (However, if there is any problem with that I will try to test it by myself).
wrt my stat collector scripts (which should provide all you asked for above), where would be best to drop them in the tree?
Under vioscsi\stat-collector or similar, or maybe under ..\Tools\stat-collector...?
I also have some blkdebug tooling to perform block I/O error injection, per this and this (which the former refers to). Would you like me to drop that somewhere (else) too? It's pretty ugly, but I can clean it up.
@vrozenfe
Considering further where a good spot in the tree might be, perhaps one of the following would work better:
vioscsi\tests
..\Tools\vioblk\tests
The script can then be called stat-collector.bat or similar, with room for future scripts like stat-aggregator.pl, etc.
This might also be a good location for any blkdebug or similar kit too.
@vrozenfe
My bad on the rename. Maybe kill those checks.... 8^d
A PR for the SendSRB notification fix has been raised: #1150
A PR for two DPC fixes has been raised: #1152