kvm-guest-drivers-windows
[viostor] Fix performance issue (regression)
- Added missing juice in PR #487 (f8c904c), recommended for SMP with CPU affinity, plus an increase in `MAX_PHYS_SEGMENTS` (8x) and MaxXfer size.
Other performance enhancements are available, but this commit should resolve issue #992.
@vrozenfe @YanVugenfirer
If you can take a moment, please also refer to my WIP...
On 4K clusters I had the following results:
| Target | SEQ1M Q8T1 (MiBps) | RND4K Q32T16 (MiBps) | RND4K Q32T16 (IOPS) | RND4K Q32T16 (μs) |
|---|---|---|---|---|
| Baseline (v266) | 7174.51 | 959.80 | 234325.44 | 2180.13 |
| This PR | 7187.66 | 1257.28 | 306952.15 | 1649.53 |
| WIP | 7157.31 | 1465.76 | 357850.59 | 1427.34 |
... so there are additional optimisations available.
I suspect a NOPB off-by-one regression: when max_seg = 254 (the default), "breaks" are recorded as 0xFE (254) rather than 0xFF (255). This will impact the calculation of MaxXferLength too.
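For illustration, a minimal user-mode sketch of the suspected relationship (the MaxXferLength derivation shown is my reading of the mechanics, not the driver's exact code):

```c
#include <stdio.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096u /* illustrative guest page size */

int main(void)
{
    uint32_t max_seg = 254;               /* virtio default data segments    */
    uint32_t nopb_observed = max_seg;     /* 0xFE - the suspected off-by-one */
    uint32_t nopb_expected = max_seg + 1; /* 0xFF - NOPB = max_seg + 1       */

    /* If MaxXferLength is derived back from NOPB (segments = NOPB - 1,
     * one guest page per segment), the off-by-one costs a whole page: */
    printf("observed: NOPB=0x%02X -> %u KiB\n", (unsigned)nopb_observed,
           (unsigned)((nopb_observed - 1) * GUEST_PAGE_SIZE / 1024u));
    printf("expected: NOPB=0x%02X -> %u KiB\n", (unsigned)nopb_expected,
           (unsigned)((nopb_expected - 1) * GUEST_PAGE_SIZE / 1024u));
    return 0;
}
```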
Memory allocation alignment could be a culprit too, but I did not extensively test this...
Would you like some more PRs...? Any features in particular, e.g. is the setting of max_segments via the registry of any benefit here?
Top-notch work and debugging as always @benyamin-codez - love to see this work, keep up the good stuff!
On the WIP branch gains, is there a specific area of work that is more impactful than another? From the commit msg, there were quite a few things stacked together. All looked solid, so I think at a project level, we'd welcome it!
@JonKohler @vrozenfe @YanVugenfirer
> Top-notch work and debugging as always @benyamin-codez - love to see this work, keep up the good stuff!
Thanks Jon. Credit also to @york-king for their work in submitting issue #992. Reducing the point-in-time scope to between builds 187 and 189 was most helpful. From there, some tag translation followed by a compare revealed f8c904c as the only viostor commit to peek into. 8^d
> On the WIP branch gains, is there a specific area of work that is more impactful than another? From the commit msg, there were quite a few things stacked together. All looked solid, so I think at a project level, we'd welcome it!
We should first consider that if having MaxXferLength = 2MiB (the upper limit) is important, then we need to set `max_segments` to 512 (`MAX_PHYS_SEGMENTS`). However, the virtio limit is (presently) 254, so we need a mechanism to exceed that, e.g. via a registry value or some other guest feature.
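To sketch where those numbers would land (StorPort fields per PORT_CONFIGURATION_INFORMATION; the registry override variable is hypothetical, not existing code):

```c
/* Hypothetical fragment: how max_segments would flow into the StorPort
 * limits, assuming some override mechanism exists (registryMaxSegments
 * is illustrative only). */
ULONG max_segments = 254;                                       /* current virtio limit */
if (registryMaxSegments != 0) {
    max_segments = min(registryMaxSegments, MAX_PHYS_SEGMENTS); /* cap at 512           */
}
ConfigInfo->NumberOfPhysicalBreaks = max_segments + 1;          /* 255 ... 513          */
ConfigInfo->MaximumTransferLength = max_segments * PAGE_SIZE;   /* ~1MiB ... 2MiB       */
```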
Therefore, thinking in order of importance, the splits would probably be:
- NOPB off-by-one regression
- Memory alignment (otherwise StorPort won't align pages in VA)
- Instrumentation upgrades by function.
Preliminary enabling PRs would be required for definitions and helpers, e.g. RegistryRead, etc.
Design questions would be:
- If the RegistryRead for parameters is in scope, should the per-HBA functionality be implemented as-is? It is functionally superior but makes for a very noisy log (albeit brief) - even for the likes of me. 8^d It would be best to fix the StorPort native `\Parameters\Device(d)` option, but I haven't been able to get this working. The limitation here is the HwStorPortProhibitedDDIs rule-set, which rules out direct registry manipulation, otherwise we could mimic the required behaviour. (A rough sketch of the read I have in mind follows this list.)
- Should we use a registry parameter for enabling `STOR_PERF_DPC_REDIRECTION_CURRENT_CPU` instead of relying on the `NTDDI_VERSION` at compile-time? We could then add an entry to `viostor.inf` in order to enable it by default for certain configurations. I note this was a marginal optimisation, which I only observed on Windows 11 24H2.
- `VIRTIO_MAX_SG` size. My limited testing didn't need more than `(MAX_PHYS_SEGMENTS + 1)`, but is there a specific scenario which still requires `(3 + MAX_PHYS_SEGMENTS)`...?
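As promised above, the per-HBA read sketched roughly (the value name and helper are illustrative, not the WIP's actual code):

```c
/* A rough sketch only: a per-adapter DWORD parameter read via the DDI.
 * Global = 0 is meant to use the adapter-relative \Parameters\Device(d)
 * key - the part I have not been able to get working. */
static BOOLEAN ReadDwordParameter(PVOID DeviceExtension, PUCHAR ValueName, PULONG Value)
{
    ULONG data = 0;
    ULONG length = sizeof(data);

    if (!StorPortRegistryRead(DeviceExtension,
                              ValueName,
                              0,                   /* adapter-relative key */
                              MINIPORT_REG_DWORD,
                              (PUCHAR)&data,
                              &length) ||
        length < sizeof(data))
    {
        return FALSE;                              /* caller keeps defaults */
    }

    *Value = data;
    return TRUE;
}
```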
Any thoughts?
@YanVugenfirer Please wait for Win11 CI too
> We should first consider that if having MaxXferLength = 2MiB (the upper limit) is important, then we need to set `max_segments` to 512 (`MAX_PHYS_SEGMENTS`). However, the virtio limit is (presently) 254, so we need a mechanism to exceed that, e.g. via a registry value or some other guest feature.
We ran into these shenanigans as well in our product, as there is no virtio way to "advertise" what the backend supports so the frontend can "just do it". There is a proposed spec change floating out there that needs a champion. I was going to pick it up, but then my mother passed away, and that just wrecked me for a very long time; I haven't been able to get back to it.
Hint hint nudge nudge, if you want to pick it up, I'd back you 100% ! - https://github.com/oasis-tcs/virtio-spec/issues/122
What is interesting is that this seems to be a Windows-specific nuance, as Linux doesn't have this issue :(
> Therefore, thinking in order of importance, the splits would probably be:
> - NOPB off-by-one regression
> - Memory alignment (otherwise StorPort won't align pages in VA)
> - Instrumentation upgrades by function.
Fabulouso - sounds good to me boss. I know the NOPB issue I fixed a while back made a non-trivial speedup for large block IO, so it'd be good to "fix that" at the project level in all areas it is busted (perhaps this is the last one).
On memory alignment, is that just a viostor issue? or would that apply to vioscsi as well? How do you see that, just manual tracing? or is there some other thing?
Either way, if we've got pages unaligned, I suspect that's no bueno, so hopefully that's a simple targeted fix
> Design questions would be:
> - If the RegistryRead for parameters is in scope, should the per-HBA functionality be implemented as-is? It is functionally superior but makes for a very noisy log (albeit brief) - even for the likes of me. 8^d It would be best to fix the StorPort native `\Parameters\Device(d)` option, but I haven't been able to get this working. The limitation here is the HwStorPortProhibitedDDIs rule-set, which rules out direct registry manipulation, otherwise we could mimic the required behaviour.
Shamelessly, for me, a single parameter is fine (i.e. single-HBA / per-host sort of functionality) because we run single-HBA configurations for simplicity, though I suspect this would be a nice fit-n-finish to be able to target things more purposefully.
That said, like my comment below on per-OS compile-time stuff, I'm always in favor of anything that can "just work" and not make the user A) think too hard or B) have too many knobs on their plate and have to be an internals-focused sort of person to get it right.
TL;DR, it's a toss-up for me; I'd put that dead last on the list of things here.
> Should we use a registry parameter for enabling `STOR_PERF_DPC_REDIRECTION_CURRENT_CPU` instead of relying on the `NTDDI_VERSION` at compile-time? We could then add an entry to `viostor.inf` in order to enable it by default for certain configurations. I note this was a marginal optimisation, which I only observed on Windows 11 24H2.
It's nice to have a bit to be able to turn off, but I suspect if we make an "opt-in" bit, it would either get A) misused, B) misunderstood, and/or C) never used.
It's always nice when a driver is smart enough to "turn on the right stuff" based on the OS/running environment and/or things that it sees from the "host", so if we can safely turn it on at Compile, absolutely fine by me.
Also, while my mind is on the thought - do we do this for vioscsi already? IIRC no, but perhaps my brain is cooked at the end of the week.
> `VIRTIO_MAX_SG` size. My limited testing didn't need more than `(MAX_PHYS_SEGMENTS + 1)`, but is there a specific scenario which still requires `(3 + MAX_PHYS_SEGMENTS)`...?
Not that I'm aware of, but perhaps Vadim et al have some more thoughts on that.
@JonKohler
Thanks for your feedback, Jon. That was quite helpful. Please also accept my condolences regarding your mother.
As mentioned in the OP of the spec PR you referenced:
> QEMU tolerates as well if guests drivers exceed the Queue Size. However currently it has a hard coded limit of max. 1024 memory segments [as] `#define VIRTQUEUE_MAX_SIZE 1024`, [w]hich limits the max. bulk data size per virtio message to slightly below 4MB.
We could try going to the maximum, which would be:
`1024 - 2 = 1022 ==> 1022 * PAGE_SIZE = 4,088KiB MaxXferLen, StorPort_NOPB=1023`
Would that be of any value to you?
I note I have only tested to 512 segments (2MiB MaxXferLen, StorPort_NOPB=513).
iirc, each virtqueue presents as an 8KiB buffer, 4KiB in each split ring.
The maximum queue size presently permitted by SeaBIOS is 256 (= MAX_CPU, where 1 vq <=> 1 vCPU). This disparity between queue size and max_segments is one reason I am interested in STOR_PERF_DPC_REDIRECTION_CURRENT_CPU.
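For reference, the raw split-ring component sizes at that queue size, per the virtio 1.x layout (how these get padded and placed determines the final per-queue footprint):

```c
#include <stdio.h>
#include <stdint.h>

/* Raw virtio 1.x split-ring component sizes for one virtqueue. */
int main(void)
{
    uint32_t qsz = 256;           /* SeaBIOS maximum queue size       */
    uint32_t desc = 16 * qsz;     /* descriptor table: 16 B per entry */
    uint32_t avail = 6 + 2 * qsz; /* driver (avail) ring              */
    uint32_t used = 6 + 8 * qsz;  /* device (used) ring               */

    printf("desc=%u B, avail=%u B, used=%u B, raw total=%u B\n",
           desc, avail, used, desc + avail + used);
    return 0;
}
```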
So, I had to quadruple check my results, but on closer scrutiny there is quite an odd observation to be made.
| Target | Flavour | SEQ1M Q8T1 (MiBps) | RND4K Q32T16 (MiBps) | RND4K Q32T16 (IOPS) | RND4K Q32T16 (μs) |
|---|---|---|---|---|---|
| Win10 | v266 | 7179 | 1071 | 201,459 | 1956 |
| Win10 | This PR | 7189 | 1547 | 377,795 | 1352 |
| Win10 | WIP -* | 7179 | 1543 | 349,766 | 1356 |
| Win10 | WIP +* | 7180 | 1565 | 382,159 | 1338 |
| Win10 | WIP +VB | 7190 | 1568 | 382,748 | 1335 |
| Win11 | v266 | 7160 | 1038 | 253,388 | 2017 |
| Win11 | This PR | 7165 | 1041 | 254,155 | 2011 |
| Win11 | WIP -* | 7148 | 1520 | 371,015 | 1375 |
| Win11 | WIP +* | 7147 | 1513 | 369,380 | 1384 |
| Win11 | WIP +CO | 7157 | 1511 | 368,854 | 1385 |
| Win11 | WIP +GE | 7143 | 1510 | 368,697 | 1387 |
[*] +/- STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_THRESHOLD, 4KiB clusters
[VB] + STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_10_VB, 4KiB clusters
[CO] + STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_10_CO, 4KiB clusters
[GE] + STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, NTDDI_VERSION=WIN_11_GE, 4KiB clusters
You will see that in Win10 there is very little difference between this PR and the WIP, but in Win11 this PR barely makes a difference while the WIP does, so that will inform the next cut.
Also, the gains for STOR_PERF_DPC_REDIRECTION_CURRENT_CPU, which I previously only observed on Windows 11 24H2, appeared to swap for NTDDI_WIN_THRESHOLD, with a marginal improvement shifting from Win11 to Win10. Personally, I think it worthwhile enabling this performance option so that DPC redirection can target the vCPU requesting the DPC rather than the vCPU that originated the IO request with StorPort.
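For reference, roughly how that opt-in looks via StorPortInitializePerfOpts (a simplified sketch; the NTDDI gate shown is only illustrative of the current compile-time approach):

```c
/* Simplified sketch of enabling DPC redirection to the current CPU;
 * real code would negotiate other perf options too. */
PERF_CONFIGURATION_DATA perfData = { 0 };
ULONG status;

perfData.Version = STOR_PERF_VERSION;
perfData.Size = sizeof(PERF_CONFIGURATION_DATA);

/* First query what StorPort supports on this OS... */
status = StorPortInitializePerfOpts(DeviceExtension, TRUE, &perfData);
if ((status == STOR_STATUS_SUCCESS) && (perfData.Flags & STOR_PERF_DPC_REDIRECTION))
{
    ULONG flags = STOR_PERF_DPC_REDIRECTION;
#if (NTDDI_VERSION >= NTDDI_WIN10_VB) /* the compile-time gate in question */
    if (perfData.Flags & STOR_PERF_DPC_REDIRECTION_CURRENT_CPU)
    {
        flags |= STOR_PERF_DPC_REDIRECTION_CURRENT_CPU;
    }
#endif
    /* ...then apply the subset we want. */
    perfData.Flags = flags;
    status = StorPortInitializePerfOpts(DeviceExtension, FALSE, &perfData);
}
```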
> On memory alignment, is that just a viostor issue? or would that apply to vioscsi as well? How do you see that, just manual tracing? or is there some other thing?
No, I back-ported it from my vioscsi WIP.
The uncachedExtensionVa and pageAllocationVa will not align unless you add uncachedExtPad2 (this is padding2 in the art). Then StorPort will return a page-aligned VA. Otherwise, StorPort adds a whole page to the allocation to ensure it passes the page boundary whilst still providing enough memory. The VA is then rounded up to a PAGE_SIZE boundary by the bitwise mask, resulting in padding1 being > 0, via:
```c
adaptExt->pageAllocationVa = (PVOID)(((ULONG_PTR)(uncachedExtensionVa) + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1));
```
The alternative would be to use `adaptExt->pageAllocationVa = (PVOID)((ULONG_PTR)(uncachedExtensionVa))` or similar, but the above is more efficient because a properly padded request will return a smaller allocation.
I only noticed this was an issue because the trace was skewing the numbers when doing integer division (scaling to KiB), so accurate tracing is another benefit of these mechanics.
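In other words, the mechanics are roughly as follows (sizes and names are illustrative of the approach, not the exact WIP code):

```c
/* Request enough slack ("uncachedExtPad2") that a page-aligned sub-range
 * is guaranteed to exist inside whatever VA StorPort returns. */
ULONG bytesNeeded = uncachedExtensionSize        /* leading, unaligned data */
                  + pageAllocationSize           /* page-aligned payload    */
                  + (PAGE_SIZE - 1);             /* uncachedExtPad2         */

PVOID uncachedExtensionVa =
    StorPortGetUncachedExtension(DeviceExtension, ConfigInfo, bytesNeeded);

if (uncachedExtensionVa != NULL)
{
    /* Round up to the next page boundary; "padding1" is the gap between
     * the returned VA and this aligned VA (zero when already aligned). */
    adaptExt->pageAllocationVa =
        (PVOID)(((ULONG_PTR)uncachedExtensionVa + (PAGE_SIZE - 1)) &
                ~((ULONG_PTR)PAGE_SIZE - 1));
}
```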
Thanks for your insights re registry parameters. This was another vioscsi WIP back-port. I think I will use it again here to extend max_segments as discussed above. We presently do not do any per-HBA reads; the PR for vioscsi is #1216. This permits disks with different backings to have HBAs with different max_segments values, which is important for certain pass-through devices.
Regarding the spec work, no guarantees, but I'll put it on my list. I do not anticipate being in a position to pick it up for some time.
Best regards, Ben
> You will see that in Win10 there is very little difference between this PR and the WIP, but in Win11 this PR barely makes a difference while the WIP does, so that will inform the next cut.
Ok, so I'm going to raise a few PRs so that Win11 can benefit from this one too.
My guess is that this anomaly is why the regression made it through, as the performance degradation may not have been evident on Win11. @vrozenfe, perhaps QE would consider performance checking both targets in future...? I still have some diskspd.exe-based data collection scripts I could resurrect if we lack tooling. I haven't yet written the reporting end to parse the diskspd.exe output (I haven't quite been annoyed enough yet with trite manual data collation)... 8^d
Added the following PRs:
- #1297
- #1298 (waiting on #1297)
- #1299 (waiting on #1298)
- #1300 (waiting on #1298)
- #1301
- #1302 (waiting on #1297)
This should fix Win11 too.
@JonKohler
> We could try going to the maximum, which would be:
> `1024 - 2 = 1022 ==> 1022 * PAGE_SIZE = 4,088KiB MaxXferLen, StorPort_NOPB=1023`
> Would that be of any value to you?
Just wanting to check if you were interested in this...
tbh, I'm not convinced that the `num_queues - 2` applies to viostor... ...as it doesn't supply CONTROL and EVENT queues, or MSI-X vectors for the same for that matter.
It could be: `VIRTQUEUE_MAX_SIZE = 1024 ==> 1024 * PAGE_SIZE = 4,096KiB MaxXferLen, StorPort_NOPB=1025`
...but that is presently untested...
Let me know if you want me to run it up.
Best regards, Ben
@JonKohler
> We could try going to the maximum, which would be:
> `1024 - 2 = 1022 ==> 1022 * PAGE_SIZE = 4,088KiB MaxXferLen, StorPort_NOPB=1023`
> Would that be of any value to you?
> Just wanting to check if you were interested in this...
> tbh, I'm not convinced that the `num_queues - 2` applies to viostor... ...as it doesn't supply CONTROL and EVENT queues, or MSI-X vectors for the same for that matter. It could be: `VIRTQUEUE_MAX_SIZE = 1024 ==> 1024 * PAGE_SIZE = 4,096KiB MaxXferLen, StorPort_NOPB=1025` ...but that is presently untested...
As I thought, 1024 for viostor (for 4MiB xfers) and 1022 for vioscsi (for 4,088KiB xfers).
PRs #1315 and #1316 respectively.
Enjoy...!
Ben
Based on the following results, correct me if I'm wrong, but I think this one is ready to merge...
| Check | Disk Stress (LOGO) | Flush Test | Disk Verification (LOGO) |
|---|---|---|---|
| viostor-Win2022x64 | Pass | Fail | Pass |
| viostor-Win11_24H2x64_host_uefi_viommu | Fail | Pass | Fail |
| viostor-Win2025x64_host_uefi | Pass | Pass | Pass |
There is no need to increase the maximum transfer size this way. The memory allocation mechanism for SG elements and descriptors will be refactored in order to support IOMMU and DMA v3 in any case.
Best regards, Vadim.
@vrozenfe
> There is no need to increase the maximum transfer size this way. The memory allocation mechanism for SG elements and descriptors will be refactored in order to support IOMMU and DMA v3 in any case.
I presume you mean this comment:
> As I thought, 1024 for viostor (for 4MiB xfers) and 1022 for vioscsi (for 4,088KiB xfers). PRs https://github.com/virtio-win/kvm-guest-drivers-windows/pull/1315 and https://github.com/virtio-win/kvm-guest-drivers-windows/pull/1316 respectively.
...and not this PR..?
I have moved the rest of this conversation to PR #1316.
Perhaps in the meantime (while you work on IOMMU and DMA v3), those PRs might work well...? At risk of hijacking this PR, I should also ask: is there a reason it shouldn't be done this way?
[Jenkins CI]: Can one of the admins verify this patch?