Performance impact?

Open FCrane opened this issue 9 years ago • 42 comments

Hi!

I'm testing the passthru example with "true" and "8" as parameters (also tried 1) and it works fine. However, copying a file over the network that usually runs at about 90 MB/s slows down to 25 MB/s. CPU load of the passthru program is between 20 and 25%.

Does WinDivert really slow down network traffic that much? Can this be improved? Other solutions, like WinpkFilter, have a much smaller impact (e.g. just 5% CPU load, just 10% drop in transfer rate).

Thanks!

FCrane avatar Sep 04 '15 08:09 FCrane

I've now tested this again and indeed the performance impact is huge. On a gigabit LAN, running the passthru example slows the network speed down to about 30%! I'm losing almost 70% of the network speed on a fast machine (Intel i7). The other network packet filter "WinpkFilter" does not show that issue. It lets the traffic pass at full speed with a fraction of the CPU load WinDivert uses...

Is WinDivert really so slow?

FCrane avatar Sep 04 '15 18:09 FCrane

Firstly, which version of WinDivert did you use? Some of the older versions have performance problems that have been fixed.

These are my test results for WinDivert1.2.0-rc:

direct         : inbound=206Mbps outbound=206Mbps
passthru true 4: inbound=205Mbps outbound=178Mbps

200Mbps = ~25MB/s is less than your 90MB/s, and I have not yet tested anything higher.

There is a performance hit for outbound traffic (206 vs 178Mbps, about 14%). This is something I was aware of, but I have never found the exact cause. A possible culprit is the checksum recalculation, which may also explain some of the CPU usage. Unfortunately, correct checksums are a requirement of the underlying WFP framework as far as I can tell. WinPkFilter is a lower-level NDIS intermediate driver and probably does not need checksum recalculation for a passthru-type example.
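
The unit conversion and the outbound drop quoted in this comment can be checked with a few lines of arithmetic. This is only a minimal sketch in Python; the helper names are made up for illustration and are not part of WinDivert.

```python
def mbps_to_mbytes_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


def relative_drop(baseline: float, measured: float) -> float:
    """Fraction of throughput lost when going from `baseline` to `measured`."""
    return (baseline - measured) / baseline


# 206 Mbps expressed in MB/s, to compare against the reported 90 MB/s.
print(mbps_to_mbytes_per_s(206))                  # 25.75
# Outbound drop between the direct and passthru runs, in percent.
print(round(relative_drop(206, 178) * 100, 1))    # 13.6
```

By the same conversion, the 90 MB/s transfer from the original report corresponds to 720 Mbps, well above the roughly 200 Mbps link used in this test.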

The other thing is that WinDivert has always been a convenience versus performance trade-off. For best performance, you are better off implementing a specialized filtering driver for your application.

basil00 avatar Sep 05 '15 05:09 basil00

I'm using WinDivert v1.18. It's a pity that it slows down gigabit networks, because otherwise it seems really great!

Maybe you can test it on a gigabit LAN to see for yourself.

FCrane avatar Sep 05 '15 06:09 FCrane

Did you ever find anything to improve the performance? I'm currently experiencing a similar performance drop and high CPU usage. In my case my download speed goes from ~6MB/s to 4.5MB/s with 15% CPU (probably depends on the CPU). The application spends most of its time in the WinDivertSendEx method. I already increased both available parameters (WINDIVERT_PARAM_QUEUE_LEN and WINDIVERT_PARAM_QUEUE_TIME) but that does not make a lot of difference I'm afraid.

ghost avatar Sep 22 '16 09:09 ghost

Hi!

Sorry, but no. This problem seems to be by design, and the developers don’t seem to be interested in fixing it.

Regards!

FCrane avatar Sep 22 '16 12:09 FCrane

Hi @FCrane and @basil00, I'd like to write a packet filter that blocks outbound traffic from almost 162K IPs. Instead of diverting all traffic and then filtering it, I've come up with the following scheme:

  • For TCP, divert only SYN packets (if the SYN is dropped, the connection is never established); for UDP and everything else, divert all packets (filter: "(outbound and tcp.Syn) or (outbound and (!tcp))"; is this filter correct?).
  • Catch the diverted traffic (with 4-8 threads) and filter it using binary search: if a diverted packet is from one of the blacklisted IPs, block it; otherwise reinject it.

Please comment on my scheme: would it work? Additionally, please let me know which tools and setup you used to measure the performance drop. I'd appreciate help creating an environment for measuring performance, since I want to test my scheme's throughput. Thanks!
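
The binary-search lookup in the second step could look roughly like the sketch below. It is written in Python only to illustrate the idea; the function names and the example addresses (taken from the documentation ranges) are hypothetical, and a real filter would extract the address from the diverted packet's IP header before the lookup.

```python
import bisect
import ipaddress


def build_blacklist(addresses):
    """Return the IPv4 addresses as a sorted, de-duplicated list of integers."""
    return sorted({int(ipaddress.IPv4Address(a)) for a in addresses})


def is_blacklisted(blacklist, address):
    """Binary-search the sorted blacklist for `address` in O(log n) steps."""
    key = int(ipaddress.IPv4Address(address))
    i = bisect.bisect_left(blacklist, key)
    return i < len(blacklist) and blacklist[i] == key


# Hypothetical entries; the real list would hold about 162K addresses.
blacklist = build_blacklist(["203.0.113.7", "198.51.100.23", "192.0.2.1"])

print(is_blacklisted(blacklist, "198.51.100.23"))  # True  -> drop the packet
print(is_blacklisted(blacklist, "198.51.100.24"))  # False -> reinject it
```

With about 162K entries, each lookup needs at most 18 comparisons, so the lookup itself is unlikely to be the bottleneck compared with moving packets between kernel and user mode.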

u-riaz avatar Sep 27 '16 12:09 u-riaz

Generally you want to divert as little traffic as possible to get the job done. Diverting only SYN packets is a good approach and should have minimal impact, although it will not affect TCP connections that are already established. For UDP, which has no SYN equivalent, you'd be stuck with diverting everything or implementing something more complex (e.g. updating the filter string to whitelist established UDP flows).

basil00 avatar Sep 28 '16 13:09 basil00

This problem seems to be by design, and the developers don’t seem to be interested in fixing it.

Nothing can be done until #53 is fixed anyway.

basil00 avatar Sep 28 '16 13:09 basil00

@basil00, thanks for your reply. Could you please let me know which tools and setup you used to measure the performance drop (the one you and @FCrane measured and mentioned in the comments above)? I'd appreciate help creating an environment for measuring performance, since I want to test my scheme's throughput.

u-riaz avatar Sep 28 '16 14:09 u-riaz

For latency use ping, and for throughput any file transfer tool will do (ftp, scp, or one of the HTTP speed testers you can find by googling).

The performance impact of WinDivert is usually minimal unless you are attempting to divert megabytes per second of data through a user application. This is especially true for latency, where the lag introduced by the user application is usually insignificant compared to the normal network lag. One danger for throughput is the WinDivert packet queue being overwhelmed, which results in packet loss.
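
If no dedicated tool is at hand, a throughput figure can also be obtained by timing a copy. The sketch below is a minimal Python example of that idea; the helper names are hypothetical, and an in-memory buffer stands in for what would, in a real test, be a file opened on a network share.

```python
import io
import time


def timed_copy(src, dst, chunk_size=64 * 1024):
    """Copy `src` to `dst` in chunks; return (bytes_copied, seconds_elapsed)."""
    total = 0
    start = time.perf_counter()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total, time.perf_counter() - start


def mbytes_per_s(total_bytes, seconds):
    """Throughput in MB/s (10**6 bytes), guarding against a zero duration."""
    return total_bytes / max(seconds, 1e-9) / 1e6


# Assumption: in-memory stand-ins; replace with open(r"\\server\share\file", "rb").
src = io.BytesIO(bytes(8 * 1024 * 1024))
dst = io.BytesIO()
copied, elapsed = timed_copy(src, dst)
print(f"{copied} bytes at {mbytes_per_s(copied, elapsed):.1f} MB/s")
```

Running the same copy once with and once without the diverting program active gives the two numbers needed for a comparison.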

basil00 avatar Oct 01 '16 01:10 basil00

https://github.com/basil00/Divert/issues/52#issuecomment-137912410

It would be helpful to know whether the performance hit stays constant when the transfer is throttled, at varying throttling rates.

satnatantas avatar Dec 06 '16 21:12 satnatantas

The latest WinDivert source code seems to be about 4x faster than older versions 1.1.X and 1.2.Y, at least with my quick-and-dirty testing. This might not be quite gigabit speeds, but at least it is a lot closer. The better performance is mainly due to internal driver optimizations such as avoiding copying packets (where possible) and instant injection.

basil00 avatar Oct 16 '17 15:10 basil00

Nice @basil00! 4x more throughput or 4x less CPU usage? Or both? 👍

ghost avatar Oct 17 '17 10:10 ghost

@Areithus In my experience nearly all of the CPU usage is in the user application; for the diversion process itself, CPU usage is nil when using overlapped functions and tracking TCP packet flows. Also, I believe, given the context of the thread, that it's about throughput.

This is exciting news. I've had word, @basil00, that the EV cert was granted and is in the mail to someone I'm working with, so we should be able to sign shortly.

TechnikEmpire avatar Oct 17 '17 10:10 TechnikEmpire

Yes I meant 4x throughput, although it was a very rough test. I was testing 1Gbps speed, and version 1.2.0 choked at about 170Mbps, whereas the new version managed 630Mbps (still not perfect but much better). But this is just one quick test.

This is exciting news. I've had word, @basil00, that the EV cert was granted and is in the mail to someone I'm working with, so we should be able to sign shortly.

Let me know when you are ready and I can assist. My other sponsor signed version 1.3.0 but it was a long and painful process, but we gained much experience. From the project's perspective there is no harm in more than one sponsor :)

basil00 avatar Oct 17 '17 13:10 basil00

What was the CPU and its load when you tested it? Did you test passthru?

satnatantas avatar Oct 18 '17 12:10 satnatantas

Yes passthru. The test box is an old system, so might do better with more modern CPUs.

basil00 avatar Oct 19 '17 23:10 basil00

Some benchmarks for passthru true at gigabit speeds:

------------------------------------------------------------------
Direct:

##: 0.80 Gbps down, 0.93 Gbps up

------------------------------------------------------------------
WinDivert-1.2.0-rc (#threads)

#1: 0.06 Gbps down, 0.06 Gbps up
#2: 0.12 Gbps down, 0.11 Gbps up
#3: 0.16 Gbps down, 0.15 Gbps up
#4: 0.19 Gbps down, 0.18 Gbps up

------------------------------------------------------------------
WinDivert-1.3.0 (#threads)

#1: 0.41 Gbps down, 0.49 Gbps up
#2: 0.71 Gbps down, 0.81 Gbps up
#3: 0.76 Gbps down, 0.87 Gbps up
#4: 0.73 Gbps down, 0.83 Gbps up

------------------------------------------------------------------
WinDivert-1.4.0-dev (#threads)

#1: 0.39 Gbps down, 0.46 Gbps up
#2: 0.61 Gbps down, 0.75 Gbps up
#3: 0.77 Gbps down, 0.84 Gbps up
#4: 0.74 Gbps down, 0.77 Gbps up

Notes:

  • WinDivert-1.2.0 and earlier had a performance bug that limited throughput to around 200Mbps. Coincidentally, this was my available bandwidth at the time, so the problem went unnoticed.
  • The performance bug was fixed here. The fix was included in the WinDivert-1.3.0 release. The disadvantage of the fix is that WinDivertSend() will not return an error code if the injection fails (instead the packet will silently disappear).
  • The performance has slightly regressed in WinDivert-1.4.0 (although for 3 threads it is about the same). I will continue to investigate.
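
To make the table easier to compare, the results can be expressed as a percentage of the direct (unfiltered) throughput. The Python sketch below only re-uses the numbers listed above; the variable and function names are made up for illustration.

```python
# (down, up) in Gbps, copied from the benchmark table above.
DIRECT = (0.80, 0.93)
RESULTS = {
    "1.2.0-rc":  [(0.06, 0.06), (0.12, 0.11), (0.16, 0.15), (0.19, 0.18)],
    "1.3.0":     [(0.41, 0.49), (0.71, 0.81), (0.76, 0.87), (0.73, 0.83)],
    "1.4.0-dev": [(0.39, 0.46), (0.61, 0.75), (0.77, 0.84), (0.74, 0.77)],
}


def percent_of_direct(version):
    """Return (threads, down %, up %) rows relative to the direct connection."""
    rows = []
    for threads, (down, up) in enumerate(RESULTS[version], start=1):
        rows.append((threads,
                     round(100 * down / DIRECT[0]),
                     round(100 * up / DIRECT[1])))
    return rows


for version in RESULTS:
    for threads, down_pct, up_pct in percent_of_direct(version):
        print(f"{version:10s} threads={threads} down={down_pct}% up={up_pct}%")
```

For example, WinDivert-1.3.0 with 3 threads comes out at about 95% of the direct download speed and 94% of the direct upload speed.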

basil00 avatar Oct 25 '17 01:10 basil00

There is no MSVC build for WinDivert 1.3.0?

lumogate avatar Oct 25 '17 11:10 lumogate

@kelvinomolumo did you check the releases page?

TechnikEmpire avatar Oct 25 '17 11:10 TechnikEmpire

Nice test @basil00, I did some testing here as well (just a little with reading) and can confirm that 1.3.0 is faster than 1.4.0. Not just the throughput but also CPU usage is a little less (about 1-2% less).

ghost avatar Oct 25 '17 13:10 ghost

There is no MSVC build for WinDivert 1.3.0?

No, try to link against the MINGW version.

can confirm that 1.3.0 is faster than 1.4.0

Version 1.4.0 has a more complicated pipeline, so it is probably a bit slower as a result. The details are somewhat technical, but version 1.3.0 queues packets (by deep copying) at DISPATCH_LEVEL, which is not ideal (at DISPATCH_LEVEL the thread is uninterruptible, so nothing else can run until the copying has finished). Version 1.4.0 fixes this by moving the copying and filtering out-of-band to run at PASSIVE_LEVEL (which is interruptible, just like normal user-mode code), but this requires an extra internal queue, so it likely adds some overhead.

A more optimal design (in terms of performance) would be not to use deep copying for queueing packets at all, but rather to keep a reference to the original packet (NET_BUFFER_LIST). However, drivers are not supposed to hold references to NET_BUFFER_LISTs for long, such as while waiting for a user-mode application (as is the case with WinDivert), and Microsoft specifically advises against this.

basil00 avatar Oct 26 '17 00:10 basil00

@basil00 I've been reading up a little on this (I think you refer to this specifically: https://msdn.microsoft.com/en-us/library/windows/hardware/ff551206(v=vs.85).aspx and also on some other pages such as https://msdn.microsoft.com/en-us/library/windows/hardware/ff551134(v=vs.85).aspx). Please correct me if I'm wrong though. It seems that if you listen to IRP_MN_QUERY_POWER you can keep the references. Might be worth looking into, it'd be nice to get near gbit speeds with just 2 threads.

ghost avatar Oct 27 '17 07:10 ghost

That might be something to look into.

I also remembered that there are other complications to consider. Specifically, while deep copying sounds slow, it also has the benefit of freeing up the original buffer. This means that WSASend can complete immediately rather than blocking until WinDivert dereferences the packet. This can result in better throughput.

basil00 avatar Oct 27 '17 15:10 basil00

The latest WinDivert-1.4-dev has reverted back to deep copying rather than referencing packets. It appears this mode is actually slightly faster:

#3: 0.80 Gbps down, 0.87 Gbps up

So there is no reason not to continue using this mode for the immediate future. I hope to release version 1.4 shortly.

basil00 avatar Jan 15 '18 14:01 basil00

@basil00 is the WinDivertSend back to how it was working in v1.3 as well with no error if injection fails?

lumogate avatar Jan 15 '18 20:01 lumogate

Since version 1.3.0 the WinDivertSend function will return immediately since this is a lot faster. If you prefer to wait for an error code, it is possible to pass the WINDIVERT_FLAG_DEBUG flag to WinDivertOpen and this will emulate the old behavior. Note that, in my experience, Windows often does not return an error if injection fails either way...

basil00 avatar Jan 15 '18 20:01 basil00

I had evaluated the WinDivert 2.0 performance as part of testing, so it is probably worthwhile to make some quick notes here.

One problem was that I was unable to replicate the previous performance numbers for older versions of WinDivert. It is possible that WinDivert performance took a hit from the Meltdown mitigation, especially since my test box uses older hardware. I was also unable to replicate the top speeds for the unfiltered connection, which may be related, or may have been a temporary network issue.

Nevertheless, we can evaluate the relative performance of WinDivert 2.0, and it essentially matches 1.4.3 using the same parameters (i.e., same thread count), which is in line with expectations.

WinDivert 2.0 also introduces "batch mode" using the WinDivert...Ex() functions. This allows the user application to send/receive multiple packets at once, and significantly reduces the number of kernel/user-mode context switches required. In my experiments, batch mode can significantly improve performance even for single-threaded applications. Using the 2.0 version of passthru, the following configuration (using a single thread and a batch size of 32) can run at "full speed" (~0.83Gbps filtered vs ~0.93Gbps unfiltered):

passthru.exe true 1 32

This suggests that "batch mode" is the most important factor in terms of performance improvement in recent versions of WinDivert.
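
A back-of-the-envelope estimate shows why batching helps. The Python sketch below assumes full-size 1500-byte packets and one receive call per batch; both are assumptions made for illustration, not measurements, and the function name is hypothetical.

```python
def calls_per_second(gbps, packet_bytes, batch_size):
    """Estimate receive calls per second needed to sustain `gbps`.

    Assumes every packet is `packet_bytes` long and every call returns a
    full batch of `batch_size` packets.
    """
    packets_per_s = gbps * 1e9 / 8 / packet_bytes
    return packets_per_s / batch_size


# At ~0.83 Gbps with 1500-byte packets (an assumed packet size):
print(round(calls_per_second(0.83, 1500, 1)))    # 69167 calls/s, one packet per call
print(round(calls_per_second(0.83, 1500, 32)))   # 2161 calls/s, batches of 32
```

Under these assumptions a batch of 32 cuts the number of user/kernel transitions on the receive side by a factor of 32, and the same reasoning applies to the send side. Smaller packets would raise both figures proportionally.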

basil00 avatar Mar 31 '19 18:03 basil00

Hi basil,

I am suffering from the performance issue now. We are focused on SMB (network share) performance. Test environment: two Windows virtual machines (Win10 and Win7) connected to each other under Parallels Desktop.

  • Without WinDivert, the file copying speed reaches 150MByte/s (the virtual network adapter is 10Gbps).
  • With WinDivert passthru.exe (both 1.4.3 and 2.0.0 rc), the speed I can get is at most 50MByte/s.
  • The best performance with passthru is with 1 or 2 threads: passthru true 1 for 1.4.3, and passthru true 1 32 for 2.0 rc.

Although this was tested on virtual machines, I got similar results on physical machines with a 1Gbps connection.

Could you please shed some light on this? How can I debug this issue?

Thanks

haohaolee avatar May 10 '19 12:05 haohaolee

Just so you know, it's a known issue to some of us that SMB and other Windows services suffer degraded performance. However, I haven't retested using batching.

Personally I exempt such traffic, but that may not be an option depending on your use case.

TechnikEmpire avatar May 10 '19 12:05 TechnikEmpire