physical plaintext network bottleneck?
The benchmarks show best performing plaintext on physical to be at 2.7M rps. The requests have about 400B (TCP content only). So 2.7M rps is about 8.6Gbps. This seems close to the 10Gbps of the ethernet switch. Perhaps the physical benchmark is constrained by the network bandwidth?
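For reference, a minimal back-of-envelope check of that figure (the ~400 B request size is the assumption stated above):

```python
# Rough sanity check: throughput implied by 2.7M requests/second
# at an assumed ~400 bytes per request (headers included).
requests_per_second = 2_700_000
bytes_per_request = 400          # assumed average request size, per this issue

gbps = requests_per_second * bytes_per_request * 8 / 1e9
print(f"{gbps:.2f} Gbps")        # ~8.64 Gbps, close to a 10 Gbps link
```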
This is possible.
For what it's worth, our new hardware environment has a switch with 6x QSFP-100G ports if you were looking to donate three network cards and a few cables 😅
For what it's worth, our new hardware environment has a switch with 6x QSFP-100G ports if you were looking to donate three network cards and a few cables
lol. If you have some spare 10G cards, maybe you can bond them.
The cloud benchmark for plaintext has tokio-minihttp at 64.5% of ulib. These two are at the top of the physical benchmark (100% and 97.8%). I'm curious how much the graph would rescale.
We'll see when you find a sponsor :)
lol. If you have some spare 10G cards, maybe you can bond them.
The 10G cards in the servers we have are double-NIC, but I don't know enough about the hardware to be certain that bonding both NICs on the same card would be an improvement.
The cloud benchmark for plaintext has tokio-minihttp at 64.5% of ulib. These two are at the top of the physical benchmark (100% and 97.8%). I'm curious how much the graph would rescale.
Agreed, though I have learned over the years that nothing really beats C in terms of performance when written well. That said, I personally want Rust to compete.
We'll see when you find a sponsor :)
I'm out there looking!
@msmith-techempower link aggregation is definitely something you should look into for getting 20 Gbps between servers instead of just 10.
The benchmarks show best performing plaintext on physical to be at 2.7M rps.
I can see 9M RPS with the latest runs on Citrine ... how is that possible then?
I can see 9M RPS with the latest runs on Citrine ... how is that possible then?
Indeed. Citrine must have more than 10G then.
I'm looking a bit at these daily results (https://tfb-status.techempower.com/). The variation across runs is huge.
plaintext:
Date | Netty | aspnetcore |
---|---|---|
16/04 | 5.5 Mrps (35.7ms) | 3.0 Mrps (156ms) |
10/04 | 3.4 Mrps (2500ms) | 2.7 Mrps (150ms) |
21/03 | 5.7 Mrps (35.7ms) | 2.4 Mrps (51.1ms) |
Correct my maths if they are wrong, but using Octane's numbers from the first good Citrine run:
=============================
Octane(plaintext) on Citrine
=============================
9,346,826 responses per second
=============================
Octane(plaintext) response
=============================
HTTP/1.1 200 OK\r\n
Server: octane\r\n
Content-Type: text/plain\r\n
Content-Length: 13\r\n
Date: Thu Apr 12 16:18:26 2018\r\n
\r\n
Hello, World!
=============================
126 total bytes received
9,346,826 * 126 =
1,177,700,076 bytes per second =
9,421.600608 Mbits per second =
9.4216 Gbits per second
This seems to make sense on the 10Gb.
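A minimal sketch reproducing that arithmetic (the response text is copied from above; the byte count assumes CRLF line endings as shown):

```python
# Recompute the Octane plaintext numbers above. Assumes CRLF line endings,
# as shown in the response dump.
response = (
    "HTTP/1.1 200 OK\r\n"
    "Server: octane\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 13\r\n"
    "Date: Thu Apr 12 16:18:26 2018\r\n"
    "\r\n"
    "Hello, World!"
)
rps = 9_346_826

size = len(response.encode())            # 126 bytes
gbps = rps * size * 8 / 1e9              # ~9.42 Gbps
print(size, f"{gbps:.2f} Gbps")
```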
@msmith-techempower the math is good for the response. For the request there are about 400 bytes:
GET /json HTTP/1.1
Host: server
User-Agent: Mozilla/5.0 (X11; Linux x86_64) Gecko/20130501 Firefox/30.0 AppleWebKit/600.00 Chrome/30.0.0000.0 Trident/10.0 Safari/600.00
Cookie: uid=12345678901234567890; __utma=1.1234567890.1234567890.1234567890.1234567890.12; wd=2560x1600
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Maybe this one is stuck at 40G then :smile:
Is that an actual request sent via wrk? The request is much smaller than that on plaintext when we use wrk.
Is that an actual request sent via wrk?
You can send that via wrk. I copied the request from https://www.techempower.com/benchmarks/#section=code
I think that is a simple example request for testing, but not actually representative of what wrk does. I am sorry for the confusion, but that documentation was written long before we even started using wrk (we were using some benchmarker from apache at one point... I don't even remember... it was like 7 years ago, even before we moved into the open source space).
That said, with 10Gb ethernet and full duplex, I don't really see anything suggesting that we would be performing above 10Gbps (in the theoretical sense). In fact, testing with iperf confirms this as well.
Locally I think I got 9.8 Gbps with iperf and measured a max of 1.5M packets per second. Any benchmark that is over 1.5M is obviously and correctly using pipelining.
Any benchmark that is over 1.5M is obviously and correctly using pipelining.
plaintext is using pipelining.
Yes I know ;) hence the comment
The requests have about 400B (TCP content only). So 2.7M rps is about 8.6Gbps.
My math assumed 400B requests when creating this issue. So round15 didn't hit 10G. And daily on Citrine may be hitting it?
It would be nice to update https://www.techempower.com/benchmarks/#section=code to reflect the headers/sizes used by wrk.
My math assumed 400B requests when creating this issue.
Understood, may not be accurate. I will check.
It would be nice to update https://www.techempower.com/benchmarks/#section=code to reflect the headers/sizes used by wrk.
Completely agree.
So round15 didn't hit 10G. And daily on Citrine may be hitting it?
Part 1 is definitely true and part 2 is likely for a few extremely high-performance test implementations.
I think the math is off, as networking bandwidth is measured in base 1000 for Kbps -> Mbps -> Gbps, unlike storage, which uses 1024.
So the theoretical max for plaintext on the network layer is limited by two factors: packets per second, and total bandwidth. Given the responses are pipelined at a depth of 16, the most optimal packet size would be 16 x 126 = 2016 bytes (not accounting for any overhead). At a switching rate of 1.5 million packets/second (from @sebastienros above), that gives us a max packet throughput rate of 3,024,000,000 bytes/s, or 24.192 Gbps. Now, I don't know what the packet sizes (MTU) are set to, so frameworks are potentially sending more packets than that (due to them being smaller), but they'd need to be less than 1024 bytes to even approach packet switching being the limit, which seems unlikely.
For total bandwidth: 10 Gbps is 10,000,000,000 bits per second, or 1,250,000,000 bytes per second. To get max RPS for plaintext: 1,250,000,000 / 126 = 9,920,634.9 RPS.
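The same two ceilings as a quick sketch (figures taken from above; no protocol overhead included):

```python
# Two theoretical ceilings for plaintext responses, using the figures above.
RESPONSE_BYTES = 126        # measured response size
PIPELINE_DEPTH = 16         # wrk pipeline depth used by the plaintext test
PACKETS_PER_SECOND = 1.5e6  # measured packet-rate limit (see earlier)
LINK_BPS = 10e9             # 10 Gbps link

# Packet-switching limit: one packet carries a full pipeline of responses.
packet_limited_bps = PACKETS_PER_SECOND * PIPELINE_DEPTH * RESPONSE_BYTES * 8
print(f"packet-rate ceiling: {packet_limited_bps / 1e9:.1f} Gbps")      # ~24.2 Gbps

# Bandwidth limit: how many 126-byte responses fit in 10 Gbps.
bandwidth_limited_rps = LINK_BPS / 8 / RESPONSE_BYTES
print(f"bandwidth ceiling: {bandwidth_limited_rps:,.0f} responses/s")   # ~9.92M
```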
In the latest runs, the top plaintext frameworks seem to be bunching up at around 7,000,000 RPS, which is strange, but quite a way off the theoretical maximum based on above.
@msmith-techempower would be able to elaborate some, but earlier continuous Citrine runs were improperly reporting even higher measured responses per second—above 9 million in the cases of the top performers.
I say these measurements were "improper" because they were collected without the request headers we expect to be sent by the load generator (wrk). This was due to a bug we had introduced during the Docker conversion: the command line arguments specifying the request headers to wrk were not being escaped.
The request headers are significant, and although I have not done the math, my hunch is the requests are longer in total bytes than the responses. After correcting the command line arguments so that wrk is sending the expected request headers, the top performers are clustering at ~7M rps.
We are presently focused on increasing stability across all frameworks in order to wrap Round 16. After that, some further tweaks to the network are being considered which could allow the top performers to be better differentiated.
I believe the request headers for plaintext are currently the following: https://github.com/TechEmpower/FrameworkBenchmarks/blob/e99d22fdd6cf10354e6dd6d4422c64dfd5f28cc2/toolset/benchmark/test_types/plaintext_type.py#L72
That would make the full request the following:
GET /plaintext HTTP/1.1
Host: tfb-server
Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Connection: keep-alive
With the terminating CRLF CRLF, that's a total of 167 bytes per request. The math for the request side then looks like: 1,250,000,000 / 167 = 7,485,029.94 RPS.
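As a quick sketch of which direction binds first on a full-duplex 10 Gbps link (sizes as computed above):

```python
# On a full-duplex 10 Gbps link, ingress and egress each have their own ceiling.
# Sizes are taken from the request/response dumps above.
LINK_BYTES_PER_S = 1_250_000_000   # 10 Gbps expressed in bytes/s
REQUEST_BYTES = 167                # plaintext request incl. terminating CRLF CRLF
RESPONSE_BYTES = 126               # plaintext response

request_ceiling = LINK_BYTES_PER_S / REQUEST_BYTES    # ~7.49M req/s
response_ceiling = LINK_BYTES_PER_S / RESPONSE_BYTES  # ~9.92M resp/s
print(f"ingress ceiling: {request_ceiling:,.0f} req/s")
print(f"egress ceiling:  {response_ceiling:,.0f} resp/s")
print(f"binding limit:   {min(request_ceiling, response_ceiling):,.0f} RPS")
```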
That of course is a theoretical limit, so it seems very likely we're bumping up against our max environmental limit for the ingress side of plaintext now.
@bhauer seems like we should prioritize investigating adapter bonding to raise the upper limit to roughly twice what it is now.
Agreed on the correct maths. It actually pointed me in the direction of fixing a bug I introduced into our wrk image that we resolved some time ago.
As an aside, it seems possible we're starting to hit limits in the JSON test too, although in that case it seems aligned with the packet-switching limit.
@DamianEdwards The math looks solid and conforms with the observed convergence we're seeing in continuous runs. In this recent example continuous run, we see a plaintext convergence at just over 7M. I suspect the theoretical limit is slightly higher than reality since additional bytes are presumably needed for overhead such as frame and packet headers.
Agreed on increasing the network capacity. That said, I'd prefer to get Round 16 finalized before we do that. We will focus on making a "Preview" out of the next good continuous run. (To be clear, a "Preview" is not special from a data perspective; I merely think it will be helpful to get the attention of less-active project participants.) Then aim to finalize a week or two after.
Sounds good. 100G fiber here we come 😁
Above 10GbE we might currently have issues with the benchmarker becoming the bottleneck rather than the web servers: https://github.com/wg/wrk/issues/337
In the latest runs, the top plaintext frameworks seem to be bunching up at around 7,000,000 RPS, which is strange, but quite a way off the theoretical maximum based on above.
Hey, just wondering why, on such small packets, you don't count the IP+TCP overhead of at least 40 bytes. For the small packets in these tests the overhead is quite significant, no? (And I'm not counting the 18-byte Ethernet header, the extra latency from the TCP window size, or the small percentage of ACKs/packet loss, etc.)
One of the popular switch models I know of that has 6x QSFP28 ports (the Nexus 31108PC/TC-V) should really be used in its non-default hardware profile mode; otherwise it exposes only 4x QSFP28 and 2x QSFP+. Maybe the other switches based on the same merchant silicon have these different modes too.
@onyxmaster Thanks for the alert on that. That is the switch model we're using, so we will need to be mindful of the configuration.
Hey, just wondering why, on such small packets, you don't count the IP+TCP overhead of at least 40 bytes. For the small packets in these tests the overhead is quite significant, no?
Plaintext is pipelined, so it may result in only 2 packets for 16 requests and 2 packets for 16 responses.
Yeah the math was just meant to be HTTP only theoretical max, to see if we're in the ballpark, and it's clear we are. Would be interesting to include overheads to see just how close to network theoretical limit we are too.
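A rough, hedged sketch of that overhead-inclusive version, assuming a standard 1500-byte MTU, the 40 B of IP+TCP plus 18 B of Ethernet framing per packet mentioned above, pipeline depth 16, and ignoring ACKs and retransmits:

```python
import math

# Very rough estimate of how per-packet overhead lowers the HTTP-only
# ceilings above. Assumptions: 1500-byte MTU, 40 B IP+TCP + 18 B Ethernet
# per packet, wrk pipeline depth 16, no ACK or retransmission traffic.
MTU = 1500
PER_PACKET_OVERHEAD = 40 + 18
PIPELINE = 16
LINK_BYTES_PER_S = 1_250_000_000

def ceiling(msg_bytes):
    burst = PIPELINE * msg_bytes                        # one pipelined burst
    packets = math.ceil(burst / MTU)                    # packets per burst
    wire_bytes = burst + packets * PER_PACKET_OVERHEAD  # bytes actually on the wire
    return LINK_BYTES_PER_S / (wire_bytes / PIPELINE)   # messages per second

print(f"requests  (167 B): {ceiling(167):,.0f}/s")   # ~7.2M, vs 7.49M HTTP-only
print(f"responses (126 B): {ceiling(126):,.0f}/s")   # ~9.4M, vs 9.92M HTTP-only
```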
We're also working to help update the test infrastructure to capture CPU, TX/RX, & packet rate during every run, and include that data in the results.
Sounds good. 100G fiber here we come 😁
@DamianEdwards I finally got some time, so I set up the 100G cards and plugged everything in this morning, but I found that they seem to not be plug-n-play (sort of expected) and Intel does not support Ubuntu server. Have you guys experienced this as well? Any workaround?
I think the worst-case scenario would be that we switch the machines over to a supported CentOS/openSUSE/RHEL to get driver support. Since everything is done in Docker, I do not really see any cause for concern.
@bhauer asked that I cc @sebastienros for the above concern.
Basically, we want to make sure that there is parity between our environments, so if you guys got the Debian drivers working (I'm not at all sure how or if that's possible), then we would want to stick with that approach, otherwise let's land on some choice to proceed.
I will ask our colleagues who manage the lab, but as of last week I know they had not tried it yet.
This is kind of the most classic flaw in benchmarking - bottlenecking on the wrong part and crediting something different.
Michael Jackson looked at the moon, so did Leonardo Da Vinci.
They are both dead.
The moon kills people.
You don't need to chase some 20-thousand-USD super network adapter - just lower the available CPU time and you're done. Much cheaper, and now you're no longer benchmarking essentially noise.
Any plans to work on this? I'm looking forward to seeing that plaintext 10x-top-1 become an actual top 10.
You don't need to chase some 20-thousand-USD super network adapter - just lower the available CPU time and you're done. Much cheaper, and now you're no longer benchmarking essentially noise.
I agree that lowering the available CPU is an alternative to increasing bandwidth. The plaintext numbers capture the imagination, and I guess that is why the preference is to increase bandwidth.
I wrote a post about this just now actually: https://medium.com/@alexhultman/why-people-should-stop-listening-to-techempower-c544e7b538a5
TechEmpower provides a set of tests, and runs frameworks against those tests. While most tests don't reflect real world applications, they do stress particular aspects of the frameworks. The results can then be used to improve those frameworks.
This is very much what we've seen in the .NET Core space. Microsoft actively used TechEmpower benchmarks to improve the framework. And those performance benefits are measurable in the end-user applications.
Ehm, no. The top 13 servers are not capped on CPU, so you can't draw any valid conclusion from them. Did you even read the post?
Yes, that is what this issue is tracking. The top for plaintext physical is noise. Your blog post has a broader message: Why people should stop listening to TechEmpower, which is what I was replying to.
I have a question regarding the "JSON serialization" test - it looks identical to the plaintext test, only instead of "Hello world" the response is JSON. Why are the results 7x lower in the JSON test? Do you not use pipelining there?
Yes, the JSON test runs without pipelining.
It cannot be the case that 8 servers score almost identically at 7 million req/sec.
That is not necessarily the case. However, it can be (and likely is the) case that 8 test implementations are performing at or near the maximum throughput of our hardware at present, and that's still very interesting.
We are at round 17 now, so you've known this for some time and you still post invalid results.
We disagree on this.
It really tears down the validity of TechEmpower altogether.
and this.
It is probably worth mentioning that due to issue #3804, the fastest frameworks in the cached queries test might also be hitting the network bandwidth limit. For example, according to wrk, servlet-postgresql transfers 1.07 GB/s on average.
Also, ulib is at 1.37M RPS for the case of 1 object extracted from the cache (Edit: currently 16 objects), which is also very similar to the JSON serialisation test (256 and 512 concurrency). Given the math above (1.5M packets without pipelining) and assuming 1 packet per response, it is very close.
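A quick comparison of those figures against the ceilings discussed earlier (treating the reported 1.07 GB/s as 10^9 bytes/s; both figures are the ones quoted in this comment):

```python
# Compare the cached-queries observations above with the 10 Gbps ceilings.
LINK_BYTES_PER_S = 1.25e9      # 10 Gbps
PACKETS_PER_SECOND = 1.5e6     # measured packet-rate limit (see earlier)

observed_transfer = 1.07e9     # servlet-postgresql, bytes/s reported by wrk
observed_rps = 1.37e6          # ulib, cached queries (no pipelining)

print(f"bandwidth used:   {observed_transfer / LINK_BYTES_PER_S:.0%}")   # ~86%
print(f"packet rate used: {observed_rps / PACKETS_PER_SECOND:.0%}")      # ~91%
```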
Isn't the solution simple? Just run the benchmarks on slower hardware. Or am I wrong?
Idea: Add a new dimension with the number of cores (1, 4, .. MAX). That would show how some frameworks behave in constrained environments.