
Investigate faf results on Plaintext

Open · sebastienros opened this issue 2 years ago · 38 comments

It's currently over 8M RPS, which is above the theoretical limit of the network.

sebastienros avatar Jun 07 '22 22:06 sebastienros

I think the date is empty as the variable is never initialized: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Rust/faf/src/main.rs#L43

The initialization seems to have been removed in this PR: https://github.com/TechEmpower/FrameworkBenchmarks/pull/6523/files#diff-e828d6a53c980b3107862cfa2530ba397b76f56d2a0bb3d555465a6f6217f4feL52-L53

That would explain the network bandwidth difference.

sebastienros avatar Jun 07 '22 22:06 sebastienros

/cc @errantmind

sebastienros avatar Jun 07 '22 22:06 sebastienros

Please look at the source for the framework here: the date value is passed along from the main event loop, and date header validation does exist in the tfb toolset.

sumeetchhetri avatar Jun 09 '22 04:06 sumeetchhetri
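Not faf's actual code (faf is Rust), but for readers unfamiliar with the pattern being described: high-performance servers typically format the Date header once per second and reuse the cached bytes for every response. A minimal C sketch of the idea, with all names hypothetical:

#include <stdio.h>
#include <time.h>

/* Cached, preformatted Date header, refreshed at most once per second.
   The event loop calls refresh_date_header() on each tick; every
   response then reuses the cached bytes instead of reformatting. */
static char date_header[64];
static time_t last_refresh = 0;

static void refresh_date_header(void) {
    time_t now = time(NULL);
    if (now == last_refresh) return;   /* still fresh, nothing to do */
    last_refresh = now;
    struct tm tm;
    gmtime_r(&now, &tm);
    strftime(date_header, sizeof date_header,
             "Date: %a, %d %b %Y %H:%M:%S GMT\r\n", &tm);
}

int main(void) {
    refresh_date_header();             /* would be an event-loop tick */
    fputs(date_header, stdout);
    return 0;
}

This is consistent with billywhizz's observation further down that the date field changes at least every second.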

Thanks for confirming. Do you know how it's getting over the "physical" limits then? If not, I will try to check the full payload it's sending so I can understand.

sebastienros avatar Jun 09 '22 04:06 sebastienros

could process priority affect throughput?

fafhrd91 avatar Jun 09 '22 05:06 fafhrd91

@fafhrd91 historically, all plaintext results have been capped at ~7M RPS: when you look at the minimum request size and the network packet overhead, over a 10Gb/s network it can't be faster than that.

This was confirmed by:

  • stripping the /plaintext URL down to /p (which triggered a new rule in the specs)
  • using smaller Server: ... headers (which also triggered a new rule in the specs)
  • using a faster network card (in our Microsoft lab), which improved all plaintext results for top frameworks (up to 14M RPS) without changing anything else.

I am not excluding any magic trick though, even at the network level, which could be a totally fine explanation.

sebastienros avatar Jun 09 '22 15:06 sebastienros

@sebastienros @fafhrd91 @sumeetchhetri @errantmind

i've done a little investigation here and it seems each faf request from wrk is 87 bytes and each response back is 126 bytes.

so, if we assume for argument's sake an average ethernet packet size of 8192 bytes, that would allow 64 responses per packet plus 66 bytes of protocol overhead for TCP/IP/Ethernet, giving us an overhead of ~1 byte per response, for an overall average including protocol overhead of 127 bytes per response.

if we divide 10,000,000,000 (10Gb) bits by (127 * 8 = 1016) bits we get a theoretical max throughput of 9,842,519 responses per second on a 10Gb ethernet interface.

so, i think it is possible that the faf solution, due to its extremely low-level design and its focus on shaving every possible microsecond off the server-side processing overhead, could be significantly faster than all the other frameworks, no? it does seem to skew the results on the TE charts though, so maybe it should be tagged as stripped instead of realistic?

[screenshot: TE plaintext chart]

i can also confirm that the responses adhere to the rules and the date field does change at least every second.

billywhizz avatar Jun 15 '22 22:06 billywhizz
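The arithmetic above is easy to sanity-check by parameterizing it. A small C sketch, carrying over the 8192-byte average packet size as an assumption from the comment:

#include <stdio.h>

int main(void) {
    const double link_bits = 10e9;  /* 10Gb/s ethernet interface */
    const double resp      = 126;   /* bytes per plaintext response (observed) */
    const double pkt       = 8192;  /* ASSUMED average packet size, per the comment */
    const double frame     = 66;    /* Ethernet/IP/TCP overhead per packet */

    int per_pkt = (int)((pkt - frame) / resp);  /* 64 responses fit per packet */
    double cost = resp + frame / per_pkt;       /* ~127 effective bytes each */
    printf("theoretical max: %.0f responses/sec\n", link_bits / (cost * 8.0));
    return 0;
}

This prints ~9.84M, in line with the figure above; plugging in a 1500-byte MTU instead reproduces the ~9.47M figure that comes up later in the thread.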

@sebastienros are my calculations above correct? as far as i understand, they assume full duplex communication between the web server and the benchmark server, so that would mean a theoretical throughput of 10Gb up and 10Gb down simultaneously, yes? if not, i can't understand how we could even reach 7m RPS. :man_shrugging:

billywhizz avatar Jun 16 '22 14:06 billywhizz

I didn't see this right away, so sorry for the delay.

There is no single secret that accounts for the difference in performance between faf and other projects. Each component in faf has been hand optimized to some degree.

If anyone is extra curious they can test each component individually against what they have. The source code is a little dense but has decent separation-of-concerns.

errantmind avatar Jun 17 '22 08:06 errantmind

@errantmind i will take a look. i find rust syntax so horrific to look at though. i'll see if i can put together something in plain old C that would achieve the same throughput.

@sebastienros @nbrady-techempower can anyone confirm re. my questions/calculations above? i had thought, as sebastien asserted above, that 7m RPS was some kind of hard limit based on the hardware but it seems not. :man_shrugging:

billywhizz avatar Jun 17 '22 14:06 billywhizz

Sorry, I haven't had time to look into it, but I did the same calculation 3 years ago (we recorded it, I think) and we confirmed the 7M. I had looked at Wireshark traces too, because we didn't understand why all results capped at 7M. Then we ordered new NICs (40Gb/s) and magically the benchmarks became much faster. All the benchmarks that are currently at 7M reach 10M+ when using a faster card, with the same hardware (we own the same physical machines as TE), so I don't think faf is just faster because it's "more" optimized than the other fast frameworks. But until proven otherwise this is what it is, and I totally understand that there are some tests to validate the results. Hope to find some time to investigate more soon.

sebastienros avatar Jun 17 '22 17:06 sebastienros

@billywhizz

if we assume for argument's sake an average ethernet packet size of 8192 bytes

MTU is 1500 on these NICs, including the ones we use.

sebastienros avatar Jun 17 '22 17:06 sebastienros

@sebastienros ok, so if my quick calculation is correct that would make it a theoretical max RPS of ~9.4m.

each packet = (11 × 126) + 66 = 1452 bytes = 11,616 bits
10,000,000,000 bits / 11,616 = 860,881 packets/sec
860,881 packets × 11 = 9,469,696 RPS

assuming full duplex and that the downstream link can use the full 10Gb. does that sound correct?

billywhizz avatar Jun 17 '22 17:06 billywhizz

I took some tcpdumps and all looks fine. I also compared against ULib (C++), which used to be the fastest on our faster network.

The important things I noted are:

  • The headers: Server: F vs Server: ULib, which saves 3 bytes.
  • The packets are smaller for ULib, so it's more common to see the responses split across two packets vs. FAF.
  • All pipelined requests fit in one packet of 2720B, for both, even though the MTU is 1500. I read there are some network optimizations (presumably segmentation offload, TSO/GSO, which makes captures show packets larger than the MTU) that make this possible.

But something you should re-calculate: the requests are 170B, so these should be the network bottleneck (if there is one):

From the client, each packet is 2720B (16 requests of 170B) plus the preamble (66B), which makes 2786B to send 16 requests. The NICs are full-duplex, so 10Gb up AND down.

10^9 / 2786 = 358,937 groups of 16 requests per second
358,937 × 16 = 5,742,992 RPS

And this assumes there are no lost/empty packets (and there are).

sebastienros avatar Jun 17 '22 19:06 sebastienros

@sebastienros

thanks for getting back. that's odd - when i run locally and do a single http request, the http payload is only 126 bytes. not sure where 170 bytes is coming from. thanks for confirming re. full duplex. i think my calculation is correct then, if we can resolve the 126 bytes versus the 170 bytes you are seeing for the http payload?

[screenshot: local capture]

billywhizz avatar Jun 17 '22 19:06 billywhizz

I am talking about the request payload, not the response.

sebastienros avatar Jun 17 '22 19:06 sebastienros

@sebastienros ah sorry - i missed that. yes - the request payload when running with the 'Connection' and 'Accept' headers in tfb is indeed bigger - i had assumed it would be smaller than the response. sorry for the confusion. :pray:

the numbers above are a little off. it works out as a max of ~7m requests per second, as we had initially assumed.

[screenshot: corrected calculation]

so, we're no closer to having an answer then? :thinking: it must be :magic_wand: :smile:

billywhizz avatar Jun 17 '22 20:06 billywhizz

if you wanted to, you could improve the baseline significantly by removing the redundant 'Connection: keep-alive' header from the request - in HTTP/1.1, keep-alive is assumed in the absence of this header.

you could also drop the 'Accept' header altogether. removing these two headers would knock 120 bytes off each request - they are most likely ignored anyway by most if not all of the frameworks in the test (see the rough size check below). :man_shrugging:

billywhizz avatar Jun 17 '22 20:06 billywhizz
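To put rough numbers on that claim: the exact request the toolset sends isn't copied here, but assuming a Host of tfb-server and a typical TFB-style Accept value (both assumptions), the two dropped headers account for 120 bytes:

#include <stdio.h>
#include <string.h>

int main(void) {
    /* Approximation of the plaintext request wrk sends for TFB; the
       Host and Accept values are assumptions, not copied from the repo. */
    const char *with_headers =
        "GET /plaintext HTTP/1.1\r\n"
        "Host: tfb-server\r\n"
        "Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,"
        "application/xml;q=0.8,*/*;q=0.7\r\n"
        "Connection: keep-alive\r\n"
        "\r\n";
    const char *without =
        "GET /plaintext HTTP/1.1\r\n"
        "Host: tfb-server\r\n"
        "\r\n";
    printf("with: %zu B, without: %zu B, saved: %zu B\n",
           strlen(with_headers), strlen(without),
           strlen(with_headers) - strlen(without));
    return 0;
}

With these assumed values the full request comes out at 165 bytes, matching the pipelined request size captured further down, and the savings at exactly 120 bytes.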

Did we ever get an up-to-date hard confirmation that each (pipelined) request received is 2720 bytes? It has been a while since I've done any testing, but I have a recollection the requests were a lot smaller than that. Either way, can you provide the exact bytes you are seeing in each pipelined request?

errantmind avatar Jun 17 '22 23:06 errantmind

@errantmind

i just ran the pipelined test locally and you can see each pipelined http request is 165 bytes. on my machine the packets all seem to be 2706 bytes - 2640 payload and 66 bytes of Eth/IP/TCP preamble - so 16 http requests per packet, as expected for a pipeline level of 16.

[screenshot: pipelined request capture]

so, that gives us a theoretical max of (10,000,000,000 / (2706 × 8)) × 16 = 7,390,983 requests per second, excluding other TCP protocol overhead (e.g. 40 bytes for the ACK on each packet); accounting for that, likely ~7.2-7.3m requests per second.

billywhizz avatar Jun 18 '22 01:06 billywhizz
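The same calculation in code, using the sizes from the capture above:

#include <stdio.h>

int main(void) {
    const double link_bits = 10e9; /* 10Gb/s, client->server direction */
    const int req_bytes    = 165;  /* observed pipelined request size */
    const int depth        = 16;   /* wrk pipeline depth in the plaintext test */
    const int frame_bytes  = 66;   /* Eth/IP/TCP preamble per packet */

    int pkt_bytes = depth * req_bytes + frame_bytes;   /* 2706B, as captured */
    double rps = link_bits / (pkt_bytes * 8.0) * depth;
    printf("max pipelined requests/sec: %.0f\n", rps); /* ~7,390,983 */
    return 0;
}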

so likely ~7.2-7.3m requests per second

The same conclusion we initially reached a few years ago (sorry, I forgot about bits/bytes), which is why I am skeptical about the numbers - a limit validated by more than 10 other frameworks' results.

But from what I have seen so far there is nothing abnormal with FAF; this is getting very interesting.

sebastienros avatar Jun 18 '22 01:06 sebastienros

@billywhizz Interesting, thank you for posting the info.

I'm going to have to think about this for a while. I've gone through the code multiple times today and tested it in various ways, and I don't see any issues there, so I'm at a bit of a loss. Although not totally relevant, I'm also using FaF in a couple of production environments that serve up a decent chunk of traffic, and I have not noticed any issues over the months it has been in use.

errantmind avatar Jun 18 '22 01:06 errantmind

@errantmind

yes - i think both myself and @sebastienros are in agreement that nothing seems to be wrong with the code or with what we are seeing on the wire. i'm wondering if it could be some kind of issue/bug in wrk and how it counts pipelined responses? i'm kinda stumped tbh. :confused:

billywhizz avatar Jun 18 '22 02:06 billywhizz

There is one thing that is different for faf. I assume most frameworks use port 8080, but faf is using 8089. @errantmind, would you like to submit a PR that makes it use 8080 instead? I believe it's only in 3 locations (main.rs, the Dockerfile EXPOSE, and benchmark_config.json). There is no reason it would change anything, right? But wrk ... who knows.

sebastienros avatar Jun 18 '22 02:06 sebastienros

@errantmind @sebastienros

one thing i am thinking on here.

  • we have established that max rps from client to server is ~7.2m
  • we have established that max rps from server to client is ~9.4m

so, i think it might be possible that faf, due to a bug, is writing more responses than the requests it receives, and that wrk terminates early when it receives the expected number of responses to the requests it has sent (some of which could actually still be in flight, or possibly not even sent yet, depending on how wrk "works" internally). this seems to me the only thing that could explain the numbers we are seeing. there's no system-level "trick" afaict that would allow sending more than ~7.2m requests per second if the numbers we discussed above are correct.

does that make sense?

i'll see if i can reproduce this and/or have a dig into the source code of wrk to see if it's possible/likely.

billywhizz avatar Jun 18 '22 03:06 billywhizz
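A sketch of the kind of check such a proxy can do: count request and response start-lines in each direction of the byte stream and compare the totals. This is a hypothetical illustration, not wrk or faf code; a real implementation would also handle a marker split across read() boundaries:

#include <stdio.h>
#include <string.h>

/* Count occurrences of a start-line marker in a captured byte stream,
   e.g. "GET " for client->server and "HTTP/1.1 200" for server->client. */
static size_t count_marker(const char *buf, size_t len, const char *m) {
    size_t n = 0, mlen = strlen(m);
    for (size_t i = 0; i + mlen <= len; i++)
        if (memcmp(buf + i, m, mlen) == 0)
            n++;
    return n;
}

int main(void) {
    const char stream[] = "HTTP/1.1 200 OK\r\n...\r\nHTTP/1.1 200 OK\r\n";
    printf("responses seen: %zu\n",
           count_marker(stream, sizeof stream - 1, "HTTP/1.1 200"));
    return 0;
}

If faf ever wrote more responses than it received requests, the counts in the two directions would diverge.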

@billywhizz let me know what you find. If there is a bug, then I'll submit a PR to fix it. If not, I'll change the port.

errantmind avatar Jun 18 '22 03:06 errantmind

@errantmind @sebastienros

i did some testing here with a proxy between wrk and faf and always see the same number of responses sent out of faf as requests received, so it doesn't look like my theory is correct. :man_shrugging:

billywhizz avatar Jun 20 '22 18:06 billywhizz

I'm happy there doesn't appear to be a bug from your tests, but I'd still like an explanation of why it exceeds the calculated limits described earlier.

It seems like it is either an even harder-to-replicate bug, or something is off in the calculation - e.g., the official tfb benchmark isn't sending as many bytes in each request as it appears in the repo... or something along those lines.

Of course, it could be something else entirely as well. I don't have time at the moment to mount a full-scale investigation of my own, but if anyone eventually provides a full explanation, it will be appreciated, whatever the reason is.

errantmind avatar Jun 21 '22 05:06 errantmind

I did look at wrk, and it seems the per-thread request counting could be a problem. Reading data from requests can have issues if record_rate is called with some interleaving: https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513060e72ac4a0/src/wrk.c#L273-L289, as well as the request counting itself: https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513060e72ac4a0/src/wrk.c#L330-L331

By contrast, stats_record itself is protected from race conditions: https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513060e72ac4a0/src/stats.c#L22-L31

From what I understand, if that were the case, the RPS would increase (which is strange). I can prepare a fix for wrk if that's appropriate, but I have no means to run the TFB suite to validate the changes.

kant2002 avatar Jul 05 '22 05:07 kant2002
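Not wrk's actual code, but a minimal pthreads sketch of the general hazard being pointed at: two threads doing an unsynchronized read-modify-write on a shared counter lose increments, so totals come out wrong:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t counter = 0;  /* shared, deliberately unsynchronized */

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;            /* non-atomic load + add + store */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* expected 2000000; with interleaving, typically prints less */
    printf("counter = %llu\n", (unsigned long long)counter);
    return 0;
}

Replacing the increment with C11 atomic_fetch_add (or taking a lock, as the protected stats_record path does) removes the race. Note that a lost-update race undercounts rather than overcounts, which is why an RPS increase would indeed be strange.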

As I posted on Twitter in reply to Egor Bogatov, my first thought is that there may be an error either in the composition of responses or in the way wrk counts responses, as @kant2002 discusses above.

I'd recommend reviewing the total received bytes reported by wrk versus the expected total received size of N responses multiplied by M bytes per response. My hunch is that these two values won't align.

bhauer avatar Jul 05 '22 18:07 bhauer

I'm looking at the numbers wrk displays for local testing on my very old laptop - admittedly not the best target for testing. What I observe in the output:

Running 15s test @ http://tfb-server:8089/plaintext
  4 threads and 4096 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.20ms   12.84ms 127.56ms   56.39%
    Req/Sec   177.73k    64.61k  309.83k    60.81%
  Latency Distribution
     50%   36.82ms
     75%   53.86ms
     90%    0.00us
     99%    0.00us
  10375504 requests in 15.09s, 1.22GB read
  Socket errors: connect 0, read 0, write 5024, timeout 0
Requests/sec: 687757.44
Transfer/sec:     82.64MB
STARTTIME 1657083529
ENDTIME 1657083544

I cannot connect these numbers in any meaningful way: the request count, the total execution time, Requests/sec, and per-thread Req/Sec. All the calculations seem to be off.

10375504 requests in 15.09s means Requests/sec should be 10_375_504 / 15.09 = 687_574.818, which differs significantly from the reported 687_757.44 (183 RPS missing somehow). It also means the average per-thread RPS is 10_375_504 / (15.09 × 4) = 171_893.704, which is significantly less than the 177.73k reported by wrk. If I naively connect the reported Requests/sec of 687_757.44 to the average per-thread RPS: 687_757.44 / 4 = 171_938.61.

I might attribute the number differences to https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513060e72ac4a0/src/wrk.c#L277-L278

but I would like somebody to find a flaw in my reasoning, or point out an error in my interpretation of the wrk results.

kant2002 avatar Jul 06 '22 05:07 kant2002

@kant2002 does the Requests/sec metric include the time required to receive responses as well, or does it only consider the "test duration time"?

franz1981 avatar Jul 06 '22 08:07 franz1981

In reality no framework reaches 7M req/s - only faf, with 8.6M.

If we check these numbers with such precision, we must know where they come from.

TechEmpower toolset numbers

Historically, the numbers in the benchmark graphs are a bit higher than in reality, but this affects all frameworks similarly. wrk returns Requests/sec, but it is different from the numbers shown in the charts.

Why?

wrk's req/s also includes the non-2xx or 3xx responses and socket errors (in plaintext, with pipelining). We want only the correct req/s, but wrk doesn't have that option - an nginx with a bad app behind it can return 500s or 502s very fast.

So they take the total number of requests, subtract the errors, and divide by 15 seconds to get the chart numbers: (total requests - errors) / 15 seconds. If you check results.json, only totalRequests and 5xx (if there are errors) exist, not the req/s from wrk.

It seems correct, but wrk never runs for exactly 15 seconds (~15.00-15.14s).

The STARTTIME and ENDTIME are only cosmetic and not used; they have one-second precision only.

Diff

When there are no errors, it is easy to see the difference.

Framework                  TechEmpower req/s   Real wrk req/s   Total req / 15s                  Link
Faf                        8,672,182           8,614,646.29     130,082,744/15 = 8,672,182.93    link
Pico.v                     7,028,146           6,981,694.59     105,422,192/15 = 7,028,146.13    link
may-minihttp               7,023,484           6,977,067.30     105,352,272/15 = 7,023,484.8     link
asp.net core [platform]    7,023,107           6,976,690.42     105,346,616/15 = 7,023,107.73    link

Wrk

@kant2002 About the numbers returned by wrk: you need to calculate them the other way around. (Total requests / Requests/sec) = seconds (rounded):

10375504 / 687757.44 = 15.08599310826794, and rounded is 15.09s.

joanhey avatar Jul 06 '22 23:07 joanhey
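The same reconciliation in code, with the constants taken from the wrk output quoted earlier in the thread:

#include <stdio.h>

int main(void) {
    const double total_requests = 10375504;   /* from the wrk output above */
    const double reported_rps   = 687757.44;
    /* wrk divides by the measured run time (sub-second precision), not by
       a flat 15s; recovering that run time explains the apparent mismatch. */
    double duration = total_requests / reported_rps;
    printf("run time: %.6f s (displayed rounded as %.2f s)\n",
           duration, duration);
    return 0;
}

This prints 15.085993 s, displayed as 15.09 s.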

@joanhey I did find yesterday that the bottom numbers are slightly different because of display precision, but I was too tired to get back to it. Still, the difference between total RPS and per-thread RPS is not clear to me. The total test duration is the time between starting the first thread and the last thread processing its last request.

Also, if you follow the links above, wrk reports for all frameworks that it received ~12GB in ~15 sec, while for FAF it received ~15GB of data, and I would like to examine that. wrk seems to report only the amount of data received, not the amount sent, even though it collects both. I become suspicious when the numbers don't add up to direct calculations.

Okay, let's take a look at @billywhizz's reply: https://github.com/TechEmpower/FrameworkBenchmarks/issues/7402#issuecomment-1159327062

~~On the screenshot the size of the response is 149 bytes, not 165, which leads to 16*149+66 = 2450 for a pipeline level of 16, and thus 10_000_000_000/(2450*8)*16 = 8_163_265.30612245, which seems closer to the numbers we are all seeing. And if network cards somehow used 1024 bits instead of 1000 as the base, we would get 10*1024*1024*1024/(2450*8)*16 = 8_765_239.37959184. The last calculation is pure speculation, but it may explain things.~~

kant2002 avatar Jul 07 '22 01:07 kant2002

It seems I should not do anything too early in the morning. Ignore my last message, except the part about per-thread RPS. I do not understand why that number is bigger than total RPS / thread count.

kant2002 avatar Jul 07 '22 01:07 kant2002

@kant2002 You will get more information directly from the wrk repo. https://github.com/wg/wrk/issues/259#issuecomment-359228045

joanhey avatar Jul 07 '22 19:07 joanhey

one thing i have found when testing with wrk is that the number of requests it reports is literally never the actual number of requests the server received - from tests i have done, it is undercounting by ~3%. i'll see if i can spend some time over the next week or two investigating this issue further - it's definitely a weird one!

[screenshot: wrk report vs. server-side request count]

this script is useful for seeing RPS across threads, but it seems to undercount requests in the same way, so something unusual seems to be happening inside wrk itself.

-- wrk script: concatenates `depth` requests into one pipelined payload
-- and tallies responses per HTTP status code.
-- usage: wrk -t4 -c4096 -d15s -s pipeline.lua http://tfb-server:8089/plaintext -- 16

local counter = 1
local threads = {}

--wrk.method = "PUT"
--wrk.headers["Content-Type"] = "application/json"

-- called once per thread before the run: keep a handle to each thread
-- so `done` can read its per-thread status counts afterwards
function setup(thread)
  thread:set("id", counter)
  table.insert(threads, thread)
  counter = counter + 1
end

-- tally each response by HTTP status code
response = function(status, headers, body)
  if statuses[status] == nil then
    statuses[status] = 1
  else
    statuses[status] = statuses[status] + 1
  end
end

-- after the run: print per-thread status counts, then latency percentiles
done = function(summary, latency, requests)
  for index, thread in ipairs(threads) do
    local statuses = thread:get("statuses")
    local id = thread:get("id")
    for key, value in pairs(statuses) do
      io.write(string.format("Thread: %d, %s: %d\n", id, key, value))
    end
  end
  for _, p in pairs({ 50, 75, 90, 95, 99, 99.999 }) do
    n = latency:percentile(p)
    io.write(string.format("%g%%,%d\n", p, n))
  end
end

-- per-thread init: build one payload of `depth` concatenated requests
-- (the pipeline depth comes from the first script argument)
init = function(args)
  statuses = {}
  local r = {}
  local depth = tonumber(args[1]) or 1
  for i=1,depth do
    r[i] = wrk.format()
  end
  req = table.concat(r)
end

-- every "request" wrk sends is the whole pipelined payload
request = function()
  return req
end

billywhizz avatar Jul 09 '22 07:07 billywhizz

General FYI: I am actively looking into this today as it is annoying not to know what is going on here. I have a few suspicions and will post an update here today/tomorrow with whatever I find.

In the meantime, I ask that FaF be excluded from this round of the official results until we have an answer here. @nbrady-techempower

errantmind avatar Jul 17 '22 22:07 errantmind