Comparison with Golang
I was curious how much faster a C++ server would be than a Go server, and I was a little surprised by the result.
I wrote equivalent code in both languages and tested it with Apache Benchmark:
ab -n 2000000 -c 1000 -k http://localhost:9595/
The Go code looks like this:
package main

import "net/http"

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Hello World!\n"))
	})
	http.ListenAndServe("localhost:9595", nil)
}
And the C++:
#include <pistache/endpoint.h>
#include <iostream>

using namespace Pistache;

class HelloHandler : public Http::Handler {
public:
    HTTP_PROTOTYPE(HelloHandler)

    void onRequest(const Http::Request& request, Http::ResponseWriter response) override {
        response.send(Http::Code::Ok, "Hello World!\n");
    }
};

int main() {
    std::cout << "Server listening" << std::endl;
    Address addr(Ipv4::any(), Port(9595));
    auto opts = Http::Endpoint::options().threads(6);
    Http::Endpoint server(addr);
    server.init(opts);
    server.setHandler(Http::make_handler<HelloHandler>());
    server.serve();
}
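For anyone wanting to reproduce this, one way to build the C++ example is via pkg-config, assuming pistache is installed with its pkg-config file (usually libpistache.pc; the source file name hello.cc is just an example):

```shell
# Build with optimizations; adjust the .pc name/path for your install.
g++ -O2 -std=c++17 hello.cc \
    $(pkg-config --cflags --libs libpistache) -pthread -o hello_server
```

Comparing at -O0 vs -O2 matters here; an unoptimized pistache build would skew the benchmark badly.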
ab gave the following results:
- C++: Requests per second: 55138.30 [#/sec] (mean)
- Go: Requests per second: 58193.58 [#/sec] (mean)
I ran the test several times, and Go was always slightly faster than C++.
What am I doing wrong? How should I properly use multithreading in pistache? Why is the C++ server slower?
Thank you in advance.
Interesting.
- How many CPU cores are on your server?
- Did you run 'ab' from the same machine? How many cores was it using?
- Might be interesting to run "perf top" on the same machine as pistache and see where the CPU time is going.
- Might be interesting to run pistache under gprof (yes, it will run much slower...) and see where the CPU time is going.
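For the perf suggestion above, perf can be pointed at just the pistache process; assuming the server binary is named hello_server (adjust to your binary name), something like:

```shell
# Profile only the pistache process, live:
perf top -p "$(pidof hello_server)"

# Or record call graphs for 20 seconds while the benchmark runs, then inspect:
perf record -g -p "$(pidof hello_server)" -- sleep 20
perf report
```
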
Interesting.
Undoubtedly. :-)
- How many CPU cores are on your server?
On my AMD Ryzen 5 4600H laptop with 6 cores and 12 threads. (For the argument to
Http::Endpoint::options().threads(6);
I also specified 1, 12, and 255; the results were about the same.)
- Did you run 'ab' from the same machine? How many cores was it using?
Yes, in both cases (Go and C++) I started the server locally and ran ab on the same host.
I will answer a little later.
Might be interesting to run "perf top" on the same machine as pistache and see where the CPU time is going.
C++ server
Golang server
It seems that the C++ server isn't doing any work at all.
Please tell "perf top" to only look at the pistache process, and try to take the screenshot with the "ab" window not covering up the interesting bits of the perf top output.
Will it fit?
Thank you for the report and update. I do not have time to dig deeper at the moment, but hope to within a few days.
Go implements a different threading model than C++17, so that might account for part of it. The hotspot in pistache appears to be heap allocations. Linking against tcmalloc might yield a small improvement.
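The tcmalloc idea can be tried without relinking by preloading the library; the path is distro-dependent (the one below is typical for Ubuntu's google-perftools packages, and hello_server stands in for your binary):

```shell
# Swap the allocator for one run via LD_PRELOAD:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 ./hello_server
```

If it helps, adding -ltcmalloc to the link flags makes it permanent.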
With recent GCCs parallel execution policies are implemented with Intel's Thread Building Blocks. I'm not sure what Go uses.
Ivan: Which compiler are you using? If it's GCC, can you also run a test where you build pistache with a recent clang?
Kip: Do we have any options for automatic performance testing, similar to our CI?
I don't think so. But I suspect @Tachi107 could have fun with that with the new skills he's been learning lately.
I attempted to conduct my own performance testing against my pistacheio application. However, most of my overhead was in getting/releasing connections to PostgreSQL from my connection pooler, and in logging HTTP requests to the same database. I did not use "ab" (Apache Benchmark); instead I used "hey" (https://github.com/rakyll/hey); it's functionally identical.
I'll need to compile Ivan's examples and tinker.
Suggestion: We either create a "benchmark" directory in pistacheio/pistache and add Ivan's examples (and possibly one in rust), and code up some sort of little "run it locally" benchmark suite, or create a github project like "pistacheio/benchmarks" and place them there. It would be nice to have a set of sample servers that all return identical results for "HTTP GET /" (one using pistache, and others using other tech), and some framework or scripts for testing them all.
I've modified "examples/hello_server.cc" to use 10 threads. I ran hey (as follows) and received the following results on an AMD 5950X (16 cores, 32 threads, 64 GiB RAM). My system was not idle, though; it had a background load of ~2 when the benchmark was not running.
I'm not familiar with meson, so I don't know off the top of my head how to compile it with gprof enabled (-g -pg -no-pie).
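A sketch of how the gprof flags could be injected through meson (the -Dcpp_args/-Dcpp_link_args options are standard meson setup options; build-gprof is just a directory name):

```shell
# Configure a separate build dir with gprof instrumentation:
meson setup build-gprof -Dbuildtype=debugoptimized \
    -Dcpp_args="-pg -no-pie" -Dcpp_link_args="-pg -no-pie"
ninja -C build-gprof

# To check what an existing build dir is using:
meson configure build | grep -E "optimization|buildtype"
```
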
$ hey -z 20s -c 100 -cpus 10 http://127.0.0.1:9080/
Summary:
Total: 20.0024 secs
Slowest: 0.1282 secs
Fastest: 0.0001 secs
Average: 0.0053 secs
Requests/sec: 18947.2669
Total data: 4547892 bytes
Size/request: 12 bytes
Response time histogram:
0.000 [1] |
0.013 [328294] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.026 [26007] |■■■
0.039 [14244] |■■
0.051 [6635] |■
0.064 [2611] |
0.077 [821] |
0.090 [223] |
0.103 [104] |
0.115 [32] |
0.128 [19] |
Latency distribution:
10% in 0.0004 secs
25% in 0.0006 secs
50% in 0.0010 secs
75% in 0.0020 secs
90% in 0.0184 secs
95% in 0.0299 secs
99% in 0.0514 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0011 secs, 0.0001 secs, 0.1282 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0013 secs, 0.0000 secs, 0.0933 secs
resp wait: 0.0006 secs, 0.0000 secs, 0.0686 secs
resp read: 0.0023 secs, 0.0000 secs, 0.1109 secs
Status code distribution:
[200] 378991 responses
I wonder whether the meson build uses -O0, -O2, or some other optimization level.
I should install "ab" and test "ab" vs "hey" with identical configs against an identical HTTP server, to see whether "hey" performs the same as "ab" or not.
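For that comparison, roughly equivalent invocations might look like this (port 9080 matches the hello_server run above; note that hey keeps connections alive by default, so ab needs -k to match, and ab's -t caps the run at 50000 requests unless -n raises the limit):

```shell
# 20-second run, 100 concurrent connections, keep-alive on:
ab -n 100000000 -t 20 -c 100 -k http://127.0.0.1:9080/
hey -z 20s -c 100 http://127.0.0.1:9080/
```
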
@dennisjenkins75
Ivan: Which compiler are you using? If it's GCC, can you also run a test where you build pistache with a recent clang?
Yesterday I used clang version 10.0.0-4ubuntu1, and today, after your question, I compiled with g++ 9.3.0. The results are roughly the same.
Or do I need clang version 12?
I also decided to try using hey:
CPP_SERVER:
▶./hey_linux_amd64 -z 20s -c 100 -cpus 10 http://127.0.0.1:9595
Summary:
Total: 20.0075 secs
Slowest: 0.2453 secs
Fastest: 0.0002 secs
Average: 0.0060 secs
Requests/sec: 16578.0475
Total data: 4311918 bytes
Size/request: 13 bytes
Response time histogram:
0.000 [1] |
0.025 [307081] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.049 [16753] |■■
0.074 [5835] |■
0.098 [1486] |
0.123 [350] |
0.147 [99] |
0.172 [43] |
0.196 [26] |
0.221 [10] |
0.245 [2] |
Latency distribution:
10% in 0.0008 secs
25% in 0.0012 secs
50% in 0.0019 secs
75% in 0.0033 secs
90% in 0.0144 secs
95% in 0.0351 secs
99% in 0.0641 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0017 secs, 0.0001 secs, 0.1927 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0009 secs, 0.0000 secs, 0.1945 secs
resp wait: 0.0016 secs, 0.0000 secs, 0.0790 secs
resp read: 0.0017 secs, 0.0000 secs, 0.1273 secs
Status code distribution:
[200] 331686 responses
GOLANG_SERVER:
▶./hey_linux_amd64 -z 20s -c 100 -cpus 10 http://127.0.0.1:9595
Summary:
Total: 20.0018 secs
Slowest: 0.0610 secs
Fastest: 0.0001 secs
Average: 0.0020 secs
Requests/sec: 96421.6241
Total data: 25071878 bytes
Size/request: 25 bytes
Response time histogram:
0.000 [1] |
0.006 [993014] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.012 [6571] |
0.018 [353] |
0.024 [12] |
0.031 [29] |
0.037 [9] |
0.043 [8] |
0.049 [0] |
0.055 [0] |
0.061 [3] |
Latency distribution:
10% in 0.0002 secs
25% in 0.0003 secs
50% in 0.0007 secs
75% in 0.0014 secs
90% in 0.0023 secs
95% in 0.0031 secs
99% in 0.0055 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0000 secs, 0.0018 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0584 secs
resp wait: 0.0013 secs, 0.0000 secs, 0.0388 secs
resp read: 0.0004 secs, 0.0000 secs, 0.0292 secs
Status code distribution:
[200] 1000000 responses
16,000 rps vs 96,000 rps.
Let's do some causal profiling!
https://github.com/plasma-umass/coz
These modern network-based projects are exactly where causal profilers succeed and traditional profilers fall a bit short.
We'll need to set up some progress points in the pistache library. I'm not familiar enough with pistache's internals to tinker with it. Happy to help if you give me some pointers to the "important parts." Someone mentioned heap allocations? Any file or class in particular?