
Comparison with Golang

Open ionkrutov opened this issue 3 years ago • 16 comments

I was curious how much faster a C++ server would be than a Go server, and I was a little surprised. I wrote equivalent code in both languages and tested with Apache Benchmark: ab -n 2000000 -c 1000 -k http://localhost:9595/

The Golang code looks like this:


package main

import "net/http"

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Hello World!\n"))
	})

	http.ListenAndServe("localhost:9595", nil)
}

On C++:

#include <pistache/endpoint.h>
#include <iostream>

using namespace Pistache;

class HelloHandler : public Http::Handler {
public:
    HTTP_PROTOTYPE(HelloHandler)
    void onRequest(const Http::Request& request, Http::ResponseWriter response) override {
        response.send(Http::Code::Ok, "Hello World!\n");
    }
};
 
int main() {
  std::cout << "Server listening on port 9595" << std::endl;
  Address addr(Ipv4::any(), Port(9595));
  auto opts = Http::Endpoint::options().threads(6);
  Http::Endpoint server(addr);
  server.init(opts);
  server.setHandler(Http::make_handler<HelloHandler>());
  server.serve();

}
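I also experimented with the endpoint options. A sketch of the variants I tried (the ReuseAddr flag is my reading of the Tcp::Options enum; I have not verified that it changes the numbers):

```cpp
#include <pistache/endpoint.h>

using namespace Pistache;

int main() {
    Address addr(Ipv4::any(), Port(9595));

    // Thread count: tried 1, 6, 12, 255 with roughly the same results.
    // ReuseAddr just allows fast restarts between benchmark runs.
    auto opts = Http::Endpoint::options()
                    .threads(6)
                    .flags(Tcp::Options::ReuseAddr);

    Http::Endpoint server(addr);
    server.init(opts);
    // handler setup and serve() as in the example above
}
```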

ab gave the following results:

  C++:    Requests per second: 55138.30 [#/sec] (mean)
  Golang: Requests per second: 58193.58 [#/sec] (mean)

I ran the test several times, and Golang was always slightly faster than C++.

What am I doing wrong? How should I properly use multithreading in pistache? Why is C++ slower?

Thank you in advance.

ionkrutov avatar Sep 29 '21 16:09 ionkrutov

Interesting.

  1. How many CPU cores are on your server?
  2. Did you run 'ab' from the same machine? How many cores was it using?
  3. Might be interesting to run "perf top" on the same machine as pistache and see where the CPU time is going.
  4. Might be interesting to run pistache under gprof (yes, it will run much slower...) and see where the CPU time is going.
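For points 3 and 4, something along these lines (the binary name and exact flags are my guesses, untested):

```shell
# Point 3: sample only the pistache process, not the whole machine
perf top -p "$(pgrep -f hello_server)"

# Point 4: rebuild with gprof instrumentation, run the benchmark, then
# inspect the flat profile. Note gmon.out is only written on a clean exit,
# so the server needs to shut down normally rather than be killed.
g++ -O2 -g -pg -no-pie hello.cc -o hello -lpistache -lpthread
./hello
gprof ./hello gmon.out | head -40
```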

dennisjenkins75 avatar Sep 29 '21 17:09 dennisjenkins75

Interesting.

Undoubtedly. :-)

  1. How many CPU cores are on your server?

On my AMD Ryzen 5 4600H laptop with 6 cores and 12 threads. (As the argument to Http::Endpoint::options().threads() I specified 1, 12, and 255; the results were about the same.)

  2. Did you run 'ab' from the same machine? How many cores was it using?

Yes, in both cases (Golang and C++) I started the server locally and ran ab on the same host.

I will answer a little later.

ionkrutov avatar Sep 29 '21 19:09 ionkrutov

Might be interesting to run "perf top" on the same machine as pistache and see where the CPU time is going.

C++ server: [perf top screenshot]

Golang server: [perf top screenshot]

It seems that the C++ server isn't really doing any work at all.

ionkrutov avatar Sep 29 '21 20:09 ionkrutov

Please tell "perf top" to only look at the pistache process, and try to take the screenshot with the "ab" window not covering up the interesting bits of the perf top output.

dennisjenkins75 avatar Sep 29 '21 20:09 dennisjenkins75

Will this do?

[perf top screenshot]

ionkrutov avatar Sep 29 '21 21:09 ionkrutov

Thank you for the report and update. I do not have time to dig deeper at the moment, but hope to within a few days.

Go implements a different threading model than C++17, so that might account for part of it. The hotspot in pistache appears to be heap allocations; linking against tcmalloc might yield a small improvement.
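For anyone who wants to try the tcmalloc idea quickly, LD_PRELOAD avoids a relink (the .so path is a guess and varies by distro):

```shell
# Swap the allocator at load time instead of rebuilding; then rerun ab/hey
# against the server as before and compare requests/sec.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so ./hello_server
```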

dennisjenkins75 avatar Sep 29 '21 22:09 dennisjenkins75

With recent GCCs, parallel execution policies are implemented with Intel's Threading Building Blocks. I'm not sure what Go uses.

kiplingw avatar Sep 29 '21 22:09 kiplingw

Ivan: Which compiler are you using, and if it's GCC, can you also run a test where you build pistache with a recent clang?

Kip: Do we have any options for automatic performance testing, similar to our CI?

dennisjenkins75 avatar Sep 29 '21 22:09 dennisjenkins75

I don't think so. But I suspect @Tachi107 could have fun with that with the new skills he's been learning lately.

kiplingw avatar Sep 29 '21 23:09 kiplingw

I attempted to conduct my own performance testing against my pistacheio application. However, most of my overhead was in getting/releasing connections to PostgreSQL from my connection pooler, and in logging HTTP requests to the same database. I did not use "ab" (Apache Benchmark); instead I used "hey" (https://github.com/rakyll/hey); it's functionally identical.

I'll need to compile Ivan's examples and tinker.

Suggestion: We either create a "benchmark" directory in pistacheio/pistache and add Ivan's examples (and possibly one in rust), and code up some sort of little "run it locally" benchmark suite, or create a github project like "pistacheio/benchmarks" and place them there. It would be nice to have a set of sample servers that all return identical results for "HTTP GET /" (one using pistache, and others using other tech), and some framework or scripts for testing them all.

dennisjenkins75 avatar Sep 30 '21 02:09 dennisjenkins75

I've modified "examples/hello_server.cc" to use 10 threads. I ran hey (as follows) and got the following results on an AMD 5950X (16 cores, 32 threads, 64 GiB RAM). My system was not idle, though; it had a background load of ~2 when the benchmark was not running.

I'm not familiar with meson, so I don't know off the top of my head how to compile it with gprof enabled (-g -pg -no-pie).
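A guess at how it might work: meson picks up CXXFLAGS/LDFLAGS from the environment when the build directory is first set up, so something like this (untested):

```shell
# Fresh build dir with gprof instrumentation baked into the flags
CXXFLAGS="-g -pg" LDFLAGS="-pg -no-pie" meson setup build-gprof
ninja -C build-gprof
```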

$ hey -z 20s -c 100 -cpus 10  http://127.0.0.1:9080/ 

Summary:
  Total:	20.0024 secs
  Slowest:	0.1282 secs
  Fastest:	0.0001 secs
  Average:	0.0053 secs
  Requests/sec:	18947.2669
  
  Total data:	4547892 bytes
  Size/request:	12 bytes

Response time histogram:
  0.000 [1]	|
  0.013 [328294]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.026 [26007]	|■■■
  0.039 [14244]	|■■
  0.051 [6635]	|■
  0.064 [2611]	|
  0.077 [821]	|
  0.090 [223]	|
  0.103 [104]	|
  0.115 [32]	|
  0.128 [19]	|


Latency distribution:
  10% in 0.0004 secs
  25% in 0.0006 secs
  50% in 0.0010 secs
  75% in 0.0020 secs
  90% in 0.0184 secs
  95% in 0.0299 secs
  99% in 0.0514 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0011 secs, 0.0001 secs, 0.1282 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0013 secs, 0.0000 secs, 0.0933 secs
  resp wait:	0.0006 secs, 0.0000 secs, 0.0686 secs
  resp read:	0.0023 secs, 0.0000 secs, 0.1109 secs

Status code distribution:
  [200]	378991 responses

dennisjenkins75 avatar Sep 30 '21 02:09 dennisjenkins75

I wonder if the meson build uses -O0 or -O2, or some other optimization strategy.
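If I recall correctly, meson's default buildtype is "debug" (-O0 -g), which alone could explain a large gap. Something like this should confirm and switch it (syntax from memory, worth double-checking):

```shell
meson configure build | grep buildtype     # show the current buildtype
meson configure build -Dbuildtype=release  # switch to an optimized build (-O3)
ninja -C build
```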

dennisjenkins75 avatar Sep 30 '21 02:09 dennisjenkins75

I should install "ab" and test "ab" vs "hey" with identical configs and an identical HTTP server, to see whether "hey" performs the same as "ab" or not.

dennisjenkins75 avatar Sep 30 '21 04:09 dennisjenkins75

@dennisjenkins75

Ivan: Which compiler are you using, and if it's GCC, can you also run a test where you build pistache with a recent clang?

Yesterday I used clang version 10.0.0-4ubuntu1, and today, after your question, I compiled with g++ 9.3.0. The results are roughly the same.

Or do I need clang 12?

ionkrutov avatar Sep 30 '21 18:09 ionkrutov

I also decided to try using hey:

CPP_SERVER:

▶./hey_linux_amd64 -z 20s  -c 100 -cpus 10 http://127.0.0.1:9595

Summary:
  Total:	20.0075 secs
  Slowest:	0.2453 secs
  Fastest:	0.0002 secs
  Average:	0.0060 secs
  Requests/sec:	16578.0475
  
  Total data:	4311918 bytes
  Size/request:	13 bytes

Response time histogram:
  0.000 [1]	|
  0.025 [307081]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.049 [16753]	|■■
  0.074 [5835]	|■
  0.098 [1486]	|
  0.123 [350]	|
  0.147 [99]	|
  0.172 [43]	|
  0.196 [26]	|
  0.221 [10]	|
  0.245 [2]	|


Latency distribution:
  10% in 0.0008 secs
  25% in 0.0012 secs
  50% in 0.0019 secs
  75% in 0.0033 secs
  90% in 0.0144 secs
  95% in 0.0351 secs
  99% in 0.0641 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0017 secs, 0.0001 secs, 0.1927 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0009 secs, 0.0000 secs, 0.1945 secs
  resp wait:	0.0016 secs, 0.0000 secs, 0.0790 secs
  resp read:	0.0017 secs, 0.0000 secs, 0.1273 secs

Status code distribution:
  [200]	331686 responses



GOLANG_SERVER:

▶./hey_linux_amd64 -z 20s  -c 100 -cpus 10 http://127.0.0.1:9595

Summary:
  Total:	20.0018 secs
  Slowest:	0.0610 secs
  Fastest:	0.0001 secs
  Average:	0.0020 secs
  Requests/sec:	96421.6241
  
  Total data:	25071878 bytes
  Size/request:	25 bytes

Response time histogram:
  0.000 [1]	|
  0.006 [993014]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.012 [6571]	|
  0.018 [353]	|
  0.024 [12]	|
  0.031 [29]	|
  0.037 [9]	|
  0.043 [8]	|
  0.049 [0]	|
  0.055 [0]	|
  0.061 [3]	|


Latency distribution:
  10% in 0.0002 secs
  25% in 0.0003 secs
  50% in 0.0007 secs
  75% in 0.0014 secs
  90% in 0.0023 secs
  95% in 0.0031 secs
  99% in 0.0055 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0000 secs, 0.0018 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0584 secs
  resp wait:	0.0013 secs, 0.0000 secs, 0.0388 secs
  resp read:	0.0004 secs, 0.0000 secs, 0.0292 secs

Status code distribution:
  [200]	1000000 responses




~16,000 rps vs ~96,000 rps

ionkrutov avatar Sep 30 '21 19:09 ionkrutov

Let's do some causal profiling!

https://github.com/plasma-umass/coz

These modern network-based projects are exactly where causal profilers succeed and traditional profilers fall a bit short.

We'll need to set up some progress points in the pistache library. I'm not familiar enough with pistache's internals to do this myself, but I'm happy to help if you give me some pointers to the "important parts." Someone mentioned heap allocations? Any file or class in particular?
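For example, if the handler from the original post is the hot path, a throughput progress point could go right after the response is sent. A sketch, assuming coz is installed and the build has debug info (-g); I haven't run this against pistache:

```cpp
#include <coz.h>                 // provides the COZ_PROGRESS macro
#include <pistache/endpoint.h>

using namespace Pistache;

class HelloHandler : public Http::Handler {
public:
    HTTP_PROTOTYPE(HelloHandler)
    void onRequest(const Http::Request&, Http::ResponseWriter response) override {
        response.send(Http::Code::Ok, "Hello World!\n");
        COZ_PROGRESS;  // coz measures how speeding up other code affects
                       // the rate at which this line is reached
    }
};
```

The server would then be started under the profiler with something like `coz run --- ./server` while the benchmark runs, and the resulting profile.coz loaded into the coz viewer.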

e-dant avatar Jan 07 '22 17:01 e-dant