
Difference between echo_server and bench

Steamgjk opened this issue 4 years ago • 3 comments

I made a rough diff between echo_server and bench.

  1. I notice that while doing init_raft in echo_server, asio_opt.thread_pool_size_ = 4, but in bench, asio_opt.thread_pool_size_ = 32. Will thread_pool_size_ cause a big difference in performance?
  2. In the bench program, an asio_listener_ is created, but in echo_server there is not. What is asio_listener_ for? Why doesn't echo_server need it?
  3. How big is the performance difference between echo_server and bench? I find echo_server easy to understand; its logic seems to be just "append the log + print the log". If I remove the "print the log" part, will its performance be comparable to the bench program? (I want to run a perf test with distributed clients, rather than coupling clients and the leader into one single program. Since echo_server is easier to tailor and understand than the bench program, I am wondering whether it can also serve as a benchmark.)

Steamgjk avatar May 09 '21 08:05 Steamgjk

Hi @Steamgjk

  1. It depends on 1) your workload and 2) how many cores your machine has. On many-core machines, more threads help achieve better CPU utilization. If your workload is enough to fully utilize 4 threads (so that CPU usage is near 400%) and you have more than 4 cores, then increasing the thread pool size will improve performance. Otherwise, the improvement will be marginal.
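For reference, the thread pool size is set through NuRaft's Asio service options. A minimal sketch, assuming NuRaft's `asio_service::options` with the `thread_pool_size_` field mentioned above; the sizing heuristic (matching the hardware thread count instead of hard-coding 4 or 32) is only an illustration:

```cpp
#include <thread>
#include "nuraft.hxx"  // assumed include; adjust to your build setup

using namespace nuraft;

// Sketch: size the Asio thread pool to the machine rather than
// hard-coding a value. More threads than the workload can keep busy
// brings only marginal gains, as noted above.
asio_service::options make_asio_options() {
    asio_service::options asio_opt;
    unsigned hw = std::thread::hardware_concurrency();
    asio_opt.thread_pool_size_ = hw ? hw : 4;  // fall back if unknown
    return asio_opt;
}
```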

  2. The examples use raft_launcher, which internally creates asio_listener_: https://github.com/eBay/NuRaft/blob/b1f7c07aaf0e263159ac3a530c1d7a9ecc5e741b/src/launcher.cxx#L39

  3. There are a few different Raft settings: https://github.com/eBay/NuRaft/blob/b1f7c07aaf0e263159ac3a530c1d7a9ecc5e741b/tests/bench/raft_bench.cxx#L195-L202 If you set the same parameters, there should be no performance difference.
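For anyone aligning the two examples before comparing them, the parameters in question are fields of NuRaft's `raft_params`. A hedged sketch; the field names exist in NuRaft, but the values below are placeholders, not bench's actual settings (those are at the linked lines):

```cpp
#include "nuraft.hxx"  // assumed include; adjust to your build setup

using namespace nuraft;

// Sketch: give echo_server the same raft_params as bench before
// measuring, so the two programs are comparable. Values are examples.
raft_params make_params() {
    raft_params params;
    params.heart_beat_interval_ = 100;           // ms
    params.election_timeout_lower_bound_ = 200;  // ms
    params.election_timeout_upper_bound_ = 400;  // ms
    params.snapshot_distance_ = 0;               // 0 disables snapshots
    params.client_req_timeout_ = 4000;           // ms
    params.return_method_ = raft_params::async_handler;
    return params;
}
```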

greensky00 avatar May 09 '21 19:05 greensky00

Thanks for the explanation, @greensky00. I just ran a benchmark on Google Cloud, directly using the bench program in the repo. I am using 3 replica VMs, each of type n1-standard-32 (i.e., each VM has 32 cores), and accordingly asio_opt.thread_pool_size_ = 32. The result is as follows: the max throughput is only 34.3K/second, and the p50 latency is 845 us.

[screenshot: max-throughput benchmark results]

Then I ran a low-load test, with the result as follows. With a 1K/second load, the p50 latency is 362 us.

[screenshot: low-load benchmark results]

In my cluster, the ping latency is around 250~300 us, so one RTT should be around that value. Considering that message serialization/deserialization also takes some time, I am fine with the low-load latency (362 us). [If we want to be more critical, there is still some inconsistency with your reported bench results: your RTT is 180 us and you reach a median of 187 us, while my RTT is around 250 us but I measure 362 us.] However, I am a little concerned about the throughput number: there is still a 7K/second gap between my result and your reported bench results (around 40K/second with 16 client threads). And according to your bench report, those replicas only had 8 cores, which means I am using more powerful VMs but getting weaker results. Do you have any idea why? What black magic can we use to further improve the throughput?

[screenshot: reported bench results, for comparison]

Steamgjk avatar May 09 '21 22:05 Steamgjk

@Steamgjk The numbers on the benchmark result page are just for reference, and of course, the performance will vary according to the environment.

And note that the workload generated by the benchmark program will not be CPU-bound unless the network is super-fast. That means having 32 cores does not help to improve the performance. Given the fact that your network environment is a bit slower than that of our data center, the discrepancy of 34.3K vs. 40K seems reasonable.

Regarding p50 latency, the number on the results page was measured with a single client thread at max throughput. You can increase your target throughput to a big number (say, 1M) and re-measure it:

```
raft_bench 1 10.128.0.59:12345 120 1000000 1 256 10.128.0.73:12345 10.128.0.28:12345
```

If you use multiple client threads and higher throughput, a longer p50 latency is expected. Even though client threads call the Raft API independently in parallel, each request is assigned a unique Raft log index, and replication must complete exactly in that order. That means unlucky requests have to wait for the replication (including commit) of Raft logs with smaller index numbers, and this wait time is reflected in the latencies.

greensky00 avatar May 10 '21 04:05 greensky00