
Rust: tonic benchmark against different allocators

Open mstyura opened this issue 1 year ago • 5 comments

Summary of changes

Changes introduced in this pull request:

  • Languages that rely on malloc/free to manage their memory heavily depend on the performance of the allocator library. So, having benchmarks against different allocators could underscore the importance of using the correct allocator. This concept is somewhat similar to using the right GC for Java;
  • I've selected the Rust tonic benchmark and created several additional benchmarks that swap in different allocator libraries at runtime. These benchmarks cover the glibc, mimalloc, jemalloc, tcmalloc, and musl allocators;
  • To simplify allocator replacement at runtime, I've replaced the direct Rust dependency on the jemalloc allocator with the Linux-specific LD_PRELOAD;
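For context, the compile-time alternative to LD_PRELOAD in Rust is the `#[global_allocator]` attribute. The sketch below is not taken from this PR: it wraps the standard `System` allocator in a hypothetical `CountingAlloc` so that it needs no external crates. A crate-provided allocator such as jemalloc would be registered with the same attribute. When no global allocator is set, Rust on Linux falls back to `System`, which calls the libc `malloc` family, and that is what allows a preloaded library to substitute the allocator in a dynamically linked binary.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper used only for illustration: forwards every request
/// to the system allocator and counts how many allocations were made.
pub struct CountingAlloc;

/// Number of allocations served since program start.
pub static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // Delegate to the platform allocator (libc malloc on Linux).
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Compile-time selection: every heap allocation in the binary now goes
// through `CountingAlloc`. Removing this item restores the default.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;
```

The trade-off is that a compile-time choice needs one binary per allocator, whereas the preload approach can reuse a single binary across all variants.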

Reference issue to close (if applicable)

Other information and links

Allocators:

  1. glibc allocator;
  2. musl allocator;
  3. jemalloc;
  4. mimalloc;
  5. tcmalloc;

Sources showing that allocator selection impacts performance:

  1. Benchmarking memory allocators - Julien Voisin;
  2. Suite for benchmarking malloc implementations;
  3. Linux: Testing Alternative C Memory Allocators;
  4. Testing Alternative C Memory Allocators Pt 2: The MUSL mystery

mstyura, Jun 26 '23 08:06

Thanks a lot, @mstyura! I'd love to see how the Tonic server behaves under different allocators. I'll definitely look into the links you provided; they look pretty interesting.

We also need to be a bit pragmatic. In general, when running benchmarks I use this script. Each benchmark adds around 180s*max_cpus to the overall benchmark duration. So, given that we run the benchmarks for up to 6 CPUs, 6 additional benchmarks differing by a single parameter would add 108 minutes of overhead.
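The estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the function and parameter names are made up for illustration and do not come from the benchmark script.

```rust
/// Extra wall-clock minutes added to a full run by `extra_benchmarks`
/// new benchmarks, assuming each one costs `secs_per_cpu * max_cpus` seconds.
pub fn added_minutes(secs_per_cpu: u64, max_cpus: u64, extra_benchmarks: u64) -> u64 {
    // Cost of a single benchmark: 180 s * 6 = 1080 s (18 minutes).
    let secs_per_benchmark = secs_per_cpu * max_cpus;
    // Total for all new benchmarks, converted from seconds to minutes.
    secs_per_benchmark * extra_benchmarks / 60
}
```

With the figures from this thread, `added_minutes(180, 6, 6)` evaluates to 108.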

It'd be great to pick the allocator best suited to this particular workflow (my last research ended up with choosing jemallocator) and put the rest into, e.g., https://github.com/LesnyRumcajs/grpc_bench/tree/master/detailed.

What do you think?

LesnyRumcajs, Jun 26 '23 16:06

Thanks for the additional details! I didn't realize that the cost introduced by additional benchmarks was significant. What I'm thinking right now is that, as a user, I'd definitely like to see the benchmark performance under "default" conditions. The official Rust images are based on either debian (glibc allocator) or alpine (musl allocator), so I believe most people end up using the default allocator that comes with the base image. Hence I think the glibc- and musl-based benchmarks must be run by default for both the single-threaded and multi-threaded variants, i.e. I believe these benchmarks are worth enabling by default:

  1. rust_tonic_st_bench - debian with glibc-based allocator;
  2. rust_tonic_st_musl_bench - alpine with musl-based allocator;
  3. rust_tonic_mt_bench
  4. rust_tonic_mt_musl_bench.

One of the reasons advanced allocators exist is their ability to handle multi-threaded environments better, so it makes sense to move the single-threaded benchmarks with advanced allocators into the detailed folder, i.e.

  1. rust_tonic_st_tcmalloc_bench
  2. rust_tonic_st_mimalloc_bench
  3. rust_tonic_st_jemalloc_bench - this could be kept enabled by default to maintain "compatibility" with the current rust_tonic_st_bench, which is a jemalloc-based benchmark.
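To illustrate why the multi-threaded case is where allocators tend to differ, here is a rough allocation-churn sketch using only the standard library. It is not part of the PR, the name `alloc_churn` is hypothetical, and it is no substitute for the real gRPC benchmarks: it only times many small allocations made concurrently, which is the kind of pattern where allocator contention shows up.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Allocates and frees `iters` small buffers on each of `threads` threads.
/// Returns the total number of bytes allocated and the elapsed wall-clock time.
pub fn alloc_churn(threads: usize, iters: usize) -> (usize, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                let mut total = 0usize;
                for i in 0..iters {
                    // Vary the size so requests land in several size classes.
                    let buf = vec![0u8; 16 + (i % 64) * 8];
                    // black_box keeps the optimizer from eliding the allocation.
                    total += std::hint::black_box(&buf).len();
                    // `buf` is dropped here, returning the memory to the allocator.
                }
                total
            })
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (total, start.elapsed())
}
```

Running the same binary with a different allocator preloaded and comparing the elapsed time for, say, 1 thread versus 8 threads gives a first impression of how each allocator scales.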

For the multi-threaded environment, I believe it makes sense to benchmark against advanced allocators, since a big change in service throughput might be due not only to the library used, but also to the custom allocator used. So for the multi-threaded environment, in my opinion, it would be interesting to have the following benchmarks enabled:

  1. rust_tonic_mt_bench - debian with default glibc allocator;
  2. rust_tonic_st_musl_bench - alpine with default musl allocator;
  3. rust_tonic_st_jemalloc_bench - allocator used currently for benchmark;
  4. rust_tonic_st_mimalloc_bench - I've had a positive personal experience with it and am biased toward having it in the default benchmark suite. I have not personally used tcmalloc in production code and am completely fine with keeping it among the extended set of benchmarks.

Does that sound acceptable/reasonable to you? Should I proceed with moving:

  1. rust_tonic_st_tcmalloc_bench
  2. rust_tonic_st_mimalloc_bench
  3. rust_tonic_st_jemalloc_bench
  4. rust_tonic_mt_tcmalloc_bench into the detailed folder?

mstyura, Jun 26 '23 19:06

@mstyura Thanks for the write-up. Truth be told, I'm not entirely sure; I'll ponder this later. What bothers me is the precedent - why does tonic get several allocators comparison and not, e.g., other Rust or C++ implementations? I'm considering having a separate benchmark set showing the importance of choosing the optimal allocator for the given workflow. Thoughts?

LesnyRumcajs, Jun 27 '23 07:06

why does tonic get several allocators comparison and not, e.g., other Rust or C++ implementations?

I completely agree with the point that any language which relies on malloc/free can benefit from an alternative allocator. In an ideal world with no limit on the number of benchmarks, I'd love to see all the Rust, C++, and similar implementations under different allocators. The original reason for choosing tonic was that I use this library and it's not an outsider according to the benchmark runs, so it was interesting to see how its performance results change when an alternative allocator is used. So maybe it actually makes sense to choose not tonic but thruster, since it seems to be the fastest among the Rust libs (I see that only one Java bench uses many different GC algos).

I'm considering having a separate benchmark set showing the importance of choosing the optimal allocator for the given workflow

That sounds interesting. I believe it could be a very good indicator of the importance of using the right allocator.

mstyura, Jun 27 '23 09:06

I'm considering squashing the Java benchmarks into a single one as well, especially if there is one clear winner (in terms of latency and mem/cpu usage).

I'm good with your current suggestion as a first step.

Later, I'd like to choose an optimal allocator based on the results and put all but one implementation (for Rust) under something like detailed_allocators, detailed_garbage_collection. They should not be treated as somehow lesser implementations, so they will have to be in the CI (I fear the current one in detailed/ no longer compiles). Then, a Makefile would do nicely and one could run make bench, make bench-alloc, etc.

Does it sound good to you?

LesnyRumcajs, Jun 30 '23 15:06