
pingora-core: benchmark TLS acceptor and connector

Open hargut opened this issue 1 year ago • 1 comment

What is the problem your feature solves, or the need it fulfills?

Currently there are no benchmarks available for the pingora-core TLS acceptor & connector.

To fully understand the performance impact of changes in this area, stable benchmarks for the relevant code paths are required.

Ideally the benchmarks provide concise data that helps to understand and qualify changes in the area of TLS backends, as currently being worked on in #336 and related to #29. Furthermore, the benchmarks could help to optimize performance within the covered areas.

Describe the solution you'd like

Add benchmarks for the TLS acceptor and connector, runnable via cargo bench. The benchmarks should capture a full connect or accept cycle.

I suggest setting up the benchmarks with a Valgrind-based approach using iai_callgrind. This approach would also allow for memory profiling (e.g. via dhat).

Describe alternatives you've considered

Initial tests I ran with reqwest against an echo acceptor, using criterion for measurement, showed timing results that varied by up to several percent across runs (on my local machine).

hargut avatar Aug 22 '24 08:08 hargut

Running some initial experiments showed that:

  • valgrind / iai_callgrind is very stable for the full-blown test case compared to criterion

    • the initial benchmarks behind that conclusion included a full-blown echo server startup (with the OpenSSL TLS backend), random data generation, and executing the requests
    • the variance for most metrics was always < 1% across multiple test runs with various configurations
      • metrics considered stable: Instructions, L1 Hits, L2 Hits, Total read+write, Estimated Cycles
      • RAM Hits sometimes showed larger variance, in the range of up to 5%
  • iai_callgrind in its current form does not provide the flexibility required to run analyses for server/client setups

    • I started to patch the crate so that it supports initial acceptor cycle testing
      • the patch (currently not published) allows running a non-benchmarked Rust action from a BinaryBenchmark after the binary has been executed

The enhanced experimental setup:

  • runs a pingora echo_server as an iai_callgrind BinaryBenchmark
  • executes a Rust function after the benchmark binary is launched
  • in that function, waits for a TCP connect to the echo server to succeed and then sends some request(s)
  • builds the echo_server binary with --release --examples

Example outputs from multiple runs of a test scenario with 256 requests, each with a body size of 64 ASCII chars, to give an idea of the variance:

  Instructions:           699912169|699908503       (+0.00052%) [+1.00001x]
  L1 Hits:                914294651|914283548       (+0.00121%) [+1.00001x]
  L2 Hits:                  9569939|9571815         (-0.01960%) [-1.00020x]
  RAM Hits:                  392068|391533          (+0.13664%) [+1.00137x]
  Total read+write:       924256658|924246896       (+0.00106%) [+1.00001x]
  Estimated Cycles:       975866726|975846278       (+0.00210%) [+1.00002x]
  Instructions:           699935154|699912169       (+0.00328%) [+1.00003x]
  L1 Hits:                914322639|914294651       (+0.00306%) [+1.00003x]
  L2 Hits:                  9569227|9569939         (-0.00744%) [-1.00007x]
  RAM Hits:                  391513|392068          (-0.14156%) [-1.00142x]
  Total read+write:       924283379|924256658       (+0.00289%) [+1.00003x]
  Estimated Cycles:       975871729|975866726       (+0.00051%) [+1.00001x]

Some variance is to be expected, as tokio is used and real network requests are issued.

Have a nice weekend. :sunflower:

Kind regards, Harald

hargut avatar Aug 23 '24 13:08 hargut