Benchmarking 2.0 infrastructure and automation
Description
The new benchmarking framework will take a slightly different route from our current approach: it will leverage either ESS (preferred) or an on-demand ECE environment (ecetest) to run nightly benchmarks that let us track APM Server performance.
The lowest-effort option is to use a region in ESS where we can benchmark the APM Server's throughput, collect the results, and index them into a long-lived remote Elasticsearch cluster with gobench. If we hit unforeseen limitations when using ESS to run the benchmarks, we can then look into spinning up an on-demand ECE environment, but that is much costlier in time, resources, and money.
Considerations
- Keep the hey-apm benchmarks running for a while and decommission them only after we're happy with the new approach.
- Run `apmbench` on a machine that is appropriate for the work and as close as possible to the workload (same CSP region).
- The hardware profile must be able to handle a high level of concurrency, since we'll be running `apmbench` with a medium-to-high number of agents, and must provide decent network performance.
Approach
Docker image tag
Since the Elastic Stack and APM Server will be running in ESS, the software must be packaged in Docker images. Building these images is out of scope for this issue, but to reduce the risk of running the benchmarks against an upstream version that doesn't fully work, we should have some guarantees in place and a vetting process for the "latest" version.
Since we already have a workflow that updates the Docker images for each of the APM Server's active branches, we could rely on the image tags used in our docker-compose.yml file and specify the current image's tag as the Docker image to use in `<elasticsearch|kibana|apm>.config.docker_image` when creating the ESS deployment. See the Terraform provider acceptance test that uses `docker_image`, and the sketch below.
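As a rough sketch of how this could look with the Terraform `ec` provider, which supports a `docker_image` override in each component's `config` block, something like the following could pin the deployment to vetted tags. The deployment name, region, template ID, version, and image references here are illustrative assumptions, not our actual configuration:

```hcl
# Sketch: an ESS deployment pinned to vetted Docker image tags.
# Name, region, template ID, version, and image references are assumptions.
resource "ec_deployment" "benchmark" {
  name                   = "apm-server-benchmark"
  region                 = "gcp-us-west2"             # assumed cloud-first testing region
  version                = "8.4.0-SNAPSHOT"           # stack version under test
  deployment_template_id = "gcp-compute-optimized-v2" # assumed hardware profile

  elasticsearch {
    config {
      # Tag taken from the docker-compose.yml kept up to date by the workflow.
      docker_image = "docker.elastic.co/cloud-release/elasticsearch-cloud-ess:8.4.0-SNAPSHOT"
    }
  }

  kibana {
    config {
      docker_image = "docker.elastic.co/cloud-release/kibana-cloud:8.4.0-SNAPSHOT"
    }
  }

  apm {
    config {
      docker_image = "docker.elastic.co/cloud-release/elastic-agent-cloud:8.4.0-SNAPSHOT"
    }
  }
}
```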
Deployment lifecycle
The most cost-effective and efficient approach is to create a new deployment in ESS plus a new VM with the desired hardware profile for the `apmbench` runner in the same region, and to upload the credentials that `apmbench` needs to connect to the deployment. After the benchmarks have run and the results have been uploaded to a persistent deployment where we'd store them, the benchmark deployment and the `apmbench` VM should be torn down to cut costs.
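Assuming GCP as the CSP, the runner VM could be declared in the same Terraform configuration as the deployment, so that a single `terraform destroy` tears both down once the results are safely indexed. The machine type, zone, and boot image below are illustrative assumptions:

```hcl
# Sketch: a disposable apmbench runner colocated with the ESS deployment's region.
# Machine type, zone, and image are assumptions; the profile should sustain
# high concurrency and decent network throughput.
resource "google_compute_instance" "apmbench_runner" {
  name         = "apmbench-runner"
  machine_type = "c2-standard-16" # assumed compute-optimized profile
  zone         = "us-west2-a"     # same region as the ESS deployment

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
    access_config {} # ephemeral public IP so the CI job can reach the runner
  }
}
```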
The Terraform configuration for the benchmark deployment could live in the APM Server repo, where APM Server developers could also use it to benchmark their own changes. A limitation, however, is that a cloud Docker image would need to be built and uploaded before that testing could take place.
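To support that developer workflow, the configuration could expose the image tag as an input variable so a custom-built cloud image can be substituted for the nightly default. The variable name and wiring below are hypothetical:

```hcl
# Hypothetical variable: lets a developer substitute a custom-built cloud image.
variable "docker_image_tag_override" {
  type        = string
  default     = ""
  description = "If non-empty, overrides the stack Docker image tag used by the ESS deployment."
}

locals {
  # Fall back to the vetted nightly tag when no override is given.
  image_tag = var.docker_image_tag_override != "" ? var.docker_image_tag_override : "8.4.0-SNAPSHOT"
}
```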
Automation work
- [x] Create or re-use an Elastic owned ESS account to hold the benchmarking deployments.
- [x] Decide which Cloud Service Provider to use and obtain the necessary credentials.
- [x] Set up a Jenkins job that runs daily and executes the benchmarks.
- [ ] Publish the benchmark results in #apm-server.
Please ensure the benchmarks run in the cloud-first testing regions.
@simitt we are:
https://github.com/elastic/apm-server/blob/c30b2dae5f1ceb50ccd0b79cb8cc1eb05e2dd14e/testing/benchmark/variables.tf#L10-L14
Everything in this ticket has been done except the performance reporting back to #apm-server.
@simitt We are tracking the need for the Slack reporting here. We've had some recent staffing changes and this might be delayed a bit. I will update you again when I know more.