Service Metrics Aggregator
Introduces a new derivative datastream that holds latency and throughput metrics pivoted by
service.name + service.environment + transaction.type.
This yields a significant reduction in the number of documents we need to query on the Service Inventory page and gives us a more natural place to query for services (e.g. Service Groups).
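To make the shape of this datastream concrete, here is a minimal illustrative sketch in Go; the struct and field names are assumptions for illustration only and do not reflect the actual schema in this PR.

```go
// Illustrative sketch only: a service metrics document is keyed by the three
// dimensions below and carries pre-aggregated latency and throughput values
// for a single aggregation interval.
type serviceMetricsKey struct {
	ServiceName        string // service.name
	ServiceEnvironment string // service.environment
	TransactionType    string // transaction.type
}

type serviceMetrics struct {
	TransactionCount  int64   // transactions observed in the interval
	TransactionDurSum float64 // summed transaction duration; avg latency = sum / count
	// Throughput can be derived from TransactionCount and the interval length.
}
```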
Motivation/summary
This work stems from my most recent APM On Week: https://github.com/elastic/apm-dev/issues/768
Which itself is a continuation of the prior On Week effort to make synthtrace behave more closely to how apm-server behaves, so we can prototype new aggregations more easily and realistically: https://github.com/elastic/kibana/pull/127257
In the context of APM UI performance investigations we used this to prototype a new service metrics datastream: https://github.com/elastic/kibana/pull/132889
Which we could use to query for service metrics at a 100x performance increase by working over significantly fewer documents/values. In our two main test datasets the reduction was as follows (the drop from 50 million transaction metrics to 500k service timeseries is the 100x reduction):
- 500 million traces
- 50 million transaction metrics
- 500k service timeseries
A few other benefits:
- This new datastream maps really well to TSDB and will benefit further from downsampling.
- We can be very explicit about what metadata we want to include and drive other features such as service groups and entity models.
- The higher the load on APM, the more the compression into service metrics works in our favor.
Checklist
- [ ] Update CHANGELOG.asciidoc
- [ ] Update package changelog.yml (only if changes to `apmpackage` have been made)
- [ ] Documentation has been updated
For functional changes, consider:
- Is it observable through the addition of either logging or metrics?
- Is its use being published in telemetry to enable product improvement?
- Have system tests been added to avoid regression?
How to test these changes
Terminal 1:
```sh
apm-server $ docker compose up --force-recreate --build
```
Terminal 2:
```sh
apm-server $ cd systemtest/cmd/runapm
run-apm $ go run main.go -arch amd64 -f -reinstall
```
I have been using this modified version of synthtrace which currently still only lives in https://github.com/elastic/kibana/pull/136530
```sh
node scripts/synthtrace packages/elastic-apm-synthtrace/src/scenarios/high_throughput.ts --local --maxDocs 1000000 --apm http://localhost:8200/ --username admin --skipPackageInstall --clean
```
This allows you to send `--maxDocs` traces directly to APM using various data generation scenarios.
Related issues
https://github.com/elastic/apm-server/issues/8756
This pull request does not have a backport label. Could you fix it @Mpdreamz? 🙏 To fix up this pull request, you need to add the backport labels for the needed branches, such as:
- `backport-7.x` is the label to automatically backport to the `7.x` branch.
- `backport-7./d` is the label to automatically backport to the `7./d` branch, where `/d` is the digit.
NOTE: backport-skip has been added to this pull request.
:green_heart: Build Succeeded
Build stats
- Start Time: 2022-08-31T12:25:25.434+0000
- Duration: 27 min 56 sec
Test stats :test_tube:
| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 130 |
| Skipped | 0 |
| Total | 130 |
:robot: GitHub comments
To re-run your PR in the CI, just comment with:
- `/test`: Re-trigger the build.
- `/package`: Generate and publish the docker images.
- `/test windows`: Build & tests on Windows.
- `run elasticsearch-ci/docs`: Re-trigger the docs validation. (use unformatted text in the comment!)
@Mpdreamz can you please link to an issue with some more context and justification for this change?
@simitt Updated the motivation with more context and included links to all prior work that led to this PR.
:books: Go benchmark report
Diff with the main branch
name old time/op new time/op delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
FetchAndAdd/FetchAndAddToCache-12 100ns ± 4% 93ns ± 2% -6.19% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
ContextReset/Forwarded_ipv4-12 162ns ± 1% 159ns ± 1% -1.70% (p=0.008 n=5+5)
ContextReset/Forwarded_ipv6-12 170ns ± 1% 167ns ± 0% -1.73% (p=0.008 n=5+5)
ContextReset/X-Real-IP_ipv4-12 119ns ±13% 114ns ± 1% -4.57% (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/events.ndjson-12 57.9µs ±24% 45.1µs ± 5% -22.06% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12 11.8µs ± 4% 13.2µs ±13% +11.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/minimal-service.ndjson-12 1.40µs ± 1% 1.41µs ± 1% +0.87% (p=0.040 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12 7.10µs ± 2% 6.67µs ± 1% -6.11% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12 1.60µs ± 1% 1.52µs ± 1% -4.49% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12 11.4µs ± 1% 10.8µs ± 1% -5.61% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions-huge_traces.ndjson-12 5.67µs ± 2% 5.46µs ± 2% -3.63% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions.ndjson-12 11.2µs ± 1% 10.8µs ± 1% -3.61% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12 11.2µs ± 1% 10.6µs ± 1% -5.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12 2.03µs ± 1% 1.97µs ± 1% -3.13% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12 1.96µs ± 1% 1.89µs ± 1% -3.80% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12 7.41µs ± 0% 7.11µs ± 2% -4.11% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12 7.53µs ± 1% 6.80µs ± 1% -9.64% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12 6.99µs ± 1% 6.37µs ± 1% -8.89% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_rum.ndjson-12 2.00µs ± 1% 1.90µs ± 4% -5.27% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_transaction_id.ndjson-12 5.53µs ± 1% 5.37µs ± 1% -2.87% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12 13.8µs ± 1% 13.4µs ± 1% -3.11% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12 798ns ± 2% 762ns ± 2% -4.47% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event.ndjson-12 3.38µs ± 1% 3.26µs ± 1% -3.50% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12 1.08µs ± 1% 1.05µs ± 0% -3.01% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12 1.95µs ± 0% 1.86µs ± 0% -4.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12 487ns ± 1% 467ns ± 2% -4.19% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12 494ns ± 1% 473ns ± 2% -4.12% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata-null-values.ndjson-12 772ns ± 2% 747ns ± 2% -3.26% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata.ndjson-12 1.28µs ± 3% 1.23µs ± 2% -4.36% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metricsets.ndjson-12 4.70µs ± 1% 4.47µs ± 1% -4.96% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal-service.ndjson-12 1.07µs ± 1% 1.03µs ± 2% -4.05% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal.ndjson-12 2.14µs ± 5% 2.04µs ± 1% -4.79% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/otel-bridge.ndjson-12 3.22µs ± 1% 3.11µs ± 2% -3.17% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12 6.25µs ± 3% 5.85µs ± 1% -6.43% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/span-links.ndjson-12 1.21µs ± 0% 1.18µs ± 3% -2.71% (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12 9.80µs ± 1% 9.44µs ± 1% -3.77% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions.ndjson-12 9.49µs ± 2% 9.18µs ± 2% -3.26% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12 1.62µs ± 2% 1.57µs ± 1% -2.70% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum_2.ndjson-12 1.55µs ± 1% 1.50µs ± 1% -3.28% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/unknown-span-type.ndjson-12 6.20µs ± 1% 6.03µs ± 1% -2.82% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
AggregateSpan-12 1.09µs ±20% 1.33µs ±26% +22.73% (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
TraceGroups-12 126ns ± 5% 122ns ± 1% -3.45% (p=0.016 n=5+5)
Process-12 1.84µs ±10% 1.27µs ±25% -31.13% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ShardedWriteTransactionUncontended-12 934ns ±15% 817ns ±10% -12.57% (p=0.032 n=5+5)
ReadEvents/nop_codec/199_events-12 262µs ± 6% 238µs ± 3% -9.09% (p=0.016 n=5+4)
name old alloc/op new alloc/op delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/heavy.ndjson-12 1.27MB ± 1% 1.30MB ± 1% +2.29% (p=0.008 n=5+5)
BackendProcessor/minimal-service.ndjson-12 4.71kB ± 1% 4.66kB ± 1% -1.13% (p=0.008 n=5+5)
BackendProcessor/transactions_spans.ndjson-12 23.9kB ± 1% 24.6kB ± 1% +3.07% (p=0.008 n=5+5)
BackendProcessor/transactions_spans_rum_2.ndjson-12 5.49kB ± 1% 5.56kB ± 0% +1.35% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/heavy.ndjson-12 1.28MB ± 0% 1.31MB ± 1% +2.30% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/span-links.ndjson-12 5.13kB ± 1% 5.24kB ± 1% +2.21% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/transactions_spans_rum_2.ndjson-12 5.97kB ± 1% 6.07kB ± 2% +1.68% (p=0.040 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/heavy.ndjson-12 1.28MB ± 1% 1.32MB ± 1% +2.68% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12 15.9kB ± 1% 16.2kB ± 1% +1.84% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions.ndjson-12 27.4kB ± 1% 27.9kB ± 1% +1.88% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions_spans.ndjson-12 25.3kB ± 2% 25.8kB ± 1% +1.97% (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/heavy.ndjson-12 1.28MB ± 1% 1.30MB ± 0% +2.20% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/optional-timestamps.ndjson-12 6.06kB ± 2% 6.16kB ± 1% +1.73% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12 15.7kB ± 1% 16.0kB ± 0% +2.21% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12 5.36kB ± 1% 5.40kB ± 0% +0.75% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12 6.10kB ± 1% 6.20kB ± 1% +1.55% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/heavy.ndjson-12 1.28MB ± 1% 1.31MB ± 1% +2.32% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12 15.4kB ± 1% 15.8kB ± 1% +2.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12 29.7kB ± 0% 30.1kB ± 1% +1.28% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
Process-12 10.2kB ± 2% 9.8kB ± 3% -3.91% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ReadEvents/json_codec/1_events-12 8.24kB ± 0% 8.28kB ± 0% +0.43% (p=0.008 n=5+5)
ReadEvents/json_codec/10_events-12 75.4kB ± 0% 75.7kB ± 0% +0.41% (p=0.008 n=5+5)
ReadEvents/json_codec/100_events-12 750kB ± 0% 753kB ± 0% +0.40% (p=0.008 n=5+5)
ReadEvents/json_codec/199_events-12 1.01MB ± 0% 1.02MB ± 0% +0.61% (p=0.016 n=5+4)
ReadEvents/json_codec/399_events-12 1.54MB ± 0% 1.56MB ± 0% +0.89% (p=0.008 n=5+5)
ReadEvents/json_codec/1000_events-12 3.34MB ± 1% 3.36MB ± 0% +0.68% (p=0.016 n=5+4)
ReadEvents/json_codec_big_tx/1_events-12 10.3kB ± 0% 10.3kB ± 0% +0.31% (p=0.029 n=4+4)
ReadEvents/json_codec_big_tx/10_events-12 95.5kB ± 0% 95.9kB ± 0% +0.37% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/100_events-12 952kB ± 0% 955kB ± 0% +0.37% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/199_events-12 1.34MB ± 0% 1.35MB ± 0% +0.50% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/399_events-12 2.13MB ± 0% 2.14MB ± 0% +0.60% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/1000_events-12 4.76MB ± 0% 4.78MB ± 0% +0.55% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/heavy.ndjson-12 23.3k ± 0% 23.3k ± 0% +0.00% (p=0.048 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
name old speed new speed delta
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/transactions.ndjson-12 50.8MB/s ± 4% 61.6MB/s ±11% +21.39% (p=0.016 n=4+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/events.ndjson-12 130MB/s ±21% 165MB/s ± 5% +26.46% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12 357MB/s ± 4% 321MB/s ±11% -10.14% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12 593MB/s ± 2% 631MB/s ± 1% +6.50% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12 428MB/s ± 1% 448MB/s ± 1% +4.68% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12 703MB/s ± 1% 744MB/s ± 1% +5.94% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions-huge_traces.ndjson-12 560MB/s ± 2% 581MB/s ± 2% +3.76% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions.ndjson-12 503MB/s ± 1% 522MB/s ± 1% +3.75% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12 518MB/s ± 1% 549MB/s ± 1% +5.99% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12 568MB/s ± 1% 586MB/s ± 1% +3.22% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12 569MB/s ± 1% 591MB/s ± 1% +3.93% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12 446MB/s ± 0% 465MB/s ± 2% +4.29% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12 842MB/s ± 1% 932MB/s ± 1% +10.67% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12 674MB/s ± 1% 740MB/s ± 1% +9.77% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_rum.ndjson-12 947MB/s ± 1% 1000MB/s ± 4% +5.62% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_transaction_id.ndjson-12 692MB/s ± 1% 712MB/s ± 1% +2.96% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12 538MB/s ± 1% 555MB/s ± 1% +3.22% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12 490MB/s ± 2% 513MB/s ± 2% +4.68% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event.ndjson-12 226MB/s ± 1% 235MB/s ± 1% +3.61% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12 541MB/s ± 1% 558MB/s ± 0% +3.09% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12 229MB/s ± 0% 240MB/s ± 0% +4.87% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12 895MB/s ± 1% 934MB/s ± 2% +4.38% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12 903MB/s ± 1% 942MB/s ± 2% +4.30% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata-null-values.ndjson-12 681MB/s ± 2% 704MB/s ± 2% +3.37% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata.ndjson-12 968MB/s ± 3% 1012MB/s ± 2% +4.56% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metricsets.ndjson-12 542MB/s ± 1% 570MB/s ± 1% +5.21% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal-service.ndjson-12 397MB/s ± 1% 414MB/s ± 2% +4.22% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal.ndjson-12 481MB/s ± 5% 505MB/s ± 1% +4.94% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/otel-bridge.ndjson-12 585MB/s ± 1% 604MB/s ± 2% +3.28% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12 674MB/s ± 3% 720MB/s ± 1% +6.83% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/span-links.ndjson-12 564MB/s ± 0% 580MB/s ± 3% +2.80% (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12 819MB/s ± 1% 851MB/s ± 1% +3.92% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions.ndjson-12 595MB/s ± 2% 615MB/s ± 2% +3.36% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12 714MB/s ± 2% 734MB/s ± 1% +2.77% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum_2.ndjson-12 721MB/s ± 1% 745MB/s ± 1% +3.39% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/unknown-span-type.ndjson-12 533MB/s ± 1% 548MB/s ± 1% +2.90% (p=0.008 n=5+5)
report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
:globe_with_meridians: Coverage report
| Name | Metrics % (covered/total) | Diff |
|---|---|---|
| Packages | 100.0% (43/43) | :green_heart: |
| Files | 92.0% (184/200) | :+1: 0.04 |
| Classes | 93.506% (432/462) | :+1: 0.071 |
| Methods | 89.268% (1098/1230) | :+1: 0.106 |
| Lines | 77.004% (13538/17581) | :+1: 0.141 |
| Conditionals | 100.0% (0/0) | :green_heart: |
I'm not too keen on inventing a new way to count failed transactions for a service, as it feels a bit like overfitting to the current service overview page.
I think 'overfitting' is not a negative here, as it's actually intentional.
The primary goal of service metrics is to produce dedicated pre-calculated metric timeseries for the overview page.
A secondary goal would be:
- Service Groups: attach enough metadata (not dimensions) to create curated query/grouping capabilities.
Did you already try grouping by event.outcome, and comparing the resulting storage and query efficiency?
I would be more cautious about introducing more dimensions than about introducing more metrics; we have no need to create additional timeseries per event.outcome (see the sketch below).
What's the main worry here? It's not necessarily a different way, but a new mechanism to deliver the same answer, with one being pre-calculated.
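A minimal sketch of that distinction, with illustrative Go types; the names are assumptions and not the actual model used by apm-server:

```go
// Illustrative only: two ways to capture failed transactions.

// (a) Extra dimension: event.outcome becomes part of the series key,
//     multiplying the number of timeseries per service.
type keyWithOutcomeDimension struct {
	ServiceName        string
	ServiceEnvironment string
	TransactionType    string
	EventOutcome       string // "success", "failure", or "unknown"
}

// (b) Extra metric: the series key stays unchanged and the outcome is
//     folded into counters on the existing document.
type metricsWithFailureCount struct {
	TransactionCount int64
	FailureCount     int64
}
```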
Conversely, although having min/max is nice and flexible, I would fit to the current use of aggregations. The aggregate_metric_double type is rather inflexible, and you have to know up front which subfields you may populate. In this case we know, but in the case of dynamic summary metrics we do not.
I'm not invested in keeping p0 and p100 around. They are however reasonably cheap to create and keep around. I would love to hear from the @elastic/machine-learning team whether they have any benefits for any of their jobs.
If we do need them, perhaps we should be storing a latency histogram instead.
I would not be opposed to doing both, but would not choose either histogram or aggregate_metric_double on its own. Key for us is reducing the number of values we need to aggregate over and pre-calculating as much as possible.
Today our service inventory page does not display p95/p99. I would love to add latency histograms to the emitted metrics to support that in a follow-up PR, though. The downside is that there is no play for histograms in TSDB, and introducing them would potentially delay Service Metrics utilizing TSDB while we work with the team to create a play for histograms.
Alternatively, we would need to emit two metric documents per service.name + service.environment + transaction.type.
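For context, a rough sketch of the two shapes under discussion; the type and field names are illustrative assumptions, not the actual mappings:

```go
// Illustrative only: a fixed set of summary sub-fields (the
// aggregate_metric_double style, known up front) versus a bucketed
// histogram from which percentiles such as p95/p99 can later be derived.
type latencySummary struct {
	Min, Max, Sum float64
	ValueCount    int64
}

type latencyHistogram struct {
	Values []float64 // bucket boundaries / representative latency values
	Counts []int64   // number of observations per bucket
}
```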
I think 'overfitting' is not a negative here, as it's actually intentional. The primary goal of service metrics is to produce dedicated pre-calculated metric timeseries for the overview page.
What I meant is that it's overfitting to the current service overview page. If it changes (say we want to add percentiles), then we need to adapt or introduce new metrics. Every addition bears a tech-debt cost, and data lives for a long time. So if we can make it a little more flexible then that would be nice.
Of course, there are always tradeoffs. If we do group by event.outcome then there will naturally be more docs, so it depends on the cost.
What's the main worry here? It's not necessarily a different way, but a new mechanism to deliver the same answer, with one being pre-calculated.
At the moment, the way the UI queries for failed transaction rate is by filtering on event.outcome: failed. By introducing a new field, the UI (and any other consumer) now needs to deal with two ways of calculating failed transaction rate. I'm questioning whether the increase in complexity is worthwhile.
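For illustration, the two query shapes the UI would then have to support might look roughly like the following; these are sketches with hypothetical field names (transaction.failure_count, transaction.success_count), not the actual Kibana queries:

```go
// Illustrative only: Elasticsearch query bodies as Go string constants.

// (a) Raw events / transaction metrics: derive the rate from event.outcome.
const failureRateFromOutcome = `{
  "size": 0,
  "aggs": {
    "outcomes": { "terms": { "field": "event.outcome" } }
  }
}`

// (b) Service metrics: read pre-calculated counters from new fields.
const failureRateFromCounters = `{
  "size": 0,
  "aggs": {
    "failures":  { "sum": { "field": "transaction.failure_count" } },
    "successes": { "sum": { "field": "transaction.success_count" } }
  }
}`
```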
Today our service inventory page does not display p95/p99. I would love to add latency histograms to the emitted metrics to support that in a follow-up PR, though. The downside is that there is no play for histograms in TSDB, and introducing them would potentially delay Service Metrics utilizing TSDB while we work with the team to create a play for histograms.
Do you have any numbers on the size of service metrics vs. transaction metrics? Is it worthwhile rolling up only service metrics in the short-to-mid term?
I realise we need to start somewhere, but it seems of limited value to users to roll up service metrics if we can't roll up the more expensive transaction metrics. So, on the whole, I think I'd prefer to go with histogram if it is eventually the ideal data type, even if it means we don't get rollups in the short term.
This pull request is now in conflicts. Could you fix it @Mpdreamz? 🙏 To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```sh
git fetch upstream
git checkout -b feature/service-metrics-aggregator upstream/feature/service-metrics-aggregator
git merge upstream/main
git push upstream feature/service-metrics-aggregator
```
At the moment, the way the UI queries for failed transaction rate is by filtering on event.outcome: failed. By introducing a new field, the UI (and any other consumer) now needs to deal with two ways of calculating failed transaction rate. I'm questioning whether the increase in complexity is worthwhile.
I think that as we progress with more data transform capabilities in the stack, the need to query either pre-aggregated or raw data will only increase. The UI already has to discover how to query today, even with a single common field.
Dimensions are progressively more costly, so I would love to keep them to an absolute minimum unless our desired functionality dictates otherwise.
Do you have any numbers on the size of service metrics vs. transaction metrics? Is it worthwhile rolling up only service metrics in the short-to-mid term?
Not in terms of number of documents (that is still an expectation), but it is also worthwhile as a means to gain confidence in onboarding our metrics indices to TSDB.
I realise we need to start somewhere, but it seems of limited value to users to roll up service metrics if we can't roll up the more expensive transaction metrics. So, on the whole, I think I'd prefer to go with histogram if it is eventually the ideal data type, even if it means we don't get rollups in the short term.
Yeah, TSDB/rollups are not a hard requirement at all, but I would still keep the initial Service Metrics implementation as minimal and simple as possible to prove its performance claims in the real world.
100% agree we need to include histograms in another iteration of this though.
What I meant is that it's overfitting to the current service overview page. If it changes (say we want to add percentiles), then we need to adapt or introduce new metrics.
I think this is less of an issue; we can include histograms later without it being a breaking change.
Just coming back to the failure_count vs. event.outcome debate...
I'd like us to try and find a way to address https://github.com/elastic/apm-server/issues/5243 at the same time. While looking at that, it occurred to me that just storing failure_count is not good enough. The failed transaction rate is calculated as failure / (failure + success), which excludes transactions with event.outcome: unknown. We would also need to store the number of successful transactions.
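A minimal sketch of the calculation described above, assuming both counters are stored (the function and parameter names are illustrative):

```go
// failedTransactionRate returns failure / (failure + success).
// Transactions with event.outcome "unknown" are excluded, which is why a
// failure count alone is not enough and a success count must be stored too.
func failedTransactionRate(successCount, failureCount int64) float64 {
	total := successCount + failureCount
	if total == 0 {
		return 0
	}
	return float64(failureCount) / float64(total)
}
```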
Moving this into draft while I merge main and make some updates.
@Mpdreamz Are we setting a metricset.name for the service metrics, e.g. metricset.name: "services"? Having a unique identifier for the service metrics is important for the UI when querying for them and determining whether they are available or not.
@sqren yep! https://github.com/elastic/apm-server/pull/8607/files#diff-4e538a1d3c8af2d271c66e8044baca93a083e0d85779d51cc31d4426a9ff0a95R300
It's currently `service`; should it be plural?
@sqren yep! https://github.com/elastic/apm-server/pull/8607/files#diff-4e538a1d3c8af2d271c66e8044baca93a083e0d85779d51cc31d4426a9ff0a95R300
It's currently `service`; should it be plural?
Perfect! Singular is fine.
This pull request is now in conflicts. Could you fix it @Mpdreamz? 🙏 To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```sh
git fetch upstream
git checkout -b feature/service-metrics-aggregator upstream/feature/service-metrics-aggregator
git merge upstream/main
git push upstream feature/service-metrics-aggregator
```