Service Metrics Aggregator
Introduces a new derivative datastream that holds latency and throughput metrics pivoted by
service.name + service.environment + transaction.type.
This yields a significant reduction in the number of documents we need to query on the Service Inventory page and gives us a more natural place to query for services (e.g. Service Groups).
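To make the shape of this datastream concrete, here is a minimal illustrative sketch in Go; the struct and field names are assumptions for illustration only and do not reflect the actual schema in this PR.

```go
// Illustrative sketch only: a service metrics document is keyed by the three
// dimensions below and carries pre-aggregated latency and throughput values
// for a single aggregation interval.
type serviceMetricsKey struct {
	ServiceName        string // service.name
	ServiceEnvironment string // service.environment
	TransactionType    string // transaction.type
}

type serviceMetrics struct {
	TransactionCount  int64   // transactions observed in the interval
	TransactionDurSum float64 // summed transaction duration; avg latency = sum / count
	// Throughput can be derived from TransactionCount and the interval length.
}
```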
Motivation/summary
This work stems from my most recent APM On Week: https://github.com/elastic/apm-dev/issues/768
Which itself is a continuation of the prior On Week effort to make synthtrace behave more closely to how apm-server behaves, so we can prototype new aggregations more easily and realistically: https://github.com/elastic/kibana/pull/127257
In the context of APM UI performance investigations we used this to prototype a new service metrics datastream: https://github.com/elastic/kibana/pull/132889
Which we could use to query for service metrics at a 100x performance increase by working over significantly fewer documents/values. In our two main test datasets the reduction was as follows (the drop from 50 million transaction metrics to 500k service timeseries is the 100x reduction):
- 500 million traces
- 50 million transaction metrics
- 500k service timeseries
A few other benefits:
- This new datastream maps really well to TSDB and will benefit further from downsampling.
- We can be very explicit about what metadata we want to include and drive other features such as service groups and entity models.
- The higher the load on APM, the more the compression into service metrics works in our favor.
Checklist
- [ ] Update CHANGELOG.asciidoc
- [ ] Update package changelog.yml (only if changes to `apmpackage` have been made)
- [ ] Documentation has been updated
For functional changes, consider:
- Is it observable through the addition of either logging or metrics?
- Is its use being published in telemetry to enable product improvement?
- Have system tests been added to avoid regression?
How to test these changes
Terminal 1:
```sh
apm-server $ docker compose up --force-recreate --build
```
Terminal 2:
```sh
apm-server $ cd systemtest/cmd/runapm
run-apm $ go run main.go -arch amd64 -f -reinstall
```
I have been using this modified version of synthtrace which currently still only lives in https://github.com/elastic/kibana/pull/136530
```sh
node scripts/synthtrace packages/elastic-apm-synthtrace/src/scenarios/high_throughput.ts --local --maxDocs 1000000 --apm http://localhost:8200/ --username admin --skipPackageInstall --clean
```
This allows you to send `--maxDocs` traces directly to APM using various data generation scenarios.
Related issues
https://github.com/elastic/apm-server/issues/8756
This pull request does not have a backport label. Could you fix it @Mpdreamz? 🙏 To fix up this pull request, you need to add the backport labels for the needed branches, such as:
- `backport-7.x` is the label to automatically backport to the `7.x` branch.
- `backport-7./d` is the label to automatically backport to the `7./d` branch, where `/d` is the digit.
NOTE: backport-skip has been added to this pull request.
:green_heart: Build Succeeded
Build stats
- Start Time: 2022-08-31T12:25:25.434+0000
- Duration: 27 min 56 sec
Test stats :test_tube:
| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 130 |
| Skipped | 0 |
| Total | 130 |
:robot: GitHub comments
To re-run your PR in the CI, just comment with:
- `/test`: Re-trigger the build.
- `/package`: Generate and publish the docker images.
- `/test windows`: Build & tests on Windows.
- `run elasticsearch-ci/docs`: Re-trigger the docs validation. (use unformatted text in the comment!)
@Mpdreamz can you please link to an issue with some more context and justification for this change?
@simitt Updated the motivation with more context and included links to all prior work that led to this PR.
:books: Go benchmark report
Diff with the main branch
name old time/op new time/op delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
FetchAndAdd/FetchAndAddToCache-12 100ns ± 4% 93ns ± 2% -6.19% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
ContextReset/Forwarded_ipv4-12 162ns ± 1% 159ns ± 1% -1.70% (p=0.008 n=5+5)
ContextReset/Forwarded_ipv6-12 170ns ± 1% 167ns ± 0% -1.73% (p=0.008 n=5+5)
ContextReset/X-Real-IP_ipv4-12 119ns ±13% 114ns ± 1% -4.57% (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/events.ndjson-12 57.9µs ±24% 45.1µs ± 5% -22.06% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12 11.8µs ± 4% 13.2µs ±13% +11.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/minimal-service.ndjson-12 1.40µs ± 1% 1.41µs ± 1% +0.87% (p=0.040 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12 7.10µs ± 2% 6.67µs ± 1% -6.11% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12 1.60µs ± 1% 1.52µs ± 1% -4.49% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12 11.4µs ± 1% 10.8µs ± 1% -5.61% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions-huge_traces.ndjson-12 5.67µs ± 2% 5.46µs ± 2% -3.63% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions.ndjson-12 11.2µs ± 1% 10.8µs ± 1% -3.61% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12 11.2µs ± 1% 10.6µs ± 1% -5.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12 2.03µs ± 1% 1.97µs ± 1% -3.13% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12 1.96µs ± 1% 1.89µs ± 1% -3.80% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12 7.41µs ± 0% 7.11µs ± 2% -4.11% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12 7.53µs ± 1% 6.80µs ± 1% -9.64% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12 6.99µs ± 1% 6.37µs ± 1% -8.89% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_rum.ndjson-12 2.00µs ± 1% 1.90µs ± 4% -5.27% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_transaction_id.ndjson-12 5.53µs ± 1% 5.37µs ± 1% -2.87% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12 13.8µs ± 1% 13.4µs ± 1% -3.11% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12 798ns ± 2% 762ns ± 2% -4.47% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event.ndjson-12 3.38µs ± 1% 3.26µs ± 1% -3.50% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12 1.08µs ± 1% 1.05µs ± 0% -3.01% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12 1.95µs ± 0% 1.86µs ± 0% -4.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12 487ns ± 1% 467ns ± 2% -4.19% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12 494ns ± 1% 473ns ± 2% -4.12% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata-null-values.ndjson-12 772ns ± 2% 747ns ± 2% -3.26% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata.ndjson-12 1.28µs ± 3% 1.23µs ± 2% -4.36% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metricsets.ndjson-12 4.70µs ± 1% 4.47µs ± 1% -4.96% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal-service.ndjson-12 1.07µs ± 1% 1.03µs ± 2% -4.05% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal.ndjson-12 2.14µs ± 5% 2.04µs ± 1% -4.79% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/otel-bridge.ndjson-12 3.22µs ± 1% 3.11µs ± 2% -3.17% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12 6.25µs ± 3% 5.85µs ± 1% -6.43% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/span-links.ndjson-12 1.21µs ± 0% 1.18µs ± 3% -2.71% (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12 9.80µs ± 1% 9.44µs ± 1% -3.77% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions.ndjson-12 9.49µs ± 2% 9.18µs ± 2% -3.26% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12 1.62µs ± 2% 1.57µs ± 1% -2.70% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum_2.ndjson-12 1.55µs ± 1% 1.50µs ± 1% -3.28% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/unknown-span-type.ndjson-12 6.20µs ± 1% 6.03µs ± 1% -2.82% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
AggregateSpan-12 1.09µs ±20% 1.33µs ±26% +22.73% (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
TraceGroups-12 126ns ± 5% 122ns ± 1% -3.45% (p=0.016 n=5+5)
Process-12 1.84µs ±10% 1.27µs ±25% -31.13% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ShardedWriteTransactionUncontended-12 934ns ±15% 817ns ±10% -12.57% (p=0.032 n=5+5)
ReadEvents/nop_codec/199_events-12 262µs ± 6% 238µs ± 3% -9.09% (p=0.016 n=5+4)
name old alloc/op new alloc/op delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/heavy.ndjson-12 1.27MB ± 1% 1.30MB ± 1% +2.29% (p=0.008 n=5+5)
BackendProcessor/minimal-service.ndjson-12 4.71kB ± 1% 4.66kB ± 1% -1.13% (p=0.008 n=5+5)
BackendProcessor/transactions_spans.ndjson-12 23.9kB ± 1% 24.6kB ± 1% +3.07% (p=0.008 n=5+5)
BackendProcessor/transactions_spans_rum_2.ndjson-12 5.49kB ± 1% 5.56kB ± 0% +1.35% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/heavy.ndjson-12 1.28MB ± 0% 1.31MB ± 1% +2.30% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/span-links.ndjson-12 5.13kB ± 1% 5.24kB ± 1% +2.21% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/transactions_spans_rum_2.ndjson-12 5.97kB ± 1% 6.07kB ± 2% +1.68% (p=0.040 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/heavy.ndjson-12 1.28MB ± 1% 1.32MB ± 1% +2.68% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12 15.9kB ± 1% 16.2kB ± 1% +1.84% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions.ndjson-12 27.4kB ± 1% 27.9kB ± 1% +1.88% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions_spans.ndjson-12 25.3kB ± 2% 25.8kB ± 1% +1.97% (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/heavy.ndjson-12 1.28MB ± 1% 1.30MB ± 0% +2.20% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/optional-timestamps.ndjson-12 6.06kB ± 2% 6.16kB ± 1% +1.73% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12 15.7kB ± 1% 16.0kB ± 0% +2.21% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12 5.36kB ± 1% 5.40kB ± 0% +0.75% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12 6.10kB ± 1% 6.20kB ± 1% +1.55% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/heavy.ndjson-12 1.28MB ± 1% 1.31MB ± 1% +2.32% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12 15.4kB ± 1% 15.8kB ± 1% +2.66% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12 29.7kB ± 0% 30.1kB ± 1% +1.28% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
Process-12 10.2kB ± 2% 9.8kB ± 3% -3.91% (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ReadEvents/json_codec/1_events-12 8.24kB ± 0% 8.28kB ± 0% +0.43% (p=0.008 n=5+5)
ReadEvents/json_codec/10_events-12 75.4kB ± 0% 75.7kB ± 0% +0.41% (p=0.008 n=5+5)
ReadEvents/json_codec/100_events-12 750kB ± 0% 753kB ± 0% +0.40% (p=0.008 n=5+5)
ReadEvents/json_codec/199_events-12 1.01MB ± 0% 1.02MB ± 0% +0.61% (p=0.016 n=5+4)
ReadEvents/json_codec/399_events-12 1.54MB ± 0% 1.56MB ± 0% +0.89% (p=0.008 n=5+5)
ReadEvents/json_codec/1000_events-12 3.34MB ± 1% 3.36MB ± 0% +0.68% (p=0.016 n=5+4)
ReadEvents/json_codec_big_tx/1_events-12 10.3kB ± 0% 10.3kB ± 0% +0.31% (p=0.029 n=4+4)
ReadEvents/json_codec_big_tx/10_events-12 95.5kB ± 0% 95.9kB ± 0% +0.37% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/100_events-12 952kB ± 0% 955kB ± 0% +0.37% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/199_events-12 1.34MB ± 0% 1.35MB ± 0% +0.50% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/399_events-12 2.13MB ± 0% 2.14MB ± 0% +0.60% (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/1000_events-12 4.76MB ± 0% 4.78MB ± 0% +0.55% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/heavy.ndjson-12 23.3k ± 0% 23.3k ± 0% +0.00% (p=0.048 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
name old speed new speed delta
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/transactions.ndjson-12 50.8MB/s ± 4% 61.6MB/s ±11% +21.39% (p=0.016 n=4+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/events.ndjson-12 130MB/s ±21% 165MB/s ± 5% +26.46% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12 357MB/s ± 4% 321MB/s ±11% -10.14% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12 593MB/s ± 2% 631MB/s ± 1% +6.50% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12 428MB/s ± 1% 448MB/s ± 1% +4.68% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12 703MB/s ± 1% 744MB/s ± 1% +5.94% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions-huge_traces.ndjson-12 560MB/s ± 2% 581MB/s ± 2% +3.76% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions.ndjson-12 503MB/s ± 1% 522MB/s ± 1% +3.75% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12 518MB/s ± 1% 549MB/s ± 1% +5.99% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12 568MB/s ± 1% 586MB/s ± 1% +3.22% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12 569MB/s ± 1% 591MB/s ± 1% +3.93% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12 446MB/s ± 0% 465MB/s ± 2% +4.29% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12 842MB/s ± 1% 932MB/s ± 1% +10.67% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12 674MB/s ± 1% 740MB/s ± 1% +9.77% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_rum.ndjson-12 947MB/s ± 1% 1000MB/s ± 4% +5.62% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_transaction_id.ndjson-12 692MB/s ± 1% 712MB/s ± 1% +2.96% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12 538MB/s ± 1% 555MB/s ± 1% +3.22% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12 490MB/s ± 2% 513MB/s ± 2% +4.68% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event.ndjson-12 226MB/s ± 1% 235MB/s ± 1% +3.61% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12 541MB/s ± 1% 558MB/s ± 0% +3.09% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12 229MB/s ± 0% 240MB/s ± 0% +4.87% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12 895MB/s ± 1% 934MB/s ± 2% +4.38% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12 903MB/s ± 1% 942MB/s ± 2% +4.30% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata-null-values.ndjson-12 681MB/s ± 2% 704MB/s ± 2% +3.37% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata.ndjson-12 968MB/s ± 3% 1012MB/s ± 2% +4.56% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metricsets.ndjson-12 542MB/s ± 1% 570MB/s ± 1% +5.21% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal-service.ndjson-12 397MB/s ± 1% 414MB/s ± 2% +4.22% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal.ndjson-12 481MB/s ± 5% 505MB/s ± 1% +4.94% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/otel-bridge.ndjson-12 585MB/s ± 1% 604MB/s ± 2% +3.28% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12 674MB/s ± 3% 720MB/s ± 1% +6.83% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/span-links.ndjson-12 564MB/s ± 0% 580MB/s ± 3% +2.80% (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12 819MB/s ± 1% 851MB/s ± 1% +3.92% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions.ndjson-12 595MB/s ± 2% 615MB/s ± 2% +3.36% (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12 714MB/s ± 2% 734MB/s ± 1% +2.77% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum_2.ndjson-12 721MB/s ± 1% 745MB/s ± 1% +3.39% (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/unknown-span-type.ndjson-12 533MB/s ± 1% 548MB/s ± 1% +2.90% (p=0.008 n=5+5)
report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
:globe_with_meridians: Coverage report
| Name | Metrics % (covered/total) | Diff |
|---|---|---|
| Packages | 100.0% (43/43) | :green_heart: |
| Files | 92.0% (184/200) | :+1: 0.04 |
| Classes | 93.506% (432/462) | :+1: 0.071 |
| Methods | 89.268% (1098/1230) | :+1: 0.106 |
| Lines | 77.004% (13538/17581) | :+1: 0.141 |
| Conditionals | 100.0% (0/0) | :green_heart: |
I'm not too keen on inventing a new way to count failed transactions for a service, as it feels a bit like overfitting to the current service overview page.
I think 'overfitting' is not a negative here, as it's actually intentional.
The primary goal of service metrics is to produce dedicated pre-calculated metric timeseries for the overview page.
A secondary goal would be:
- Service Groups: attach enough metadata (not dimensions) to create curated query/grouping capabilities.
Did you already try grouping by event.outcome, and comparing the resulting storage and query efficiency?
I would be more cautious about introducing more dimensions than about introducing more metrics; we have no need to create additional timeseries per event.outcome (see the sketch below).
What's the main worry here? It's not necessarily a different way, but a new mechanism to deliver the same answer, with one being pre-calculated.
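A minimal sketch of that distinction, with illustrative Go types; the names are assumptions and not the actual model used by apm-server:

```go
// Illustrative only: two ways to capture failed transactions.

// (a) Extra dimension: event.outcome becomes part of the series key,
//     multiplying the number of timeseries per service.
type keyWithOutcomeDimension struct {
	ServiceName        string
	ServiceEnvironment string
	TransactionType    string
	EventOutcome       string // "success", "failure", or "unknown"
}

// (b) Extra metric: the series key stays unchanged and the outcome is
//     folded into counters on the existing document.
type metricsWithFailureCount struct {
	TransactionCount int64
	FailureCount     int64
}
```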
Conversely, although having min/max is nice and flexible, I would fit to the current use of aggregations. The aggregate_metric_double type is rather inflexible, and you have to know up front which subfields you may populate. In this case we know, but in the case of dynamic summary metrics we do not.
I'm not invested in keeping p0 and p100 around. They are however reasonably cheap to create and keep around. I would love to hear from the @elastic/machine-learning team whether they have any benefits for any of their jobs.
If we do need them, perhaps we should be storing a latency histogram instead.
I would not be opposed to doing both, but would not choose either histogram or aggregate_metric_double on its own. Key for us is reducing the number of values we need to aggregate over and pre-calculating as much as possible.
Today our service inventory page does not display p95/p99. I would love to add latency histograms to the emitted metrics to support that in a follow-up PR, though. The downside is that there is no play for histograms in TSDB, and introducing them would potentially delay Service Metrics utilizing TSDB while we work with the team to create a play for histograms.
Alternatively, we would need to emit two metric documents per service.name + service.environment + transaction.type.
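For context, a rough sketch of the two shapes under discussion; the type and field names are illustrative assumptions, not the actual mappings:

```go
// Illustrative only: a fixed set of summary sub-fields (the
// aggregate_metric_double style, known up front) versus a bucketed
// histogram from which percentiles such as p95/p99 can later be derived.
type latencySummary struct {
	Min, Max, Sum float64
	ValueCount    int64
}

type latencyHistogram struct {
	Values []float64 // bucket boundaries / representative latency values
	Counts []int64   // number of observations per bucket
}
```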
I think 'overfitting' is not a negative here, as it's actually intentional. The primary goal of service metrics is to produce dedicated pre-calculated metric timeseries for the overview page.
What I meant is that it's overfitting to the current service overview page. If it changes (say we want to add percentiles), then we need to adapt or introduce new metrics. Every addition bears a tech-debt cost, and data lives for a long time. So if we can make it a little more flexible then that would be nice.
Of course, there are always tradeoffs. If we do group by event.outcome then there will naturally be more docs, so it depends on the cost.
What's the main worry here? It's not necessarily a different way, but a new mechanism to deliver the same answer, with one being pre-calculated.
At the moment, the way the UI queries for failed transaction rate is by filtering on event.outcome: failed. By introducing a new field, the UI (and any other consumer) now needs to deal with two ways of calculating failed transaction rate. I'm questioning whether the increase in complexity is worthwhile.
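For illustration, the two query shapes the UI would then have to support might look roughly like the following; these are sketches with hypothetical field names (transaction.failure_count, transaction.success_count), not the actual Kibana queries:

```go
// Illustrative only: Elasticsearch query bodies as Go string constants.

// (a) Raw events / transaction metrics: derive the rate from event.outcome.
const failureRateFromOutcome = `{
  "size": 0,
  "aggs": {
    "outcomes": { "terms": { "field": "event.outcome" } }
  }
}`

// (b) Service metrics: read pre-calculated counters from new fields.
const failureRateFromCounters = `{
  "size": 0,
  "aggs": {
    "failures":  { "sum": { "field": "transaction.failure_count" } },
    "successes": { "sum": { "field": "transaction.success_count" } }
  }
}`
```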
Today our service inventory page does not display p95/p99. I would love to add latency histograms to the emitted metrics to support that in a follow-up PR, though. The downside is that there is no play for histograms in TSDB, and introducing them would potentially delay Service Metrics utilizing TSDB while we work with the team to create a play for histograms.
Do you have any numbers on the size of service metrics vs. transaction metrics? Is it worthwhile rolling up only service metrics in the short-to-mid term?
I realise we need to start somewhere, but it seems of limited value to users to roll up service metrics if we can't roll up the more expensive transaction metrics. So, on the whole, I think I'd prefer to go with histogram if it is eventually the ideal data type, even if it means we don't get rollups in the short term.
This pull request is now in conflicts. Could you fix it @Mpdreamz? 🙏 To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```sh
git fetch upstream
git checkout -b feature/service-metrics-aggregator upstream/feature/service-metrics-aggregator
git merge upstream/main
git push upstream feature/service-metrics-aggregator
```
At the moment, the way the UI queries for failed transaction rate is by filtering on event.outcome: failed. By introducing a new field, the UI (and any other consumer) now needs to deal with two ways of calculating failed transaction rate. I'm questioning whether the increase in complexity is worthwhile.
I think that as we progress with more data transform capabilities in the stack, the need to query either pre-aggregated or raw data will only increase. The UI already has to discover how to query today, even with a single common field.
Dimensions are progressively more costly, so I would love to keep them to an absolute minimum unless our desired functionality dictates otherwise.
Do you have any numbers on the size of service metrics vs. transaction metrics? Is it worthwhile rolling up only service metrics in the short-to-mid term?
Not in terms of number of documents (that is still an expectation), but it is also worthwhile as a means to gain confidence in onboarding our metrics indices to TSDB.
I realise we need to start somewhere, but it seems of limited value to users to roll up service metrics if we can't roll up the more expensive transaction metrics. So, on the whole, I think I'd prefer to go with histogram if it is eventually the ideal data type, even if it means we don't get rollups in the short term.
Yeah, TSDB/rollups are not a hard requirement at all, but I would still keep the initial Service Metrics implementation as minimal and simple as possible to prove its performance claims in the real world.
100% agree we need to include histograms in another iteration of this though.
What I meant is that it's overfitting to the current service overview page. If it changes (say we want to add percentiles), then we need to adapt or introduce new metrics.
I think this is less of an issue; we can include histograms later without it being a breaking change.
Just coming back to the failure_count vs. event.outcome debate...
I'd like us to try and find a way to address https://github.com/elastic/apm-server/issues/5243 at the same time. While looking at that, it occurred to me that just storing failure_count is not good enough. The failed transaction rate is calculated as failure / (failure + success), which excludes transactions with event.outcome: unknown. We would also need to store the number of successful transactions.
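A minimal sketch of the calculation described above, assuming both counters are stored (the function and parameter names are illustrative):

```go
// failedTransactionRate returns failure / (failure + success).
// Transactions with event.outcome "unknown" are excluded, which is why a
// failure count alone is not enough and a success count must be stored too.
func failedTransactionRate(successCount, failureCount int64) float64 {
	total := successCount + failureCount
	if total == 0 {
		return 0
	}
	return float64(failureCount) / float64(total)
}
```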
Moving this into draft while I merge main and make some updates.
@Mpdreamz Are we setting a metricset.name for the service metrics, e.g. metricset.name: "services"? Having a unique identifier for the service metrics is important for the UI when querying for them and determining whether they are available or not.
@sqren yep! https://github.com/elastic/apm-server/pull/8607/files#diff-4e538a1d3c8af2d271c66e8044baca93a083e0d85779d51cc31d4426a9ff0a95R300
It's currently `service`; should it be plural?
@sqren yep! https://github.com/elastic/apm-server/pull/8607/files#diff-4e538a1d3c8af2d271c66e8044baca93a083e0d85779d51cc31d4426a9ff0a95R300
It's currently `service`; should it be plural?
Perfect! Singular is fine.
This pull request is now in conflicts. Could you fix it @Mpdreamz? 🙏 To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```sh
git fetch upstream
git checkout -b feature/service-metrics-aggregator upstream/feature/service-metrics-aggregator
git merge upstream/main
git push upstream feature/service-metrics-aggregator
```