Metrics-generator production readiness
A new component, the metrics-generator, was recently added to Tempo in https://github.com/grafana/tempo/pull/1282, but it is not considered production-ready yet. This issue lists the additional work needed before its official release.
The first implementation is based on the design document https://github.com/grafana/tempo/pull/1206. At the time of writing, the design document is still open, and we encourage everyone to take a look.
All feedback is highly appreciated. Feel free to share it in this issue, by pinging us (@kvrhdn and @mapno) on Slack, or by opening PRs with changes and proposals.
TODOs
The following list contains all tasks that we aim to complete before the component's official release. Tasks are divided by topic or area within Tempo.
Distributor
- [ ] Include metrics-generator in distributor tests
- [x] Improve how we stream data from distributor -> metrics-generator
- Use a workers/channels mechanism to push traces to the metrics-generators instead of spawning goroutines (see the sketch after this list). Discussed in https://github.com/grafana/tempo/pull/1282#discussion_r804943641
- Design queueing and retries mechanisms
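
As a rough illustration of the workers/channels idea, here is a minimal Go sketch (all names are hypothetical, this is not the actual distributor code): trace batches go into a bounded channel serviced by a fixed pool of workers and are dropped (and counted) when the queue is full, so the distributor's write path never blocks.

```go
package main

import (
	"context"
	"sync"
	"sync/atomic"
)

// spanBatch and generatorClient are hypothetical stand-ins for the real
// trace data and metrics-generator client types.
type spanBatch struct{ traceID []byte }

type generatorClient interface {
	PushSpans(ctx context.Context, batch *spanBatch) error
}

// generatorForwarder pushes traces to the metrics-generator through a bounded
// queue serviced by a fixed pool of workers, instead of spawning a goroutine
// per push request.
type generatorForwarder struct {
	client  generatorClient
	queue   chan *spanBatch
	wg      sync.WaitGroup
	dropped int64 // would be a Prometheus counter in the real implementation
}

func newGeneratorForwarder(client generatorClient, queueSize, workers int) *generatorForwarder {
	f := &generatorForwarder{
		client: client,
		queue:  make(chan *spanBatch, queueSize),
	}
	for i := 0; i < workers; i++ {
		f.wg.Add(1)
		go f.worker()
	}
	return f
}

// push enqueues a batch without blocking: if the queue is full the batch is
// dropped (and counted) so the write path is never held up.
func (f *generatorForwarder) push(batch *spanBatch) {
	select {
	case f.queue <- batch:
	default:
		atomic.AddInt64(&f.dropped, 1)
	}
}

func (f *generatorForwarder) worker() {
	defer f.wg.Done()
	for batch := range f.queue {
		// errors would be logged/counted; retry behaviour is part of the open design question
		_ = f.client.PushSpans(context.Background(), batch)
	}
}

// stop closes the queue and waits for in-flight pushes to finish.
func (f *generatorForwarder) stop() {
	close(f.queue)
	f.wg.Wait()
}
```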
Trace format
- [ ] Use a custom trace format (i.e. not OTLP): something more lightweight that contains only minimal info. We should benchmark and compare performance.
Metrics generator
- [x] Make processors configurable per tenant and reload config at run-time (using overrides)
- [ ] We might be ingesting old spans (i.e. spans that took a long time to reach Tempo); we should filter them out or count them correctly (see the sketch below)
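
A minimal sketch of what filtering old spans could look like, assuming we compare the span's end timestamp (OTLP stores it as nanoseconds since the Unix epoch) against a hypothetical cutoff:

```go
package main

import "time"

// maxSpanAge is a hypothetical cutoff; spans whose end time is older than
// this are considered "old" and would be filtered out or counted separately.
const maxSpanAge = 30 * time.Second

// isOldSpan reports whether a span ended too long ago to be included in the
// current collection cycle. endTimeUnixNano is the span's end timestamp as
// found in OTLP data.
func isOldSpan(endTimeUnixNano uint64, now time.Time) bool {
	end := time.Unix(0, int64(endTimeUnixNano))
	return now.Sub(end) > maxSpanAge
}
```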
Spanmetrics processor
- [ ] Verify correctness, compare with agent / OTel processor
- [x] Ensure dimensions/labels map to valid Prometheus labels
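
For context, Prometheus label names must match `[a-zA-Z_][a-zA-Z0-9_]*`, so span attribute keys like `http.status_code` have to be rewritten before they can be used as labels. A small sketch of such a sanitizer (the helper name is hypothetical):

```go
package main

// sanitizeLabelName rewrites an arbitrary dimension name into a valid
// Prometheus label name: invalid characters become '_' and a leading digit
// gets a '_' prefix. E.g. "http.status_code" -> "http_status_code".
func sanitizeLabelName(name string) string {
	out := []rune(name)
	for i, r := range out {
		valid := r == '_' ||
			(r >= 'a' && r <= 'z') ||
			(r >= 'A' && r <= 'Z') ||
			(r >= '0' && r <= '9')
		if !valid {
			out[i] = '_'
		}
	}
	if len(out) > 0 && out[0] >= '0' && out[0] <= '9' {
		return "_" + string(out)
	}
	return string(out)
}
```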
Service graphs processor
- [ ] Verify correctness, compare with agent / OTel processor
- [ ] Investigate code reusability between Tempo and the Agent.
- [x] Make dimensions/labels we include configurable
- https://github.com/grafana/tempo/pull/1361
- [x] Add a metric to track expired spans
- https://github.com/grafana/tempo/pull/1361
Metrics registry
- [x] Deal with inactive/stale series: if a metrics-generator instance is running for multiple days, the number of samples it writes per request will grow continuously. We should trim this now and then by removing time series that have not been updated recently (see the sketch after this list).
- [x] Enforce max active series limits per tenant
- [x] Make collection interval configurable per tenant
- https://github.com/grafana/tempo/pull/1340
- [x] Support exemplars
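
A minimal sketch of the stale-series idea (all names are hypothetical): the registry records when each series was last updated and, on every collection cycle, drops series that haven't been touched within a configurable staleness duration, so the number of samples written per request stops growing indefinitely.

```go
package main

import (
	"sync"
	"time"
)

// series holds the state the registry tracks per active time series.
type series struct {
	value       float64
	lastUpdated time.Time
}

// registry is a tiny stand-in for the metrics registry: it tracks active
// series and removes the ones that have gone stale.
type registry struct {
	mtx           sync.Mutex
	activeSeries  map[string]*series
	staleDuration time.Duration
}

func newRegistry(staleDuration time.Duration) *registry {
	return &registry{
		activeSeries:  map[string]*series{},
		staleDuration: staleDuration,
	}
}

// observe updates (or creates) a series and refreshes its lastUpdated time.
func (r *registry) observe(labels string, value float64, now time.Time) {
	r.mtx.Lock()
	defer r.mtx.Unlock()

	s, ok := r.activeSeries[labels]
	if !ok {
		// this is also where a per-tenant max-active-series limit would be enforced
		s = &series{}
		r.activeSeries[labels] = s
	}
	s.value = value
	s.lastUpdated = now
}

// removeStaleSeries runs on every collection cycle and drops series that
// have not been updated within staleDuration.
func (r *registry) removeStaleSeries(now time.Time) {
	r.mtx.Lock()
	defer r.mtx.Unlock()

	for labels, s := range r.activeSeries {
		if now.Sub(s.lastUpdated) > r.staleDuration {
			delete(r.activeSeries, labels)
		}
	}
}
```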
Remote write
- [ ] ~~Parallelise write requests~~
- [ ] ~~Make remote write requests max size configurable.~~
- [x] Use a queue to store samples in the appender; if we add too many samples, discard the oldest (see the sketch after this list)
- #1323
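
A rough sketch of the bounded-queue idea from #1323 (types and names are hypothetical): the appender buffers samples up to a fixed capacity and, once full, discards the oldest sample so memory stays bounded even when remote write falls behind.

```go
package main

// sample is a stand-in for a remote-write sample (labels, timestamp, value).
type sample struct {
	labels    string
	timestamp int64
	value     float64
}

// boundedQueue keeps at most capacity samples; appending to a full queue
// discards the oldest sample instead of growing without bound.
type boundedQueue struct {
	capacity  int
	samples   []sample
	discarded int // would be exposed as a Prometheus counter
}

func newBoundedQueue(capacity int) *boundedQueue {
	return &boundedQueue{capacity: capacity}
}

func (q *boundedQueue) append(s sample) {
	if len(q.samples) >= q.capacity {
		// drop the oldest sample to make room for the newest one
		q.samples = q.samples[1:]
		q.discarded++
	}
	q.samples = append(q.samples, s)
}
```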
Operations
- [ ] Add a dedicated operational dashboard to tempo-mixin
- [x] Add a docker-compose example
- Update the helm charts
- [ ] tempo single-binary
- [ ] tempo distributed
Documentation
- [x] Include metrics-generator on configuration page
- https://github.com/grafana/tempo/pull/1328
- [x] Document metrics-generator ring on ring page
- [ ] Reconcile service graphs and spanmetrics docs between Tempo and the Agent
- [ ] Operational information
- Describe how to deploy it and what tradeoffs there are
- Describe what metrics to watch
Bug 🐛 : collections/remote writes double after a while
Observed this in one of our clusters: after a while the number of collections doubles or even triples. I'm guessing we are somehow spawning extra goroutines that collect metrics. Workaround: restart the instance 🤷🏻
These instances are configured with `collection_interval: 15s`, so we expect 4 remote writes per minute.
To properly solve this, we should probably redesign the registry so that collections are reliable and independent per tenant; one possible shape is sketched below.
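
As a sketch only (names are hypothetical): each tenant gets exactly one collection loop, guarded so it cannot be started twice, and driven by a single ticker at the tenant's `collection_interval`.

```go
package main

import (
	"sync"
	"time"
)

// tenantCollector runs exactly one collection loop per tenant; start is
// idempotent so a config reload cannot accidentally spawn a second loop
// (the suspected cause of doubled remote writes).
type tenantCollector struct {
	interval time.Duration
	collect  func() // pushes the tenant's samples towards remote write

	startOnce sync.Once
	stopCh    chan struct{}
}

func newTenantCollector(interval time.Duration, collect func()) *tenantCollector {
	return &tenantCollector{
		interval: interval,
		collect:  collect,
		stopCh:   make(chan struct{}),
	}
}

// start launches the collection loop at most once, no matter how often it is called.
func (c *tenantCollector) start() {
	c.startOnce.Do(func() {
		go func() {
			ticker := time.NewTicker(c.interval)
			defer ticker.Stop()
			for {
				select {
				case <-ticker.C:
					c.collect()
				case <-c.stopCh:
					return
				}
			}
		}()
	})
}

func (c *tenantCollector) stop() {
	close(c.stopCh)
}
```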
Bug 🐛 : single-binary mode crashes due to duplicate metrics registration
The metrics-generator and the compactor both use the dskit BasicLifecycler. When run in single-binary mode, both lifecyclers run in the same process and register the same metrics twice.
The following calls are causing the issue: `ring.NewBasicLifecycler` and `ring.NewWithStoreClientAndStrategy`.
We should tweak the dskit ring to ensure the metrics only get registered once: https://github.com/grafana/dskit/issues/133. A possible workaround is sketched after the stack trace below.
```
panic: duplicate metrics collector registration attempted
goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*wrappingRegisterer).MustRegister(0x4000492570, {0x40003ff9b0, 0x1, 0x1})
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/prometheus/client_golang/prometheus/wrap.go:104 +0x18c
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewGaugeVec(
{{0x1e19ec0, 0x4000492570}}, {{0x0, 0x0}, {0x0, 0x0}, {0x19590e6, 0x1d}, {0x197fa60, 0x2b}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:308 +0xd8
github.com/grafana/dskit/ring.NewWithStoreClientAndStrategy({{{0x1930457, 0x8}, {0x1934841, 0xb}, {{{...}, {...}, 0x4a817c800, 0x0, 0x3ff0000000000000, 0x1, ...}, ...}, ...}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/ring/ring.go:228 +0x158
github.com/grafana/dskit/ring.New({{{0x1930457, 0x8}, {0x1934841, 0xb}, {{{...}, {...}, 0x4a817c800, 0x0, 0x3ff0000000000000, 0x1, ...}, ...}, ...}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/ring/ring.go:213 +0x28c
github.com/grafana/tempo/pkg/ring.New({{{0x1930457, 0x8}, {0x1934841, 0xb}, {{{...}, {...}, 0x4a817c800, 0x0, 0x3ff0000000000000, 0x1, ...}, ...}, ...}, ...}, ...)
/Users/koenraad/Repositories/grafana/tempo/pkg/ring/ring.go:23 +0x170
github.com/grafana/tempo/cmd/tempo/app.(*App).initGeneratorRing(0x40006df500)
/Users/koenraad/Repositories/grafana/tempo/cmd/tempo/app/modules.go:93 +0xb0
github.com/grafana/dskit/modules.(*Manager).initModule(0x4000203638, {0x400011edb0, 0x3}, 0x400071d630, 0x4000194c60)
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/modules/modules.go:106 +0x250
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x4000203638, {0x400071d7e8, 0x1, 0x1})
/Users/koenraad/Repositories/grafana/tempo/vendor/github.com/grafana/dskit/modules/modules.go:78 +0xe4
github.com/grafana/tempo/cmd/tempo/app.(*App).Run(0x40006df500)
/Users/koenraad/Repositories/grafana/tempo/cmd/tempo/app/app.go:261 +0x1f4
main.main()
/Users/koenraad/Repositories/grafana/tempo/cmd/tempo/main.go:108 +0xa18
```
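
Until that dskit change lands, a common workaround for this class of problem is to catch `prometheus.AlreadyRegisteredError` and reuse the existing collector instead of panicking; a sketch (not how dskit currently does it):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// registerOrGet registers the collector, but if an identical collector was
// already registered (e.g. by another lifecycler in single-binary mode) it
// returns the existing one instead of panicking.
func registerOrGet(reg prometheus.Registerer, c prometheus.Collector) prometheus.Collector {
	if err := reg.Register(c); err != nil {
		if are, ok := err.(prometheus.AlreadyRegisteredError); ok {
			return are.ExistingCollector
		}
		panic(err)
	}
	return c
}
```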
I think we can close this issue since we're no longer tracking this list. Feel free to re-open it or tackle any of the items still left to complete.