Metrics generator stopped generating span metrics for a tenant
Describe the bug
We have 4 tenants in a Tempo cluster. At some point (around 2022-09-05 17:17:00), one of the tenants stopped having span metrics generated, while the other tenants had no such issue.
After we restarted the metrics-generator, it started generating metrics for this tenant again.
To Reproduce
Sorry, I haven't found a way to reproduce this issue.
Environment:
- Infrastructure: Kubernetes
Additional Context
- Some related metrics (graphs omitted):
  - rate(tempo_metrics_generator_spans_received_total{component="metrics-generator", tenant="infra"}[$__rate_interval])
  - tempo_metrics_generator_registry_collections_total{component="metrics-generator", tenant="infra"}
  - rate(prometheus_remote_storage_sent_batch_duration_seconds_count{component="metrics-generator", tenant="infra"}[$__rate_interval])
- No error/warning logs from the metrics-generator at that time
- Tenant's config:
  infra:
    ingestion_rate_limit_bytes: 31457280
    ingestion_burst_size_bytes: 41943040
    max_traces_per_user: 50000
    metrics_generator_processors:
      - span-metrics
    block_retention: 168h
Hi @suraciii, thanks for reporting. Without any errors or logs, it's very difficult to know what's going on. I suspect that a loop somewhere could have stopped (e.g. a panic in a goroutine that wasn't caught). Could you share tempo_metrics_generator_registry_active_series for the same time period?
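For context, here is a minimal, generic sketch of that suspected failure mode. None of this is Tempo code and all names are hypothetical; it only illustrates how a per-tenant collection loop can die quietly (e.g. via a recover that swallows a panic) while the process and the other tenants keep running normally.

```go
package main

import (
	"fmt"
	"time"
)

// startTenantLoop runs a periodic collection loop for one tenant in its own
// goroutine, the way a metrics generator might per tenant.
func startTenantLoop(tenant string, collect func() error) {
	go func() {
		// If collect() panics, this recover keeps the process alive, but the
		// loop below is gone for good. Unless this message is actually logged
		// at a visible level, the stoppage is invisible from the outside.
		defer func() {
			if r := recover(); r != nil {
				fmt.Printf("tenant %s: collection loop died: %v\n", tenant, r)
			}
		}()
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()
		for range ticker.C {
			if err := collect(); err != nil {
				fmt.Printf("tenant %s: collection failed: %v\n", tenant, err)
			}
		}
	}()
}

func main() {
	startTenantLoop("infra", func() error { panic("unexpected state") })
	startTenantLoop("other", func() error { return nil })

	time.Sleep(500 * time.Millisecond)
	fmt.Println("process is still up; only the infra loop has stopped")
}
```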
Thanks for your reply.
Here's the graph for tempo_metrics_generator_registry_active_series{component="metrics-generator", tenant="infra"}:
I found a potential error that may not be logged: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L209-L212
I can add logs there and see if any messages are printed over the next few days.
That's a great find. Do you notice if the log from the next line is missing during that time (i.e. collection is exiting in commit)?
The code is a bit convoluted but the error should be captured and logged here: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L181-L187
Since tempo_metrics_generator_registry_collections_total is being incremented, this defer function is being called.
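To make that argument concrete, here is a small, self-contained sketch of the defer pattern being described. The type and function names are hypothetical stand-ins, not Tempo's actual registry code; the point is that a named error return makes every failure, including a failed commit, visible to the outer deferred function that bumps the collections counter and logs.

```go
package main

import (
	"errors"
	"log"
)

// registrySketch stands in for the managed registry being discussed.
type registrySketch struct {
	collectionsTotal       int // think tempo_metrics_generator_registry_collections_total
	collectionsFailedTotal int
}

// collectMetrics uses a named error return, so every return path -- including
// an error set by the inner commit/rollback defer -- is observed by the outer
// deferred function, which always increments the counter and logs any error.
// That is why "collections_total keeps incrementing" implies an error on this
// path would have been logged.
func (r *registrySketch) collectMetrics() (err error) {
	defer func() {
		r.collectionsTotal++
		if err != nil {
			r.collectionsFailedTotal++
			log.Printf("collecting metrics failed: %v", err)
		}
	}()

	// Commit/rollback stage: defers run LIFO, so this executes before the
	// logging defer above, and any error it sets is still logged there.
	defer func() {
		if err != nil {
			return // rollback path; err was already set by the collection step
		}
		err = commit()
	}()

	return gatherSeries()
}

// gatherSeries and commit stand in for the real collection and remote-write
// append/commit steps.
func gatherSeries() error { return nil }
func commit() error       { return errors.New("commit failed") }

func main() {
	r := &registrySketch{}
	_ = r.collectMetrics()
	log.Printf("collections=%d failed=%d", r.collectionsTotal, r.collectionsFailedTotal)
}
```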
@mapno @kvrhdn My mistake, I missed the defer function. And the "collecting metrics" messages were printed continuously:
sum by(app) (count_over_time({app="tempo", component="metrics-generator"} | logfmt | tenant = `infra` | msg = `collecting metrics` [1m]))
prometheus_tsdb_wal_segment_current vs prometheus_wal_watcher_current_segment (graph omitted)
Update:
I built a container image of the metrics-generator from the main branch and deployed it. The same issue happened again:
metrics: prometheus_remote_storage_samples_pending{component="metrics-generator",namespace="tempo",tenant="infra"}
logs: {app="tempo", component=~"metrics-generator"} | logfmt | tenant = `infra` | msg=`runShard timer ticked, sending buffered data`
I suspect it's a deadlock issue at https://github.com/grafana/tempo/blob/v1.5.0/vendor/github.com/prometheus/prometheus/storage/remote/queue_manager.go#L1136, and it's possibly already fixed by https://github.com/prometheus/prometheus/pull/10395.
I've bumped the Prometheus dependency to v2.34.0 to see whether this issue occurs again.
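For readers unfamiliar with this kind of hang, the sketch below shows the general shape of such a stall in a few lines of Go. It is not the Prometheus queue_manager code and makes no claim about the exact upstream bug or fix; it only illustrates how a shard-style loop can stop doing its periodic work (so a log like "runShard timer ticked, sending buffered data" stops appearing) when a producer blocks on a channel send while holding a lock that the consumer's timer branch needs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type shard struct {
	mtx   sync.Mutex
	batch []int
	ch    chan int // unbuffered: sends block until the loop receives
}

// enqueue is the producer side; it holds the mutex while sending.
func (s *shard) enqueue(v int) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.ch <- v // blocks if the loop isn't currently receiving
}

// run is the consumer loop: on each tick it flushes the batch, which also
// needs the mutex.
func (s *shard) run() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case v := <-s.ch:
			s.batch = append(s.batch, v)
		case <-ticker.C:
			// Potential hang: the producer may be holding mtx while blocked on
			// s.ch above, and we now block on mtx, so neither side can make
			// progress and the periodic flush never happens again.
			s.mtx.Lock()
			fmt.Println("flushing", len(s.batch), "samples")
			s.batch = s.batch[:0]
			s.mtx.Unlock()
		}
	}
}

func main() {
	s := &shard{ch: make(chan int)}
	go s.run()
	// The enqueue loop and the ticker race; typically within a few ticks the
	// program wedges and the "flushing" output stops, mirroring the symptom
	// described in this issue.
	for i := 0; ; i++ {
		s.enqueue(i)
	}
}
```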
Hi, have you seen this issue reoccur after you updated the Prometheus dependency? If you don't mind, we would be grateful for a PR updating Prometheus 🙂
@kvrhdn This issue hasn't reoccurred since the Prometheus dependency was updated. I've opened a PR for the dependency change.
Nice, that's great news 🙌🏻
Any idea when this will get cut into a release?