
Metrics generator stopped generating span metrics for a tenant

Open · suraciii opened this issue 2 years ago · 11 comments

Describe the bug

We have 4 tenants in a Tempo cluster. At some point in time (around 2022-09-05 17:17:00), one of those tenants stopped having span metrics generated; the other tenants have no such issue.

After we restarted the metrics-generator, it started generating metrics for this tenant again.

To Reproduce: Sorry, I haven't found a way to reproduce this issue.

Environment:

  • Infrastructure: Kubernetes

Additional Context

  1. Some related metrics:

rate(tempo_metrics_generator_spans_received_total{component="metrics-generator", tenant="infra"}[$__rate_interval]) (screenshot)

tempo_metrics_generator_registry_collections_total{component="metrics-generator", tenant="infra"} (screenshot)

rate(prometheus_remote_storage_sent_batch_duration_seconds_count{component="metrics-generator", tenant="infra"}[$__rate_interval]) (screenshot)

  2. No error/warning logs from metrics-generator at that time.

  3. Tenant's config:

infra:
    ingestion_rate_limit_bytes: 31457280
    ingestion_burst_size_bytes: 41943040
    max_traces_per_user: 50000
    metrics_generator_processors:
      - span-metrics
    block_retention: 168h

suraciii · Sep 06 '22 09:09

Hi @suraciii, thanks for reporting. Without any errors or logs, it's very difficult to know what's going on. I suspect that a loop somewhere could have stopped (e.g. a panic in a goroutine that wasn't caught). Could you share tempo_metrics_generator_registry_active_series for the same time period?
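For illustration, here is a minimal Go sketch of that failure mode (this is not Tempo's actual code; the function and field names are made up): if a per-tenant loop recovers from a panic without logging, or an error path simply returns, metrics stop for that tenant while the process stays healthy and nothing shows up in the logs.

    package main

    import (
        "log"
        "time"
    )

    // collectLoop is a stand-in for a per-tenant collection loop.
    // If collect() panics and the recover below stays silent, or if an
    // error path returns without logging, this tenant stops producing
    // metrics while every other tenant keeps working.
    func collectLoop(tenant string, collect func() error) {
        defer func() {
            if r := recover(); r != nil {
                // Remove this log line and the goroutine dies silently.
                log.Printf("collection loop for tenant %s stopped: %v", tenant, r)
            }
        }()

        ticker := time.NewTicker(15 * time.Second)
        defer ticker.Stop()

        for range ticker.C {
            if err := collect(); err != nil {
                return // exiting here without logging reproduces "no metrics, no errors"
            }
        }
    }

    func main() {
        go collectLoop("infra", func() error { return nil })
        time.Sleep(time.Minute)
    }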

mapno · Sep 06 '22 10:09

Hi @suraciii, thanks for reporting. Without any errors or logs, it's very difficult to know what's going on. I suspect that a loop somewhere could have stopped (e.g. a panic in a goroutine that wasn't caught). Could you share tempo_metrics_generator_registry_active_series for the same time period?

Thanks for your reply.

Here's the graph for tempo_metrics_generator_registry_active_series{component="metrics-generator", tenant="infra"} (screenshot).

suraciii · Sep 06 '22 10:09

I found a potential error that may not be logged: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L209-L212

I can add logs there and see if any messages are printed over the next few days.

suraciii · Sep 06 '22 10:09

I found a potential error that may not be logged: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L209-L212

I can add logs there and see if any messages are printed over the next few days.

That's a great find. Did you notice whether the log from the next line is missing during that time (i.e. whether the collection is exiting in commit)?

mapno · Sep 06 '22 11:09

I found a potential error that may not be logged: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L209-L212

I can add logs there and see if any messages are printed over the next few days.

The code is a bit convoluted, but the error should be captured and logged here: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L181-L187. Since tempo_metrics_generator_registry_collections_total is being incremented, this deferred function is being called.
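For readers following the thread, the pattern being referenced looks roughly like this (a simplified, self-contained sketch, not the actual registry code): the collection function uses a named error return, and a deferred closure both counts the collection and logs the error, so an error returned early is still surfaced in the logs.

    package main

    import (
        "errors"
        "log"
    )

    // registry is a minimal stand-in type used only for this illustration.
    type registry struct {
        collections int // stands in for tempo_metrics_generator_registry_collections_total
    }

    // collectMetrics uses a named error return so the deferred closure can
    // both count the collection and log whatever error was set, regardless
    // of which return path was taken.
    func (r *registry) collectMetrics() (err error) {
        defer func() {
            r.collections++
            if err != nil {
                log.Printf("msg=\"collecting metrics failed\" err=%v", err)
            }
        }()

        // Any `return err` in the body is captured by the deferred closure
        // above, because err is a named return value.
        return errors.New("failed to append samples")
    }

    func main() {
        r := &registry{}
        _ = r.collectMetrics() // the deferred closure logs the error
        log.Printf("collections=%d", r.collections)
    }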

kvrhdn · Sep 06 '22 12:09

@mapno @kvrhdn My mistake, I missed the deferred function. And the "collecting metrics" messages were printed continuously:

sum by(app) (count_over_time({app="tempo", component="metrics-generator"} | logfmt | tenant = `infra` | msg = `collecting metrics` [1m])) (screenshot)

suraciii · Sep 07 '22 01:09

prometheus_tsdb_wal_segment_current vs prometheus_wal_watcher_current_segment (screenshot comparing the TSDB's current WAL segment with the segment the remote-write WAL watcher is reading)

suraciii · Sep 07 '22 03:09

Update:

I built a container image from the main branch for the metrics-generator and deployed it. The same issue happened again:

metrics: prometheus_remote_storage_samples_pending{component="metrics-generator",namespace="tempo",tenant="infra"} (screenshot)

logs: {app="tempo", component=~"metrics-generator"} | logfmt | tenant = `infra` | msg=`runShard timer ticked, sending buffered data` image

I suspect it's a deadlock issue at https://github.com/grafana/tempo/blob/v1.5.0/vendor/github.com/prometheus/prometheus/storage/remote/queue_manager.go#L1136, and it's possibly already fixed by https://github.com/prometheus/prometheus/pull/10395.
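As a rough sketch of the kind of deadlock being suspected (an illustration only, not the actual queue_manager code and not necessarily the exact bug fixed in prometheus/prometheus#10395): if a shard's send loop ever blocks on something that only the same loop can unblock, the "runShard timer ticked" log stops and prometheus_remote_storage_samples_pending keeps growing, which matches the graphs above.

    package main

    import (
        "fmt"
        "time"
    )

    // runShard is a toy shard loop, not the real Prometheus code. The loop
    // is the only reader of `batches`, but the timer branch also *writes*
    // to `batches`. The first tick therefore blocks forever on the send,
    // the ticker log stops, and buffered samples are never flushed.
    func runShard(batches chan []int) {
        pending := []int{1, 2, 3}
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()

        for {
            select {
            case b := <-batches:
                fmt.Println("sending batch", b)
            case <-ticker.C:
                fmt.Println("runShard timer ticked, sending buffered data")
                batches <- pending // blocks forever: nobody else reads this channel
            }
        }
    }

    func main() {
        go runShard(make(chan []int)) // unbuffered channel
        time.Sleep(time.Second)
        fmt.Println("shard goroutine is still blocked; no further ticker logs")
    }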

I've bumped the Prometheus dependency to v2.34.0 to see if this issue occurs again.

suraciii · Sep 13 '22 10:09

Hi, have you seen this issue reoccur after you updated the Prometheus dependency? If you don't mind, we would be grateful for a PR updating Prometheus 🙂

kvrhdn · Sep 21 '22 16:09

Hi, have you seen this issue reoccur after you updated the Prometheus dependency? If you don't mind, we would be grateful for a PR updating Prometheus 🙂

@kvrhdn The issue has not reoccurred since the Prometheus dependency was updated. I opened a PR for the dependency change.

suraciii · Sep 22 '22 04:09

Nice, that's great news 🙌🏻

kvrhdn · Sep 22 '22 12:09

Any idea when this will get cut into a release?

hobbsh · Oct 05 '22 17:10