
Metrics generator stopped generating span metrics for a tenant

Open · suraciii opened this issue 2 years ago · 11 comments

Describe the bug

We have 4 tenants in a Tempo cluster. At some point in time (around 2022-09-05 17:17:00), one of those tenants stopped having span metrics generated; the other tenants have no such issue.

After we restarted the metrics-generator, it started generating metrics for this tenant again.

To Reproduce: Sorry, I haven't found a way to reproduce this issue.

Environment:

  • Infrastructure: Kubernetes

Additional Context

  1. Some related metrics:

rate(tempo_metrics_generator_spans_received_total{component="metrics-generator", tenant="infra"}[$__rate_interval]) (screenshot)

tempo_metrics_generator_registry_collections_total{component="metrics-generator", tenant="infra"} (screenshot)

rate(prometheus_remote_storage_sent_batch_duration_seconds_count{component="metrics-generator", tenant="infra"}[$__rate_interval]) (screenshot)

  2. No error/warning logs from metrics-generator at that time.

  3. Tenant's config:

infra:
    ingestion_rate_limit_bytes: 31457280
    ingestion_burst_size_bytes: 41943040
    max_traces_per_user: 50000
    metrics_generator_processors:
      - span-metrics
    block_retention: 168h

suraciii · Sep 06 '22 09:09

Hi @suraciii, thanks for reporting. Without any errors or logs, it's very difficult to know what's going on. I suspect that a loop somewhere could have stopped (e.g. a panic in a goroutine that wasn't caught). Could you share tempo_metrics_generator_registry_active_series for the same time period?
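For illustration, here is a minimal Go sketch of that failure mode (this is not Tempo's actual code; the function and field names are made up): if a per-tenant loop recovers from a panic without logging, or an error path simply returns, metrics stop for that tenant while the process stays healthy and nothing shows up in the logs.

    package main

    import (
        "log"
        "time"
    )

    // collectLoop is a stand-in for a per-tenant collection loop.
    // If collect() panics and the recover below stays silent, or if an
    // error path returns without logging, this tenant stops producing
    // metrics while every other tenant keeps working.
    func collectLoop(tenant string, collect func() error) {
        defer func() {
            if r := recover(); r != nil {
                // Remove this log line and the goroutine dies silently.
                log.Printf("collection loop for tenant %s stopped: %v", tenant, r)
            }
        }()

        ticker := time.NewTicker(15 * time.Second)
        defer ticker.Stop()

        for range ticker.C {
            if err := collect(); err != nil {
                return // exiting here without logging reproduces "no metrics, no errors"
            }
        }
    }

    func main() {
        go collectLoop("infra", func() error { return nil })
        time.Sleep(time.Minute)
    }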

mapno · Sep 06 '22 10:09

Hi @suraciii, thanks for reporting. Without any errors or logs, it's very difficult to know what's going on. I suspect that a loop somewhere could have stopped (e.g. a panic in a goroutine that wasn't caught). Could you share tempo_metrics_generator_registry_active_series for the same time period?

Thanks for your reply.

Here's the graph for tempo_metrics_generator_registry_active_series{component="metrics-generator", tenant="infra"} (screenshot).

suraciii · Sep 06 '22 10:09

I found a potential error that may not be logged: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L209-L212

I can add logs there and see if any messages are printed over the next few days.

suraciii · Sep 06 '22 10:09

I found a potential error that may not be logged: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L209-L212

I can add logs there and see if any messages are printed over the next few days.

That's a great find. Did you notice whether the log from the next line is missing during that time (i.e. whether the collection is exiting in commit)?

mapno · Sep 06 '22 11:09

I found a potential error that may not be logged: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L209-L212

I can add logs there and see if any messages are printed over the next few days.

The code is a bit convoluted, but the error should be captured and logged here: https://github.com/grafana/tempo/blob/v1.5.0/modules/generator/registry/registry.go#L181-L187. Since tempo_metrics_generator_registry_collections_total is being incremented, this deferred function is being called.
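For readers following the thread, the pattern being referenced looks roughly like this (a simplified, self-contained sketch, not the actual registry code): the collection function uses a named error return, and a deferred closure both counts the collection and logs the error, so an error returned early is still surfaced in the logs.

    package main

    import (
        "errors"
        "log"
    )

    // registry is a minimal stand-in type used only for this illustration.
    type registry struct {
        collections int // stands in for tempo_metrics_generator_registry_collections_total
    }

    // collectMetrics uses a named error return so the deferred closure can
    // both count the collection and log whatever error was set, regardless
    // of which return path was taken.
    func (r *registry) collectMetrics() (err error) {
        defer func() {
            r.collections++
            if err != nil {
                log.Printf("msg=\"collecting metrics failed\" err=%v", err)
            }
        }()

        // Any `return err` in the body is captured by the deferred closure
        // above, because err is a named return value.
        return errors.New("failed to append samples")
    }

    func main() {
        r := &registry{}
        _ = r.collectMetrics() // the deferred closure logs the error
        log.Printf("collections=%d", r.collections)
    }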

kvrhdn · Sep 06 '22 12:09

@mapno @kvrhdn My mistake, I missed the deferred function. And the "collecting metrics" messages were printed continuously:

sum by(app) (count_over_time({app="tempo", component="metrics-generator"} | logfmt | tenant = `infra` | msg = `collecting metrics` [1m])) (screenshot)

suraciii · Sep 07 '22 01:09

prometheus_tsdb_wal_segment_current vs prometheus_wal_watcher_current_segment (screenshot comparing the TSDB's current WAL segment with the segment the remote-write WAL watcher is reading)

suraciii · Sep 07 '22 03:09

Update:

I built a container image from the main branch for the metrics-generator and deployed it. The same issue happened again:

metrics: prometheus_remote_storage_samples_pending{component="metrics-generator",namespace="tempo",tenant="infra"} (screenshot)

logs: {app="tempo", component=~"metrics-generator"} | logfmt | tenant = `infra` | msg=`runShard timer ticked, sending buffered data` image

I suspect it's a deadlock issue at https://github.com/grafana/tempo/blob/v1.5.0/vendor/github.com/prometheus/prometheus/storage/remote/queue_manager.go#L1136, and it's possibly already fixed by https://github.com/prometheus/prometheus/pull/10395.
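As a rough sketch of the kind of deadlock being suspected (an illustration only, not the actual queue_manager code and not necessarily the exact bug fixed in prometheus/prometheus#10395): if a shard's send loop ever blocks on something that only the same loop can unblock, the "runShard timer ticked" log stops and prometheus_remote_storage_samples_pending keeps growing, which matches the graphs above.

    package main

    import (
        "fmt"
        "time"
    )

    // runShard is a toy shard loop, not the real Prometheus code. The loop
    // is the only reader of `batches`, but the timer branch also *writes*
    // to `batches`. The first tick therefore blocks forever on the send,
    // the ticker log stops, and buffered samples are never flushed.
    func runShard(batches chan []int) {
        pending := []int{1, 2, 3}
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()

        for {
            select {
            case b := <-batches:
                fmt.Println("sending batch", b)
            case <-ticker.C:
                fmt.Println("runShard timer ticked, sending buffered data")
                batches <- pending // blocks forever: nobody else reads this channel
            }
        }
    }

    func main() {
        go runShard(make(chan []int)) // unbuffered channel
        time.Sleep(time.Second)
        fmt.Println("shard goroutine is still blocked; no further ticker logs")
    }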

I've bumped the Prometheus dependency to v2.34.0 to see if this issue occurs again.

suraciii · Sep 13 '22 10:09

Hi, have you seen this issue reoccur after you updated the Prometheus dependency? If you don't mind, we would be grateful for a PR updating Prometheus 🙂

kvrhdn · Sep 21 '22 16:09

Hi, have you seen this issue reoccur after you updated the Prometheus dependency? If you don't mind, we would be grateful for a PR updating Prometheus 🙂

@kvrhdn The issue has not reoccurred since the Prometheus dependency was updated. I opened a PR for the dependency change.

suraciii · Sep 22 '22 04:09

Nice, that's great news 🙌🏻

kvrhdn · Sep 22 '22 12:09

Any idea when this will get cut into a release?

hobbsh · Oct 05 '22 17:10