
bloom-builder: "panic: duplicate metrics collector registration attempted"

diranged opened this issue 1 year ago · 4 comments

Describe the bug
We are trying to use the new bloom-builder and bloom-planner components introduced by @chaudum in https://github.com/grafana/loki/pull/14003, but even after creating the /var/loki volume (see https://github.com/grafana/loki/issues/14082), the builder crashes on startup with this error:

level=info ts=2024-09-09T17:23:51.681576259Z caller=main.go:126 msg="Starting Loki" version="(version=release-3.1.x-89fe788, branch=release-3.1.x, revision=89fe788d)"
level=info ts=2024-09-09T17:23:51.681628661Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2024-09-09T17:23:51.681644678Z caller=modules.go:748 component=bloomstore msg="no metas cache configured"
level=info ts=2024-09-09T17:23:51.681730499Z caller=blockscache.go:420 component=bloomstore msg="run ttl evict job"
level=info ts=2024-09-09T17:23:51.681753203Z caller=blockscache.go:380 component=bloomstore msg="run lru evict job"
level=info ts=2024-09-09T17:23:51.681816379Z caller=blockscache.go:365 component=bloomstore msg="run metrics collect job"
level=info ts=2024-09-09T17:23:51.686655187Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
panic: duplicate metrics collector registration attempted

goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x4737440, {0x4000f04610?, 0x0?, 0x0?})
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/registry.go:405 +0x78
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounter({{0x2fadad0?, 0x4737440?}}, {{0x25ded97, 0x4}, {0x0, 0x0}, {0x260e2dc, 0x14}, {0x261d67e, 0x18}, ...})
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:265 +0x128
github.com/grafana/loki/v3/pkg/storage/bloom/v1.NewMetrics({0x2fadad0, 0x4737440})
	/src/loki/pkg/storage/bloom/v1/metrics.go:62 +0x7c
github.com/grafana/loki/v3/pkg/bloombuild/builder.New({{0x6400000, 0x6400000, {0x0, 0x0}, 0x0, 0x0, 0x0, {0x5f5e100, 0x2540be400, 0xa}, ...}, ...}, ...)
	/src/loki/pkg/bloombuild/builder/builder.go:65 +0x154
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomBuilder(0x4000fe3008)
	/src/loki/pkg/loki/modules.go:1586 +0x2b4
github.com/grafana/dskit/modules.(*Manager).initModule(0x40000d88e8, {0xffffc8fbdc0a, 0xd}, 0x4001b78fe8, 0x4000b72b70)
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x194
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x40000d88e8, {0x4000c5cb10, 0x1, 0x1?})
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xb0
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0x4000fe3008, {0x0?, {0x4?, 0x2?, 0x4737aa0?}})
	/src/loki/pkg/loki/loki.go:458 +0x74
main.main()
	/src/loki/cmd/loki/main.go:129 +0x10ac
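For context on the panic itself: Loki's metric constructors use promauto, which calls MustRegister on a shared Prometheus registry, and MustRegister panics when a collector with the same fully-qualified name is already registered. The trace shows v1.NewMetrics being invoked from the bloom-builder init path against a registry that already holds those collectors, so the second registration panics. Below is a minimal sketch of the failure mode; the metric names are illustrative, not Loki's actual ones:

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newMetrics mimics a per-module metrics constructor: it registers a
// counter on the given registerer via promauto, which panics on duplicates.
func newMetrics(reg prometheus.Registerer) prometheus.Counter {
	return promauto.With(reg).NewCounter(prometheus.CounterOpts{
		Namespace: "loki",
		Name:      "bloom_blocks_created_total", // illustrative name
		Help:      "Number of bloom blocks created.",
	})
}

func main() {
	reg := prometheus.NewRegistry()
	_ = newMetrics(reg) // first registration succeeds
	_ = newMetrics(reg) // panic: duplicate metrics collector registration attempted
}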

diranged · Sep 09 '24 17:09

Hi @diranged, did you run Loki using the vanilla Helm chart?

I got a different panic (see https://github.com/grafana/loki/pull/14110), but could not reproduce the duplicate metrics registration.

chaudum · Sep 11 '24 09:09

I am able to reproduce this state now. Loki built from main does not have this issue, so it needs to be fixed on the release-3.1.x branch only.

chaudum · Sep 12 '24 10:09

Thank you for working to reproduce the issue!

diranged · Sep 12 '24 15:09

Hi, same issue on Helm [email protected]

fculpo · Oct 07 '24 14:10

Getting this panic now on the latest main-aec8e96 and k236-with-agg-metric-payload-fix-c5bd2ad tags.

vladst3f · Jan 14 '25 20:01

On k237:

level=debug ts=2025-01-17T20:44:08.946008947Z caller=index_set.go:316 table-name=loki_index_tsdb_20104 user-id=fake msg="syncing files for table loki_index_tsdb_20104"
panic: duplicate metrics collector registration attempted
goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x6b67fe0, {0xc0014a8420?, 0x3f59e1f?, 0x14?})
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/registry.go:406 +0x66
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounterVec({{0x46c7450?, 0x6b67fe0?}}, {{0x3f215d7, 0x4}, {0x3f59e1f, 0x14}, {0x3f41a0a, 0xe}, {0x3ffa55d, 0x32}, ...}, ...)
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:276 +0x163
github.com/grafana/loki/v3/pkg/bloomgateway.newClientMetrics({0x46c7450, 0x6b67fe0})
	/src/loki/pkg/bloomgateway/metrics.go:33 +0x9b
github.com/grafana/loki/v3/pkg/bloomgateway.NewClient({{0x37e11d600}, {0x6400000, 0x6400000, {0x0, 0x0}, 0x0, 0x0, 0x0, {0x5f5e100, 0x2540be400, ...}, ...}, ...}, ...)
	/src/loki/pkg/bloomgateway/client.go:147 +0xd4
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomBuilder(0xc0018c4000)
	/src/loki/pkg/loki/modules.go:1678 +0x72c
github.com/grafana/dskit/modules.(*Manager).initModule(0xc00104de90, {0x7ffc71a5079d, 0x7}, 0xc001ca9848, 0xc0012ba600)
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1ea
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0xc00104de90, {0xc001031090, 0x1, 0x7510c18f88e2e5ce?})
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xe8
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0xc0018c4000, {0x0?, {0x4?, 0x2?, 0x6b68760?}})
	/src/loki/pkg/loki/loki.go:491 +0x97
main.main()
	/src/loki/cmd/loki/main.go:129 +0x1305
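This second trace is the same class of bug on a different path: initBloomBuilder constructs bloom gateway client metrics that another module in the same backend process has already registered. A common defensive pattern for a shared registry (a generic sketch, not necessarily the fix applied in the PR linked below) is to catch prometheus.AlreadyRegisteredError and reuse the existing collector instead of panicking:

package main

import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

// getOrRegister registers c on reg, but if an identical collector is
// already registered it returns the existing one instead of panicking.
func getOrRegister(reg prometheus.Registerer, c prometheus.Collector) prometheus.Collector {
	if err := reg.Register(c); err != nil {
		var are prometheus.AlreadyRegisteredError
		if errors.As(err, &are) {
			return are.ExistingCollector
		}
		panic(err) // some other registration error
	}
	return c
}

func main() {
	reg := prometheus.NewRegistry()
	first := getOrRegister(reg, prometheus.NewCounter(
		prometheus.CounterOpts{Name: "demo_total", Help: "demo counter"}))
	// A second, identical collector resolves to the one registered first.
	second := getOrRegister(reg, prometheus.NewCounter(
		prometheus.CounterOpts{Name: "demo_total", Help: "demo counter"}))
	_ = first == second // true: both refer to the originally registered counter
}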

vladst3f · Jan 17 '25 20:01

@vladst3f Could you post your Loki config.yaml? Which -target do you run?

chaudum · Jan 24 '25 07:01

@chaudum, it's an SSD deployment, and it panics on the backend pods. The requested config, from the lab where I tested out the upgrade from 3.3.0, is:

config.yaml: |

    analytics:
      reporting_enabled: false
    auth_enabled: false
    bloom_build:
      builder:
        planner_address: loki-backend-headless.observability.svc.cluster.local:9095
      enabled: true
      planner:
        max_table_offset: 7
        planning_interval: 2h
        queue:
          max_queued_tasks_per_tenant: 300000
        retention:
          enabled: true
    bloom_gateway:
      block_query_concurrency: 12
      client:
        addresses: dns+loki-backend-headless.observability.svc.cluster.local:9095
      enabled: true
      max_outstanding_per_tenant: 10240
      num_multiplex_tasks: 512
      worker_concurrency: 6
    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_buffer: 500000
          writeback_goroutines: 1
          writeback_size_limit: 500MB
        default_validity: 18h
        memcached:
          batch_size: 256
          parallelism: 10
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.observability.svc
          consistent_hash: true
          max_idle_conns: 72
          timeout: 2000ms
    common:
      compactor_address: 'http://loki-backend:3100'
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          access_key_id: ${S3_ACCESS_KEY}
          bucketnames: ${S3_MULTIBUCKET}
          endpoint: ${S3_ENDPOINT}
          http_config:
            insecure_skip_verify: true
          insecure: false
          region: eu-west-1
          s3forcepathstyle: true
          secret_access_key: ${S3_SECRET_ACCESS_KEY}
    compactor:
      compaction_interval: 5m
      delete_batch_size: 2100
      delete_request_cancel_period: 10m
      delete_request_store: s3-multibucket
      max_compaction_parallelism: 2
      retention_delete_worker_count: 300
      retention_enabled: true
      upload_parallelism: 20
    frontend:
      log_queries_longer_than: 10s
      max_outstanding_per_tenant: 4096
      scheduler_address: ""
      tail_proxy_url: ""
    frontend_worker:
      scheduler_address: ""
    index_gateway:
      mode: simple
    ingester:
      chunk_encoding: snappy
      chunk_target_size: 4194304
      flush_op_timeout: 10m
      max_chunk_age: 168h
      wal:
        enabled: false
    limits_config:
      allow_structured_metadata: true
      bloom_creation_enabled: true
      bloom_gateway_enable_filtering: true
      cardinality_limit: 1000000
      discover_service_name:
      - service_name
      - job
      ingestion_burst_size_mb: 300
      ingestion_rate_mb: 200
      ingestion_rate_strategy: local
      max_cache_freshness_per_query: 5m
      max_entries_limit_per_query: 50000
      max_global_streams_per_user: 0
      max_line_size: 0
      max_querier_bytes_read: 0
      max_query_parallelism: 64
      max_query_series: 20000
      max_streams_matchers_per_query: 5000
      per_stream_rate_limit: 200MB
      per_stream_rate_limit_burst: 500MB
      query_timeout: 5m
      reject_old_samples: false
      reject_old_samples_max_age: 168h
      retention_period: 744h
      shard_streams:
        enabled: false
      split_queries_by_interval: 15m
      tsdb_max_query_parallelism: 300
      tsdb_sharding_strategy: bounded
      unordered_writes: true
      volume_enabled: true
    memberlist:
      cluster_label: loki
      join_members:
      - loki-memberlist
    pattern_ingester:
      enabled: true
    querier:
      max_concurrent: 10
      query_ingesters_within: 169h
    query_range:
      align_queries_with_step: true
      cache_results: true
      parallelise_shardable_queries: true
      results_cache:
        cache:
          background:
            writeback_buffer: 500000
            writeback_goroutines: 1
            writeback_size_limit: 500MB
          default_validity: 12h
          memcached_client:
            addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.observability.svc
            consistent_hash: true
            timeout: 500ms
            update_interval: 1m
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768
    ruler:
      alertmanager_url: http://prometheus-alertmanager-headless.monitoring-system.svc.cluster.local:9093/
      enable_alertmanager_v2: true
      enable_api: true
      enable_sharding: true
      evaluation:
        mode: remote
        query_frontend:
          address: dns:///loki-read.observability.svc.cluster.local.:9095
      external_url: 'REDACTED'
      remote_write:
        clients:
          prometheusReplica0:
            queue_config:
              capacity: 10000
              retry_on_http_429: true
            url: http://kps-prometheus-replica-0.monitoring-system.svc.cluster.local:9090/api/v1/write
          prometheusReplica1:
            queue_config:
              capacity: 10000
              retry_on_http_429: true
            url: http://kps-prometheus-replica-1.monitoring-system.svc.cluster.local:9090/api/v1/write
        enabled: true
      ring:
        kvstore:
          store: inmemory
      rule_path: /var/loki/scratch
      sharding_algo: by-rule
      storage:
        local:
          directory: /var/loki/rules
        type: local
      wal:
        dir: /var/loki/wal
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2023-05-01"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3
        row_shards: 32
        schema: v12
        store: tsdb
      - from: "2023-11-29"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3-multibucket
        row_shards: 32
        schema: v12
        store: tsdb
      - from: "2024-05-09"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3-multibucket
        row_shards: 32
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      grpc_server_max_concurrent_streams: 2000
      grpc_server_max_recv_msg_size: 90971520
      grpc_server_max_send_msg_size: 90971520
      http_listen_port: 3100
      http_server_idle_timeout: 20m
      http_server_read_timeout: 10m
      http_server_write_timeout: 10m
      log_level: debug
    storage_config:
      bloom_shipper:
        working_directory: /var/loki/data/blooms
      boltdb_shipper:
        index_gateway_client:
          server_address: ""
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      named_stores:
        aws:
          s3-multibucket:
            access_key_id: ${S3_ACCESS_KEY}
            bucketnames: ${S3_MULTIBUCKET}
            endpoint: ${S3_ENDPOINT}
            http_config:
              insecure_skip_verify: true
            region: eu-west-1
            s3forcepathstyle: true
            secret_access_key: ${S3_SECRET_ACCESS_KEY}
      tsdb_shipper:
        index_gateway_client:
          server_address: dns+loki-backend-headless.observability.svc.cluster.local:9095
    tracing:
      enabled: false

vladst3f · Jan 24 '25 08:01

Hi @chaudum, were you able to reproduce it? I can try some things out if you think this might be a configuration issue.

vladst3f · Jan 28 '25 10:01

@vladst3f I was able to reproduce the issue and pushed a fix: https://github.com/grafana/loki/pull/15994

chaudum · Jan 29 '25 09:01

Cheers @chaudum, much appreciated!

vladst3f · Jan 29 '25 09:01