bloom-builder: "panic: duplicate metrics collector registration attempted"
Describe the bug
Trying to use the new bloom-builder and bloom-planner components introduced by @chaudum in https://github.com/grafana/loki/pull/14003. Even after we create the /var/loki volume (see https://github.com/grafana/loki/issues/14082), the builder crashes on startup with this error:
level=info ts=2024-09-09T17:23:51.681576259Z caller=main.go:126 msg="Starting Loki" version="(version=release-3.1.x-89fe788, branch=release-3.1.x, revision=89fe788d)"
level=info ts=2024-09-09T17:23:51.681628661Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2024-09-09T17:23:51.681644678Z caller=modules.go:748 component=bloomstore msg="no metas cache configured"
level=info ts=2024-09-09T17:23:51.681730499Z caller=blockscache.go:420 component=bloomstore msg="run ttl evict job"
level=info ts=2024-09-09T17:23:51.681753203Z caller=blockscache.go:380 component=bloomstore msg="run lru evict job"
level=info ts=2024-09-09T17:23:51.681816379Z caller=blockscache.go:365 component=bloomstore msg="run metrics collect job"
level=info ts=2024-09-09T17:23:51.686655187Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
panic: duplicate metrics collector registration attempted
goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x4737440, {0x4000f04610?, 0x0?, 0x0?})
/src/loki/vendor/github.com/prometheus/client_golang/prometheus/registry.go:405 +0x78
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounter({{0x2fadad0?, 0x4737440?}}, {{0x25ded97, 0x4}, {0x0, 0x0}, {0x260e2dc, 0x14}, {0x261d67e, 0x18}, ...})
/src/loki/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:265 +0x128
github.com/grafana/loki/v3/pkg/storage/bloom/v1.NewMetrics({0x2fadad0, 0x4737440})
/src/loki/pkg/storage/bloom/v1/metrics.go:62 +0x7c
github.com/grafana/loki/v3/pkg/bloombuild/builder.New({{0x6400000, 0x6400000, {0x0, 0x0}, 0x0, 0x0, 0x0, {0x5f5e100, 0x2540be400, 0xa}, ...}, ...}, ...)
/src/loki/pkg/bloombuild/builder/builder.go:65 +0x154
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomBuilder(0x4000fe3008)
/src/loki/pkg/loki/modules.go:1586 +0x2b4
github.com/grafana/dskit/modules.(*Manager).initModule(0x40000d88e8, {0xffffc8fbdc0a, 0xd}, 0x4001b78fe8, 0x4000b72b70)
/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x194
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x40000d88e8, {0x4000c5cb10, 0x1, 0x1?})
/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xb0
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0x4000fe3008, {0x0?, {0x4?, 0x2?, 0x4737aa0?}})
/src/loki/pkg/loki/loki.go:458 +0x74
main.main()
/src/loki/cmd/loki/main.go:129 +0x10ac
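For reference, the trace points at v1.NewMetrics creating its counters through promauto against a registry that already holds identically named collectors. This failure mode is easy to reproduce in isolation; here is a minimal sketch (the metric name is illustrative, not one of Loki's actual metrics):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newMetrics mimics a component constructor that registers its collectors
// on the given registerer, the way v1.NewMetrics does.
func newMetrics(reg prometheus.Registerer) prometheus.Counter {
	// promauto.With(reg).NewCounter calls MustRegister internally, which
	// panics if an identical collector is already registered.
	return promauto.With(reg).NewCounter(prometheus.CounterOpts{
		Name: "example_total", // illustrative name, not a real Loki metric
		Help: "An example counter.",
	})
}

func main() {
	reg := prometheus.NewRegistry()
	_ = newMetrics(reg) // first registration succeeds
	_ = newMetrics(reg) // second registration on the same registry:
	// panic: duplicate metrics collector registration attempted
}
```

Because promauto delegates straight to MustRegister, constructing the same component metrics twice against one registry is enough to crash the process at startup.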
Hi @diranged, did you run Loki using the vanilla Helm chart?
I got a different panic (see https://github.com/grafana/loki/pull/14110), but could not reproduce the duplicate metrics registration.
I am able to reproduce this state now. Loki built from main does not have this issue, so it needs to be fixed on the release-3.1.x branch only.
Thank you for working to reproduce the issue!
Hi, same issue on Helm [email protected]
Getting this panic now on the latest main-aec8e96 and k236-with-agg-metric-payload-fix-c5bd2ad tags.
On k237:
level=debug ts=2025-01-17T20:44:08.946008947Z caller=index_set.go:316 table-name=loki_index_tsdb_20104 user-id=fake msg="syncing files for table loki_index_tsdb_20104"
panic: duplicate metrics collector registration attempted
goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x6b67fe0, {0xc0014a8420?, 0x3f59e1f?, 0x14?})
/src/loki/vendor/github.com/prometheus/client_golang/prometheus/registry.go:406 +0x66
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewCounterVec({{0x46c7450?, 0x6b67fe0?}}, {{0x3f215d7, 0x4}, {0x3f59e1f, 0x14}, {0x3f41a0a, 0xe}, {0x3ffa55d, 0x32}, ...}, ...)
/src/loki/vendor/github.com/prometheus/client_golang/prometheus/promauto/auto.go:276 +0x163
github.com/grafana/loki/v3/pkg/bloomgateway.newClientMetrics({0x46c7450, 0x6b67fe0})
/src/loki/pkg/bloomgateway/metrics.go:33 +0x9b
github.com/grafana/loki/v3/pkg/bloomgateway.NewClient({{0x37e11d600}, {0x6400000, 0x6400000, {0x0, 0x0}, 0x0, 0x0, 0x0, {0x5f5e100, 0x2540be400, ...}, ...}, ...}, ...)
/src/loki/pkg/bloomgateway/client.go:147 +0xd4
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomBuilder(0xc0018c4000)
/src/loki/pkg/loki/modules.go:1678 +0x72c
github.com/grafana/dskit/modules.(*Manager).initModule(0xc00104de90, {0x7ffc71a5079d, 0x7}, 0xc001ca9848, 0xc0012ba600)
/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1ea
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0xc00104de90, {0xc001031090, 0x1, 0x7510c18f88e2e5ce?})
/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xe8
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0xc0018c4000, {0x0?, {0x4?, 0x2?, 0x6b68760?}})
/src/loki/pkg/loki/loki.go:491 +0x97
main.main()
/src/loki/cmd/loki/main.go:129 +0x1305
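Both traces share the same shape: initBloomBuilder constructs metrics (here via bloomgateway.NewClient → newClientMetrics) that another module running in the same process has already registered on the shared registry. The standard mitigation in client_golang is to make registration idempotent by catching prometheus.AlreadyRegisteredError and reusing the existing collector; below is a sketch of that generic pattern (not necessarily the fix that ultimately landed):

```go
package main

import (
	"errors"
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// mustRegisterOrGet registers c on reg. If an identical collector is already
// registered, it returns the existing one instead of panicking, using the
// AlreadyRegisteredError type exported by client_golang.
func mustRegisterOrGet(reg prometheus.Registerer, c prometheus.Collector) prometheus.Collector {
	if err := reg.Register(c); err != nil {
		are := prometheus.AlreadyRegisteredError{}
		if errors.As(err, &are) {
			// Another module in the same process registered this collector
			// first; share it rather than crashing on startup.
			return are.ExistingCollector
		}
		panic(err) // a genuinely invalid collector is still a hard error
	}
	return c
}

func main() {
	reg := prometheus.NewRegistry()
	opts := prometheus.CounterOpts{Name: "example_total", Help: "An example counter."}

	// Both calls succeed; the second returns the collector from the first.
	a := mustRegisterOrGet(reg, prometheus.NewCounter(opts)).(prometheus.Counter)
	b := mustRegisterOrGet(reg, prometheus.NewCounter(opts)).(prometheus.Counter)
	a.Inc()
	fmt.Println(a == b) // true: same underlying collector
}
```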
@vladst3f Could you post your Loki config.yaml? What -target= do you run?
@chaudum, it's an SSD deployment, and it panics on the backend pods.
The requested config of the lab where I tested out the upgrade from 3.3.0 is:
config.yaml: |
  analytics:
    reporting_enabled: false
  auth_enabled: false
  bloom_build:
    builder:
      planner_address: loki-backend-headless.observability.svc.cluster.local:9095
    enabled: true
    planner:
      max_table_offset: 7
      planning_interval: 2h
      queue:
        max_queued_tasks_per_tenant: 300000
      retention:
        enabled: true
  bloom_gateway:
    block_query_concurrency: 12
    client:
      addresses: dns+loki-backend-headless.observability.svc.cluster.local:9095
    enabled: true
    max_outstanding_per_tenant: 10240
    num_multiplex_tasks: 512
    worker_concurrency: 6
  chunk_store_config:
    chunk_cache_config:
      background:
        writeback_buffer: 500000
        writeback_goroutines: 1
        writeback_size_limit: 500MB
      default_validity: 18h
      memcached:
        batch_size: 256
        parallelism: 10
      memcached_client:
        addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.observability.svc
        consistent_hash: true
        max_idle_conns: 72
        timeout: 2000ms
  common:
    compactor_address: 'http://loki-backend:3100'
    path_prefix: /var/loki
    replication_factor: 3
    storage:
      s3:
        access_key_id: ${S3_ACCESS_KEY}
        bucketnames: ${S3_MULTIBUCKET}
        endpoint: ${S3_ENDPOINT}
        http_config:
          insecure_skip_verify: true
        insecure: false
        region: eu-west-1
        s3forcepathstyle: true
        secret_access_key: ${S3_SECRET_ACCESS_KEY}
  compactor:
    compaction_interval: 5m
    delete_batch_size: 2100
    delete_request_cancel_period: 10m
    delete_request_store: s3-multibucket
    max_compaction_parallelism: 2
    retention_delete_worker_count: 300
    retention_enabled: true
    upload_parallelism: 20
  frontend:
    log_queries_longer_than: 10s
    max_outstanding_per_tenant: 4096
    scheduler_address: ""
    tail_proxy_url: ""
  frontend_worker:
    scheduler_address: ""
  index_gateway:
    mode: simple
  ingester:
    chunk_encoding: snappy
    chunk_target_size: 4194304
    flush_op_timeout: 10m
    max_chunk_age: 168h
    wal:
      enabled: false
  limits_config:
    allow_structured_metadata: true
    bloom_creation_enabled: true
    bloom_gateway_enable_filtering: true
    cardinality_limit: 1000000
    discover_service_name:
      - service_name
      - job
    ingestion_burst_size_mb: 300
    ingestion_rate_mb: 200
    ingestion_rate_strategy: local
    max_cache_freshness_per_query: 5m
    max_entries_limit_per_query: 50000
    max_global_streams_per_user: 0
    max_line_size: 0
    max_querier_bytes_read: 0
    max_query_parallelism: 64
    max_query_series: 20000
    max_streams_matchers_per_query: 5000
    per_stream_rate_limit: 200MB
    per_stream_rate_limit_burst: 500MB
    query_timeout: 5m
    reject_old_samples: false
    reject_old_samples_max_age: 168h
    retention_period: 744h
    shard_streams:
      enabled: false
    split_queries_by_interval: 15m
    tsdb_max_query_parallelism: 300
    tsdb_sharding_strategy: bounded
    unordered_writes: true
    volume_enabled: true
  memberlist:
    cluster_label: loki
    join_members:
      - loki-memberlist
  pattern_ingester:
    enabled: true
  querier:
    max_concurrent: 10
    query_ingesters_within: 169h
  query_range:
    align_queries_with_step: true
    cache_results: true
    parallelise_shardable_queries: true
    results_cache:
      cache:
        background:
          writeback_buffer: 500000
          writeback_goroutines: 1
          writeback_size_limit: 500MB
        default_validity: 12h
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.observability.svc
          consistent_hash: true
          timeout: 500ms
          update_interval: 1m
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768
  ruler:
    alertmanager_url: http://prometheus-alertmanager-headless.monitoring-system.svc.cluster.local:9093/
    enable_alertmanager_v2: true
    enable_api: true
    enable_sharding: true
    evaluation:
      mode: remote
      query_frontend:
        address: dns:///loki-read.observability.svc.cluster.local.:9095
    external_url: 'REDACTED'
    remote_write:
      clients:
        prometheusReplica0:
          queue_config:
            capacity: 10000
            retry_on_http_429: true
          url: http://kps-prometheus-replica-0.monitoring-system.svc.cluster.local:9090/api/v1/write
        prometheusReplica1:
          queue_config:
            capacity: 10000
            retry_on_http_429: true
          url: http://kps-prometheus-replica-1.monitoring-system.svc.cluster.local:9090/api/v1/write
      enabled: true
    ring:
      kvstore:
        store: inmemory
    rule_path: /var/loki/scratch
    sharding_algo: by-rule
    storage:
      local:
        directory: /var/loki/rules
      type: local
    wal:
      dir: /var/loki/wal
  runtime_config:
    file: /etc/loki/runtime-config/runtime-config.yaml
  schema_config:
    configs:
      - from: "2023-05-01"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3
        row_shards: 32
        schema: v12
        store: tsdb
      - from: "2023-11-29"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3-multibucket
        row_shards: 32
        schema: v12
        store: tsdb
      - from: "2024-05-09"
        index:
          period: 24h
          prefix: loki_index_tsdb_
        object_store: s3-multibucket
        row_shards: 32
        schema: v13
        store: tsdb
  server:
    grpc_listen_port: 9095
    grpc_server_max_concurrent_streams: 2000
    grpc_server_max_recv_msg_size: 90971520
    grpc_server_max_send_msg_size: 90971520
    http_listen_port: 3100
    http_server_idle_timeout: 20m
    http_server_read_timeout: 10m
    http_server_write_timeout: 10m
    log_level: debug
  storage_config:
    bloom_shipper:
      working_directory: /var/loki/data/blooms
    boltdb_shipper:
      index_gateway_client:
        server_address: ""
    hedging:
      at: 250ms
      max_per_second: 20
      up_to: 3
    named_stores:
      aws:
        s3-multibucket:
          access_key_id: ${S3_ACCESS_KEY}
          bucketnames: ${S3_MULTIBUCKET}
          endpoint: ${S3_ENDPOINT}
          http_config:
            insecure_skip_verify: true
          region: eu-west-1
          s3forcepathstyle: true
          secret_access_key: ${S3_SECRET_ACCESS_KEY}
    tsdb_shipper:
      index_gateway_client:
        server_address: dns+loki-backend-headless.observability.svc.cluster.local:9095
  tracing:
    enabled: false
Hi @chaudum, were you able to reproduce? I can try some things out if you think this might be a configuration issue.
@vladst3f I was able to reproduce the issue and pushed a fix: https://github.com/grafana/loki/pull/15994
cheers @chaudum, much appreciated