Bloom Builder is seemingly stuck, unable to finish any tasks
Describe the bug
I've upgraded Loki to v3.2.0 in the hope that the new planner/builder architecture will finally make blooms work for me.
I can see that blooms are at least partially working by observing activity in the loki_blooms_created_total and loki_bloom_gateway_querier_chunks_filtered_total metrics, which confirms the write and read paths respectively.
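For reference, the activity in those metrics can be graphed with simple rate queries like these (the 5m window is arbitrary):

```promql
# Write path: blooms being created by the builders
sum(rate(loki_blooms_created_total[5m]))

# Read path: chunks filtered out by the bloom gateways
sum(rate(loki_bloom_gateway_querier_chunks_filtered_total[5m]))
```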
But my problem is that the builders never finish crunching... something.
Some graphs to illustrate my point:
I ran a day-long experiment. I had an HPA with maxReplicas=100 for bloom-builder, and it was maxed out the whole time until I gave up.
Despite a constant number of BUSY builder replicas, the number of created blooms dropped significantly after an hour or so. This correlates quite well with the number of in-flight tasks from the planner's point of view: it quickly crunches through most of the backlog, but then gets stuck on the last couple of percent of tasks.
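(For reference, the builder replica count can be graphed from kube-state-metrics; the HPA object name below depends on the Helm release name, so treat it as a placeholder:)

```promql
# Number of bloom-builder replicas the HPA is currently running
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="loki-bloom-builder"}
```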
This is also reflected in the planner logs. In particular, I noticed these lines:
level=error ts=2024-09-25T09:29:09.612662713Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=986 tasksFailed=38
level=error ts=2024-09-25T09:29:09.612673171Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=982 tasksFailed=42
level=error ts=2024-09-25T11:07:15.61504996Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=42 tasksSucceed=15 tasksFailed=27
level=error ts=2024-09-25T11:07:15.615066562Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=38 tasksSucceed=12 tasksFailed=26
level=error ts=2024-09-25T12:27:26.153443515Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T12:27:26.153484978Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=21 tasksSucceed=0 tasksFailed=21
level=error ts=2024-09-25T13:34:14.48146035Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T13:34:14.481488344Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=0 tasksFailed=16
level=error ts=2024-09-25T16:38:09.108673794Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=17 tasksSucceed=2 tasksFailed=15
level=error ts=2024-09-25T16:38:09.108700267Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=1 tasksFailed=15
level=error ts=2024-09-25T17:53:09.826561108Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T17:53:09.826607695Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T18:00:15.54654397Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T18:00:15.546538932Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889067817Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889094442Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
At some point tasks simply cannot succeed anymore, and they keep getting re-queued:
On the builder side, I noticed this log line, which corresponds to the planner's:
level=error ts=2024-09-25T08:19:50.762194822Z caller=builder.go:147 component=bloom-builder builder_id=7c20b7dd-abf8-404f-a880-9457ee5c7f1d msg="failed to connect and build. Retrying" err="builder loop failed: failed to notify task completion to planner: failed to acknowledge task completion to planner: EOF"
So there seems to be some miscommunication between the builder and the planner:
- the planner creates the task
- the builder processes it, keeping the CPU busy and the HPA maxed out
- the builder fails to report the task completion to the planner
- the planner re-queues the task
- and I'm left in an infinite loop of wasted resources
To Reproduce
If needed, I can provide complete pod logs and the Loki configuration.
Relevant part of the Helm values.yaml:
loki:
  deploymentMode: Distributed
  loki:
    ### FOR BLOOM BUILDER ###
    containerSecurityContext:
      readOnlyRootFilesystem: false
    image:
      tag: 3.2.0
    ### FOR BLOOM BUILDER ###
    limits_config:
      retention_period: 8760h # 365 days
      max_query_lookback: 8760h # 365 days
      query_timeout: 10m
      ingestion_rate_mb: 30
      ingestion_burst_size_mb: 100
      max_global_streams_per_user: 60000
      split_queries_by_interval: 30m # default = 30m
      tsdb_max_query_parallelism: 1024 # default = 512
      bloom_creation_enabled: true
      bloom_split_series_keyspace_by: 1024
      bloom_gateway_enable_filtering: true
      allow_structured_metadata: true
    storage_config:
      tsdb_shipper:
        active_index_directory: /var/loki/tsdb-index
        cache_location: /var/loki/tsdb-cache
    distributor:
      otlp_config:
        default_resource_attributes_as_index_labels:
          - service.name
          # - service.namespace
          # - service.instance.id
          # - deployment.environment
          # - cloud.region
          # - cloud.availability_zone
          - k8s.cluster.name
          - k8s.namespace.name
          # - k8s.pod.name
          # - k8s.container.name
          # - container.name
          # - k8s.replicaset.name
          - k8s.deployment.name
          - k8s.statefulset.name
          - k8s.daemonset.name
          - k8s.cronjob.name
          # - k8s.job.name
    compactor:
      retention_enabled: true
      delete_request_store: s3
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768 # default = 100
    querier:
      max_concurrent: 16 # default = 10
    server:
      grpc_server_max_recv_msg_size: 16777216 # default = 4194304
      grpc_server_max_send_msg_size: 16777216 # default = 4194304
    # Things that are not yet in main config template
    structuredConfig:
      bloom_build:
        enabled: true
        planner:
          planning_interval: 6h
        builder:
          planner_address: loki-bloom-planner-headless.<namespace>.svc.cluster.local.:9095
      bloom_gateway:
        enabled: true
        client:
          addresses: dnssrvnoa+_grpc._tcp.loki-bloom-gateway-headless.<namespace>.svc.cluster.local.
  bloomPlanner:
    extraArgs:
      - -log.level=debug
    replicas: 1
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
      - <<: *tolerate-arm64
  bloomBuilder:
    extraArgs:
      - -log.level=debug
    autoscaling:
      enabled: true
      minReplicas: 1
      maxReplicas: 10
    resources:
      requests:
        cpu: 2
        memory: 2Gi
      limits:
        memory: 2Gi
    extraVolumes:
      - name: blooms
        emptyDir: {}
    extraVolumeMounts:
      - name: blooms
        mountPath: /var/loki/blooms
    nodeSelector:
      zone: us-east-1a
    tolerations:
      - <<: *tolerate-arm64
  bloomGateway:
    extraArgs:
      - -log.level=debug
    replicas: 2
    resources:
      requests:
        cpu: 1
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
      - <<: *tolerate-arm64
Expected behavior
After the initial bloom building, builders stabilize at a much lower resource consumption.
Environment:
- Infrastructure: EKS 1.29
- Deployment tool: Helm 6.12.0