Bloom Builder is seemingly stuck, unable to finish any tasks
Describe the bug
I've upgraded Loki to v3.2.0 in hopes that the new planner/builder architecture will finally make blooms work for me.
I can see that blooms somewhat work by observing activity in the loki_blooms_created_total and loki_bloom_gateway_querier_chunks_filtered_total metrics, confirming the write and read paths respectively.
But my problem is: builders never finish crunching... something.
Some graphs to illustrate my point:
I ran a day-long experiment.
I had an HPA with maxReplicas=100 for the bloom-builder, and it was peaking the whole time until I gave up.
Despite a constant number of BUSY builder replicas, the number of created blooms dropped significantly after an hour or so.
This correlates quite well with the number of inflight tasks from the planner's point of view: it quickly crunches through most of the backlog, but then gets stuck on the last couple percent of tasks.
This is also reflected in the planner logs. This is a particular line I've noticed:
level=error ts=2024-09-25T09:29:09.612662713Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=986 tasksFailed=38
level=error ts=2024-09-25T09:29:09.612673171Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=982 tasksFailed=42
level=error ts=2024-09-25T11:07:15.61504996Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=42 tasksSucceed=15 tasksFailed=27
level=error ts=2024-09-25T11:07:15.615066562Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=38 tasksSucceed=12 tasksFailed=26
level=error ts=2024-09-25T12:27:26.153443515Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T12:27:26.153484978Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=21 tasksSucceed=0 tasksFailed=21
level=error ts=2024-09-25T13:34:14.48146035Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T13:34:14.481488344Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=0 tasksFailed=16
level=error ts=2024-09-25T16:38:09.108673794Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=17 tasksSucceed=2 tasksFailed=15
level=error ts=2024-09-25T16:38:09.108700267Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=1 tasksFailed=15
level=error ts=2024-09-25T17:53:09.826561108Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T17:53:09.826607695Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T18:00:15.54654397Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T18:00:15.546538932Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889067817Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889094442Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
At some point tasks just cannot succeed anymore, and the planner keeps re-queueing them:
On the builder side I notice this log line, which corresponds to the planner's:
level=error ts=2024-09-25T08:19:50.762194822Z caller=builder.go:147 component=bloom-builder builder_id=7c20b7dd-abf8-404f-a880-9457ee5c7f1d msg="failed to connect and build. Retrying" err="builder loop failed: failed to notify task completion to planner: failed to acknowledge task completion to planner: EOF"
So there seems to be some miscommunication between the builder and the planner:
- the planner creates a task
- the builder processes it, keeping the CPU busy and the HPA peaking
- the builder fails to report task completion to the planner
- the planner re-queues the task
- I'm stuck in an infinite loop of wasted resources
To Reproduce
If needed, I can provide complete pod logs and Loki configuration.
Relevant part of Helm values.yaml:
loki:
  deploymentMode: Distributed
  loki:
    ### FOR BLOOM BUILDER ###
    containerSecurityContext:
      readOnlyRootFilesystem: false
    image:
      tag: 3.2.0
    ### FOR BLOOM BUILDER ###
    limits_config:
      retention_period: 8760h # 365 days
      max_query_lookback: 8760h # 365 days
      query_timeout: 10m
      ingestion_rate_mb: 30
      ingestion_burst_size_mb: 100
      max_global_streams_per_user: 60000
      split_queries_by_interval: 30m # default = 30m
      tsdb_max_query_parallelism: 1024 # default = 512
      bloom_creation_enabled: true
      bloom_split_series_keyspace_by: 1024
      bloom_gateway_enable_filtering: true
      allow_structured_metadata: true
    storage_config:
      tsdb_shipper:
        active_index_directory: /var/loki/tsdb-index
        cache_location: /var/loki/tsdb-cache
    distributor:
      otlp_config:
        default_resource_attributes_as_index_labels:
          - service.name
          # - service.namespace
          # - service.instance.id
          # - deployment.environment
          # - cloud.region
          # - cloud.availability_zone
          - k8s.cluster.name
          - k8s.namespace.name
          # - k8s.pod.name
          # - k8s.container.name
          # - container.name
          # - k8s.replicaset.name
          - k8s.deployment.name
          - k8s.statefulset.name
          - k8s.daemonset.name
          - k8s.cronjob.name
          # - k8s.job.name
    compactor:
      retention_enabled: true
      delete_request_store: s3
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768 # default = 100
    querier:
      max_concurrent: 16 # default = 10
    server:
      grpc_server_max_recv_msg_size: 16777216 # default = 4194304
      grpc_server_max_send_msg_size: 16777216 # default = 4194304
    # Things that are not yet in main config template
    structuredConfig:
      bloom_build:
        enabled: true
        planner:
          planning_interval: 6h
        builder:
          planner_address: loki-bloom-planner-headless.<namespace>.svc.cluster.local.:9095
      bloom_gateway:
        enabled: true
        client:
          addresses: dnssrvnoa+_grpc._tcp.loki-bloom-gateway-headless.<namespace>.svc.cluster.local.
  bloomPlanner:
    extraArgs:
      - -log.level=debug
    replicas: 1
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
      - <<: *tolerate-arm64
  bloomBuilder:
    extraArgs:
      - -log.level=debug
    autoscaling:
      enabled: true
      minReplicas: 1
      maxReplicas: 10
    resources:
      requests:
        cpu: 2
        memory: 2Gi
      limits:
        memory: 2Gi
    extraVolumes:
      - name: blooms
        emptyDir: {}
    extraVolumeMounts:
      - name: blooms
        mountPath: /var/loki/blooms
    nodeSelector:
      zone: us-east-1a
    tolerations:
      - <<: *tolerate-arm64
  bloomGateway:
    extraArgs:
      - -log.level=debug
    replicas: 2
    resources:
      requests:
        cpu: 1
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
      - <<: *tolerate-arm64
Expected behavior
After the initial bloom build, builders stabilize at much lower resource consumption.
Environment:
- Infrastructure: EKS 1.29
- Deployment tool: Helm 6.12.0
Hey @zarbis thanks for trying out the experimental bloom filter feature and reporting the issue. Did you see any restarts of the bloom-planner component while the builder processed the task?
@salvacorts Do you wanna take a look?
@chaudum I checked back over that day: zero restarts for the planner.
@zarbis besides unexpected restarts due to errors or OOMs, what about restarts due to k8s scheduling?
Can you verify the pod was up (and on the same node) all the way through the planning process?
@salvacorts sorry for the late reply. No, the pod was changing nodes during the day. I run the monitoring cluster fully on spot instances, and up until now the LGTM stack has had zero problems with that.
This is a graph of the bloom-planner pod changing nodes throughout the build process (almost a full day).
The value is node age in minutes.
Hope it helps.
Those node changes are what's making those tasks fail. The way I have in mind to resolve this is:
- Store the task queue state in storage, so on restarts we restore it from there.
- Backoff-retry in the builders when reporting task completion while the planner is down (see the sketch below).
Unfortunately I don't think we will be implementing this behaviour in the short term unless we see this problem more often.
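For the second point, a minimal sketch of what the builder-side retry could look like, assuming the backoff helper from github.com/grafana/dskit/backoff that Loki already depends on; notifyTaskCompletion and its signature are hypothetical placeholders for the builder's call that reports a finished task to the planner:

package main

import (
    "context"
    "errors"
    "time"

    "github.com/grafana/dskit/backoff"
)

// notifyTaskCompletion is a hypothetical stand-in for the builder's call that
// reports a finished task back to the planner.
func notifyTaskCompletion(ctx context.Context, taskID string) error {
    return errors.New("planner unavailable") // placeholder failure
}

// notifyWithBackoff keeps retrying the completion notification instead of
// failing the whole builder loop the first time the planner is unreachable.
func notifyWithBackoff(ctx context.Context, taskID string) error {
    retry := backoff.New(ctx, backoff.Config{
        MinBackoff: 1 * time.Second,
        MaxBackoff: 1 * time.Minute,
        MaxRetries: 10, // 0 means retry until the context is cancelled
    })

    var lastErr error
    for retry.Ongoing() {
        lastErr = notifyTaskCompletion(ctx, taskID)
        if lastErr == nil {
            return nil
        }
        retry.Wait() // exponential backoff between attempts
    }
    if lastErr != nil {
        return lastErr
    }
    return retry.Err()
}

func main() {
    _ = notifyWithBackoff(context.Background(), "task-1")
}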
I run monitoring cluster fully on spots
Consider running the planner on non-spot nodes to mitigate this issue (running other Loki components, e.g. ingesters, on spot nodes is also not ideal).
Thank you, will try that.
But I wonder how that is a problem, especially for ingesters, which have persistence AND redundancy in the form of replication factor=3?
Hey @salvacorts, I'm back with the results.
So I ran the bloom-planner on an on-demand node and made sure it didn't restart during the test.
Metrics
This is an overview screenshot of the process:
The different graphs don't quite fit in a single panel, so I will go over them one by one, but at least you can see the exact queries.
I started with 10 builders and bumped it to 50 to finish the test today:
The completion rate grew initially with the additional replicas but soon fizzled out:
The requeue rate has two spikes: after the initial start and after I added fresh builders:
Interestingly, loki_blooms_created_total is zero the entire time (maybe it's outdated and I should look elsewhere?):
Eventually (notice the time range zoomed to 15 minutes) the number of inflight tasks gets stuck at roughly the number of builder replicas, which keeps them perfectly occupied with... something:
Logs
I've enabled debug logs for both the planner and the builders, but there is just too much output, so I will start with error-level logs.
Planner
There are only two types of errors I see on the planner side:
caller=planner.go:1007 msg="error forwarding task to builder, Task requeued" retries=2 err="error receiving response from builder (7b778cb5-ac3e-4df9-bc07-0521d6e7e45d): rpc error: code = Canceled desc = context canceled"
caller=dns.go:78 msg="failed to resolve server addresses" err="lookup IP addresses \"loki-index-gateway-headless.centralized-monitoring.svc.cluster.local\": lookup loki-index-gateway-headless.centralized-monitoring.svc.cluster.local: i/o timeout"
Builder
There is more variety in builders' logs:
Some generic errors:
caller=builder.go:147 msg="failed to connect and build. Retrying" err="builder loop failed: failed to notify task completion to planner: failed to acknowledge task completion to planner: EOF"
caller=builder.go:445 msg="failed to generate bloom" err="failed to build bloom block: iterating store: context canceled"
Notice how this error resurfaces even though I'm super confident the planner was available the whole time.
Some consistency errors:
caller=memchunk.go:492 msg="block offset does not match expected one, will try reading with expected offset" actual=204708 expected=155229
caller=memchunk.go:489 msg="invalid block found" err="invalid chunk checksum"
Some network errors:
caller=memcached_client_selector.go:98 msg="could not resolve TCP address of server" server=loki-chunks-cache-0.loki-chunks-cache.centralized-monitoring.svc.cluster.local.:11211 err="lookup loki-chunks-cache-0.loki-chunks-cache.centralized-monitoring.svc.cluster.local. on 10.100.0.10:53: no such host"
caller=resolver.go:100 msg="failed to lookup SRV records" host=_memcached-client._tcp.loki-chunks-cache.centralized-monitoring.svc err="lookup _memcached-client._tcp.loki-chunks-cache.centralized-monitoring.svc on 10.100.0.10:53: no such host"
caller=dns.go:78 msg="failed to resolve server addresses" err="lookup IP addresses \"loki-index-gateway-headless.centralized-monitoring.svc.cluster.local\": lookup loki-index-gateway-headless.centralized-monitoring.svc.cluster.local: i/o timeout"
So the end result is currently the same: no matter how many builders I throw at the task, they find some infinite loop to work on.
Update
I left this setup running overnight, and with enough retries it managed to process all the work and scale down naturally:
@salvacorts did you get a chance to look into this? Maybe I can provide more info?
In general, I've run bloom generation for several days, and every time I observe these extremely long tails, where the remaining 1% of tasks take ~80% of the total time to complete. I've stopped the process after several days since it's a huge waste of resources for me at the moment, but I hope I can resume using this long-awaited feature.
@zarbis We recently merged https://github.com/grafana/loki/pull/14988, which I think should help here. You'll need to configure bloom_build_builder_response_timeout for the tenant. You may end up with some failing tasks, though.
In addition to the above, we introduced a new planning strategy that should yield more evenly spread tasks in terms of the amount of data they need to process. You'll need to set bloom_planning_strategy: 'split_by_series_chunks_size' for your tenant.
These changes are available in main and will be available in the next Loki release.
@salvacorts can I set both values globally in loki.limits_config? And what is a good value for the timeout? This is the time I expect a single task to take, so when the planner generates 4k tasks and most of them finish in 4 hours (so 1k/hour), should I set the timeout to the average rate of 3600/1000 seconds times some margin, like 30 seconds?
can I set both values globally in loki.limits_config
Yes
so when the planner generates 4k tasks and most of them finish in 4 hours (so 1k/hour), should I set the timeout to the average rate of 3600/1000 seconds times some margin, like 30 seconds?
No, some tasks may take longer than others regardless of the average time. I'd recommend setting whatever feels long enough to be worth cancelling and retrying, e.g. 15 minutes.
I was able to complete a bloom building run in a roughly linear fashion, without the extremely long tail this time,
with the following settings:
loki:
  image:
    tag: main-479b44e
  limits_config:
    bloom_build_builder_response_timeout: 15m
    bloom_planning_strategy: split_by_series_chunks_size
Although not all tasks succeeded:
caller=planner.go:477 component=bloom-planner table=loki_index_20046 tenant=<TENANT> msg="all tasks completed for tenant table" tasks=49 tasksSucceed=41 originalMetas=24 newMetas=41
caller=planner.go:540 component=bloom-planner table=loki_index_20046 tenant=<TENANT> phase=building msg="found outdated metas" outdated=14 upToDate=51
caller=planner.go:597 component=bloom-planner table=loki_index_20046 tenant=<TENANT> phase=building msg="deleted outdated metas and blocks" metas=14 blocks=15
caller=planner.go:338 component=bloom-planner table=loki_index_20046 tenant=<TENANT> msg="not all tasks succeeded for tenant table" tasks=49 tasksSucceed=41 tasksFailed=8
caller=planner.go:357 component=bloom-planner msg="bloom build iteration completed" tasks=80 tasksSucceed=53 duration=6907.993171493