
Bloom Builder is seemingly stuck unable to finish any tasks

Open · zarbis opened this issue 5 months ago · 1 comment

Describe the bug
I've upgraded Loki to v3.2.0 in the hope that the new planner/builder architecture will finally make blooms work for me. I can see that blooms somewhat work by observing activity in the loki_blooms_created_total and loki_bloom_gateway_querier_chunks_filtered_total metrics, which confirms the write and read paths respectively.
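
For reference, this is roughly how I'm watching those two metrics, sketched as Prometheus recording rules (only the metric names come from my setup; the rule names and aggregation are illustrative):

groups:
  - name: loki-bloom-observability
    rules:
      # Write path: rate at which blooms are being created
      - record: loki:blooms_created:rate5m
        expr: sum(rate(loki_blooms_created_total[5m]))
      # Read path: rate at which the bloom gateway filters chunks for queriers
      - record: loki:bloom_gateway_querier_chunks_filtered:rate5m
        expr: sum(rate(loki_bloom_gateway_querier_chunks_filtered_total[5m]))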

But my problem is: the builders never finish crunching... something. Some graphs to illustrate my point: I ran a day-long experiment with an HPA with maxReplicas=100 for bloom-builder, and it stayed pegged at maximum the whole time until I gave up. Image

Despite a constant number of BUSY builder replicas, the number of created blooms dropped significantly after an hour or so. Image

This correlates quite well with the number of inflight tasks as seen by the planner: it quickly crunches through most of the backlog, but then gets stuck on the last couple percent of tasks. Image

This is also reflected in the planner logs. Here is a particular line I've noticed:

level=error ts=2024-09-25T09:29:09.612662713Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=986 tasksFailed=38
level=error ts=2024-09-25T09:29:09.612673171Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=1024 tasksSucceed=982 tasksFailed=42
level=error ts=2024-09-25T11:07:15.61504996Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=42 tasksSucceed=15 tasksFailed=27
level=error ts=2024-09-25T11:07:15.615066562Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=38 tasksSucceed=12 tasksFailed=26
level=error ts=2024-09-25T12:27:26.153443515Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T12:27:26.153484978Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=21 tasksSucceed=0 tasksFailed=21
level=error ts=2024-09-25T13:34:14.48146035Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=20 tasksSucceed=0 tasksFailed=20
level=error ts=2024-09-25T13:34:14.481488344Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=0 tasksFailed=16
level=error ts=2024-09-25T16:38:09.108673794Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=17 tasksSucceed=2 tasksFailed=15
level=error ts=2024-09-25T16:38:09.108700267Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=16 tasksSucceed=1 tasksFailed=15
level=error ts=2024-09-25T17:53:09.826561108Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T17:53:09.826607695Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T18:00:15.54654397Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14
level=error ts=2024-09-25T18:00:15.546538932Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889067817Z caller=planner.go:342 component=bloom-planner table=loki_index_19990 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=13 tasksSucceed=0 tasksFailed=13
level=error ts=2024-09-25T19:06:15.889094442Z caller=planner.go:342 component=bloom-planner table=loki_index_19989 tenant=<REDACTED> msg="not all tasks succeeded for tenant table" tasks=14 tasksSucceed=0 tasksFailed=14

At some point tasks just cannot succeed anymore, and the planner keeps re-queueing them: Image

On the builder side I see this log line, which corresponds to the planner's:

level=error ts=2024-09-25T08:19:50.762194822Z caller=builder.go:147 component=bloom-builder builder_id=7c20b7dd-abf8-404f-a880-9457ee5c7f1d msg="failed to connect and build. Retrying" err="builder loop failed: failed to notify task completion to planner: failed to acknowledge task completion to planner: EOF"

So there seems to be some miscommunication between the builder and the planner:

  1. the planner creates a task
  2. the builder processes it, keeping CPU busy and the HPA pegged
  3. the builder fails to report task completion to the planner
  4. the planner re-queues the task
  5. I'm stuck in an infinite loop of wasted resources (the settings I plan to rule out are sketched right after this list)
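
My working theory is that something closes the long-lived gRPC connection between the builder and the planner before the completion report makes it through, which would explain the EOF. A minimal sketch of the planner-side server settings I plan to make explicit to rule that out; these are standard Loki server block options (extending the server section already in my values), but treating them as the culprit is purely an assumption on my part:

server:
  # already raised globally in my values; listed again only because an
  # oversized task-completion message could also show up as a dropped
  # connection (EOF) on the builder side
  grpc_server_max_recv_msg_size: 16777216
  grpc_server_max_send_msg_size: 16777216
  # keepalive / connection-age settings for the gRPC server; the values
  # are just what I intend to experiment with, not a known fix
  grpc_server_max_connection_idle: 10m
  grpc_server_max_connection_age: 1h
  grpc_server_max_connection_age_grace: 5m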

To Reproduce
If needed, I can provide complete pod logs and the full Loki configuration.

Relevant part of Helm values.yaml:

loki:
  deploymentMode: Distributed
  loki:
    ### FOR BLOOM BUILDER ###
    containerSecurityContext:
      readOnlyRootFilesystem: false
    image:
      tag: 3.2.0
    ### FOR BLOOM BUILDER ###
    limits_config:
      retention_period: 8760h # 365 days
      max_query_lookback: 8760h # 365 days
      query_timeout: 10m
      ingestion_rate_mb: 30
      ingestion_burst_size_mb: 100
      max_global_streams_per_user: 60000
      split_queries_by_interval: 30m # default = 30m
      tsdb_max_query_parallelism: 1024 # default = 512
      bloom_creation_enabled: true
      bloom_split_series_keyspace_by: 1024
      bloom_gateway_enable_filtering: true
      allow_structured_metadata: true
    storage_config:
      tsdb_shipper:
        active_index_directory: /var/loki/tsdb-index
        cache_location: /var/loki/tsdb-cache
    distributor:
      otlp_config:
        default_resource_attributes_as_index_labels:
        - service.name
        # - service.namespace
        # - service.instance.id
        # - deployment.environment
        # - cloud.region
        # - cloud.availability_zone
        - k8s.cluster.name
        - k8s.namespace.name
        # - k8s.pod.name
        # - k8s.container.name
        # - container.name
        # - k8s.replicaset.name
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        # - k8s.job.name
    compactor:
      retention_enabled: true
      delete_request_store: s3
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768 # default = 100
    querier:
      max_concurrent: 16 # default = 10
    server:
      grpc_server_max_recv_msg_size: 16777216 # default = 4194304
      grpc_server_max_send_msg_size: 16777216 # default = 4194304
    # Things that are not yet in main config template
    structuredConfig:
      bloom_build:
        enabled: true
        planner:
          planning_interval: 6h
        builder:
          planner_address: loki-bloom-planner-headless.<namespace>.svc.cluster.local.:9095
      bloom_gateway:
        enabled: true
        client:
          addresses: dnssrvnoa+_grpc._tcp.loki-bloom-gateway-headless.<namespace>.svc.cluster.local.

  bloomPlanner:
    extraArgs:
    - -log.level=debug
    replicas: 1
    resources:
      requests:
        cpu: 100m
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
    - <<: *tolerate-arm64

  bloomBuilder:
    extraArgs:
    - -log.level=debug
    autoscaling:
      enabled: true
      minReplicas: 1
      maxReplicas: 10
    resources:
      requests:
        cpu: 2
        memory: 2Gi
      limits:
        memory: 2Gi
    extraVolumes:
      - name: blooms
        emptyDir: {}
    extraVolumeMounts:
      - name: blooms
        mountPath: /var/loki/blooms
    nodeSelector:
      zone: us-east-1a
    tolerations:
    - <<: *tolerate-arm64

  bloomGateway:
    extraArgs:
    - -log.level=debug
    replicas: 2
    resources:
      requests:
        cpu: 1
        memory: 1Gi
      limits:
        memory: 1Gi
    nodeSelector:
      zone: us-east-1a
    tolerations:
    - <<: *tolerate-arm64

Expected behavior
After the initial bloom build, the builders stabilize at much lower resource consumption.

Environment:

  • Infrastructure: EKS 1.29
  • Deployment tool: Helm 6.12.0


zarbis · Sep 26 '24