
scalable single binary memory leak

Open · LeonD9 opened this issue 2 years ago · 7 comments

Describe the bug
When running the Tempo scalable single binary as a StatefulSet on Kubernetes with large traffic (100k+ spans per second), one or two pods start leaking memory while the rest have average memory usage.


To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo scalable single binary version 1.5 as a StatefulSet in k8s with the config below
  2. Start OpenTelemetry Collector version 0.58.0 as a Deployment in k8s with the config below
  3. Run an application that writes a large number of spans per second, enough to reach 100k+ spans

Expected behavior
All pods have the same memory usage (traffic appears to be balanced).

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm
  • 5 pods of Tempo scalable single binary version 1.5, running as a StatefulSet
  • 5 pods of OpenTelemetry Collector version 0.58.0, running as a Deployment

Additional Context
This is the configuration I use for Tempo:

multitenancy_enabled: false

search_enabled: true

metrics_generator_enabled: true

compactor:
  ring:
    kvstore:
      store: memberlist
    wait_stability_min_duration: 10s
  compaction:
    block_retention: 168h
    compacted_block_retention: 1h
    flush_size_bytes: 5242880
    compaction_cycle: 5s
    compaction_window: 5m
    iterator_buffer_size: 100

distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"
  log_received_spans:
    enabled: true
    include_all_attributes: true
    filter_by_status_error: false
  search_tags_deny_list:
    - "os.type"
    - "os.description"
    - "process.command_line"
    - "k8s.pod.uid"
    - "otel.library.name"
    - "net.peer.ip"

ingester:
  trace_idle_period: 1s
  flush_check_period: 1s
  lifecycler:
    ring:
      kvstore:
        store: memberlist
    heartbeat_period: 100ms
  complete_block_timeout: 200ms
  max_block_duration: 1s

server:
  grpc_server_max_send_msg_size: 20000000
  grpc_server_max_recv_msg_size: 20000000
  http_listen_port: 3100
  log_level: info

metrics_generator:
  registry:
    collection_interval: 1s
  storage:
    path: /var/tempo
    remote_write_flush_deadline: 5s
    remote_write:
      - url: http://prometheus-server.kube-system.svc.cluster.local:9090/api/v1/write
        send_exemplars: true
  processor:
    service_graphs:
      max_items: 100000

memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
  - tempo-headless.kube-system.svc.cluster.local:7946

storage:
  trace:
    backend: s3
    s3:
      bucket: example
      endpoint: s3.example-1.amazonaws.com
      region: example-1
      hedge_requests_at: 500ms
      hedge_requests_up_to: 2
    search:
      chunk_size_bytes: 500000
    wal:
      path: /var/tempo/wal
    pool:
      max_workers: 1000
      queue_depth: 100000

querier:
  query_timeout: 2m
  frontend_worker:
    frontend_address: tempo-headless.kube-system.svc.cluster.local:9095
  search:
    query_timeout: 2m
    external_hedge_requests_at: 500ms
    external_hedge_requests_up_to: 4

query_frontend:
  query_shards: 100
  search:
    max_duration: 0
    query_backend_after: 200ms
    query_ingesters_until: 200ms
    target_bytes_per_job: 419430400

overrides:
  max_traces_per_user: 0
  max_search_bytes_per_trace: 0
  max_bytes_per_trace: 50000000
  ingestion_burst_size_bytes: 200000000
  ingestion_rate_limit_bytes: 150000000
  metrics_generator_processors: ['service-graphs', 'span-metrics']

OpenTelemetry Collector config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        keepalive:
          server_parameters:
            max_connection_idle: 10s
            max_connection_age: 20s
            max_connection_age_grace: 20s
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 60
    spike_limit_percentage: 20
  batch: {}
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false

    extract:
      metadata:
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
      - k8s.deployment.name

    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection

  filter:
    spans:
      exclude:
        match_type: strict
        span_names:
          - /metrics

exporters:
  otlp/tempo:
    endpoint: dns:///tempo-headless.kube-system.svc.cluster.local:4317
    tls:
      insecure: true
      insecure_skip_verify: true
    balancer_name: "round_robin"

extensions:
  health_check:
     endpoint: 0.0.0.0:13133
  memory_ballast:
    size_in_percentage: 20

service:
  pipelines:
    traces:
      receivers:
      - otlp
      processors:
      - memory_limiter
      - k8sattributes
      - filter
      - batch 
      exporters:
      - otlp/tempo
  extensions:
  - health_check
  - memory_ballast
  telemetry:
    logs:
      level: debug
    metrics:
      level: detailed
      address: 0.0.0.0:8888

Ingester push latency during the test: (screenshot)

LeonD9 · Aug 29 '22 10:08

That's a tough one. Let's look into reasons why one of your 5 instances may be having issues.

  • Are there any log messages that indicate something happened when the memory bump occurred? Perhaps that shard began compacting some blocks?
  • Tenant index building should be fairly light work, but it is one of the few asymmetric processes in Tempo. Can you check to see if your problem shard is also building the tenant index?
  • Your config looks like you are running RF1. You can likely use bytes received to check if that ingester received a particularly large trace. It does require a lot of memory to combine large traces.
  • We have seen similar elevated memory situations in k8s when there are other pods on the same node that are consuming a large amount of CPU. Can you cordon the node, delete the Tempo pod (causing it to be rescheduled somewhere else), and see if the issue returns?

What other patterns are you observing that may give us clues as to what's going on?

  • Does memory continue to grow or does it spike to a value and hover?
  • Does this always happen a certain amount of time after starting up the cluster and sending it traffic?
  • Does the memory ever come back down?

If these questions don't lead us anywhere the next step will be to grab a memory profile from a "good" node and your one with elevated memory.

joe-elliott · Aug 29 '22 14:08

@joe-elliott The spans I use for testing are from https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/cmd/tracegen so they are pretty small.

I tried moving Tempo to specific nodes without any other pods and it seems to work better, as you suggested, but two pods still crashed with OOM when ingester push latency (p99) increased from 3ms to 80ms (screenshot).

The memory comes back down after a few restarts (I suppose once it manages to flush the backlog of spans).

The memory spikes in two pods and then returns to normal after a few restarts (screenshot). It happened when the span rate rose from 180k per second to 250k per second.

As far as I can see, the traffic is the same for each pod, so there shouldn't be a reason for two pods to get OOM killed while the rest don't.

The CPU usage reaches 3 cores (I set the limit to 4 cores): (screenshot)

I see this in the logs as well:

level=debug ts=2022-09-13T09:52:56.948237143Z caller=grpc_logging.go:46 duration=88.00726ms method=/tempopb.Pusher/PushBytesV2 msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:56.957450729Z caller=grpc_logging.go:46 method=/tempopb.Pusher/PushBytesV2 duration=292.975917ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:56.957732331Z caller=grpc_logging.go:46 method=/tempopb.Pusher/PushBytesV2 duration=89.710202ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:56.964219008Z caller=grpc_logging.go:46 method=/tempopb.Pusher/PushBytesV2 duration=216.424919ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:56.964960303Z caller=grpc_logging.go:46 duration=107.006746ms method=/tempopb.Pusher/PushBytesV2 msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.045788132Z caller=grpc_logging.go:46 method=/tempopb.MetricsGenerator/PushSpans duration=98.823837ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.052089117Z caller=grpc_logging.go:46 method=/tempopb.Pusher/PushBytesV2 duration=111.846331ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.053286665Z caller=grpc_logging.go:46 method=/tempopb.Pusher/PushBytesV2 duration=9.938571ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.05538586Z caller=grpc_logging.go:46 duration=8.868563ms method=/tempopb.MetricsGenerator/PushSpans msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.057480325Z caller=grpc_logging.go:46 duration=13.407956ms method=/tempopb.Pusher/PushBytesV2 msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.141855049Z caller=grpc_logging.go:46 method=/tempopb.MetricsGenerator/PushSpans duration=98.454635ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.14480785Z caller=grpc_logging.go:46 method=/tempopb.Pusher/PushBytesV2 duration=101.142374ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.148701838Z caller=grpc_logging.go:46 method=/tempopb.MetricsGenerator/PushSpans duration=93.176827ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.153317901Z caller=grpc_logging.go:46 method=/tempopb.Pusher/PushBytesV2 duration=188.103386ms msg="gRPC (success)"
level=debug ts=2022-09-13T09:52:57.154177687Z caller=memberlist_client.go:842 msg="CAS attempt failed" err="no change detected" retry=true

Is there any other information I can provide to find the root cause of this issue?

LeonD9 · Sep 13 '22 08:09

So a spike in latency to the ingesters should spike memory in the distributors. In this case pushes are just getting queued up by the Go gRPC clients. This is not intentional queueing in Tempo; it's just a consequence of the slowdown between distributors -> ingesters. Although, I'll say that memory jump is quite large. Perhaps tuning this setting down will help the application shed load better:

distributor:
  ingester_client:
    remote_timeout: 100ms  # this defaults to 5s, which is really high

I'd be interested in tempo_distributor_ingester_append_failures_total. This metric shows whether a distributor in one pod is having issues sending to one of the ingesters. Network slowdown between two of the pods may be why these issues are occurring.
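If it's easier to watch this continuously, here is a rough sketch of a Prometheus alerting rule for it (the rule file layout and the pod label are assumptions about your scrape setup):

# Hypothetical Prometheus rule file; adjust the grouping label to your scrape config.
groups:
  - name: tempo-distributor
    rules:
      - alert: TempoDistributorIngesterAppendFailures
        # fires when any distributor pod keeps failing appends to the ingesters
        expr: sum by (pod) (rate(tempo_distributor_ingester_append_failures_total[1m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Distributor {{ $labels.pod }} is failing appends to ingesters"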

Also, we're tracking a potential memory issue in the distributors when the metrics generator queue backs up. Perhaps try disabling that component and see if you continue to see issues?
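If you want to try that quickly, something like this should do it (a sketch based on the config posted above; either change should stop spans from being forwarded to the metrics generator):

metrics_generator_enabled: false

overrides:
  # alternatively, leave it enabled but clear the per-tenant processors
  metrics_generator_processors: []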

Finally, this may take a bit more work, but the fully distributed mode would help limit the blast radius when components have issues and may also help us diagnose which component is the one actually blowing up. I know that is likely not a great answer since it's clear you are trying to get the scalable single binary to work, but it's worth mentioning.
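For reference, a rough sketch of what that looks like per component (the tempo-distributed helm chart generates this for you; each component runs as its own deployment/statefulset started with -target=<component> against a shared config file):

# rough sketch only: one deployment/statefulset per component
containers:
  - name: distributor
    image: grafana/tempo:1.5.0
    args:
      - -target=distributor        # likewise ingester, querier, query-frontend, compactor, metrics-generator
      - -config.file=/conf/tempo.yaml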

joe-elliott · Sep 14 '22 18:09

Indeed, we are observing a similar issue, but in microservices mode, which allows us to identify that the distributor eats memory and gets OOM killed.

Many thanks to @joe-elliott for the suggestions on what to look at. We already figured out that the reason might be the metrics generator and its queue, which indeed fills up right before the issue happens.

So if you have these metrics, can you please check whether tempo_distributor_forwarder_queue_length spikes before the memory spike?

mac2000 · Sep 15 '22 06:09

I checked the metrics you suggested before the memory spikes, and these are the results (screenshots):

  • tempo_distributor_ingester_append_failures_total
  • tempo_distributor_forwarder_queue_length
  • tempo_distributor_metrics_generator_pushes_failures_total

After that I changed remote_timeout to 100ms and metrics_generator_forwarder_queue_size to 50 (screenshots):

  • tempo_distributor_ingester_append_failures_total
  • tempo_distributor_forwarder_queue_length
  • tempo_distributor_metrics_generator_pushes_failures_total

After the changes it seems that I started receiving more errors like this:

level=info ts=2022-09-15T10:02:54.051710111Z caller=flush.go:300 msg="flushing block" userid=single-tenant block=e1d8fdbc-2446-4558-adb5-b2b64dcce7ab
level=info ts=2022-09-15T10:02:54.344289832Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=89568b70-5ab4-4a6c-bb3b-c0d392b72f3a
level=error ts=2022-09-15T10:02:54.756081785Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2022-09-15T10:02:54.857396304Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=info ts=2022-09-15T10:02:55.993470567Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=1c36ffc1-3b6a-4213-b11a-345d51e3c21f
level=info ts=2022-09-15T10:02:56.39963233Z caller=flush.go:168 msg="head block cut. enqueueing flush op" userid=single-tenant block=17f1e6a0-88f5-4749-8be5-0fc9db6a2829

And more spans are being refused: (screenshot)

Does it mean the ingester_client remote_timeout needs to be increased to avoid these errors?

Also, before getting OOM killed it seems the CPU usage reaches the limit: (screenshot)

LeonD9 · Sep 15 '22 09:09

Does it mean the ingester_client remote_timeout needs to be increased to avoid these errors?

Yes. If you're using RF3 then Tempo only needs 2 of 3 writes to the ingesters to succeed for the distributors to return success. However, that "pusher failed..." error message indicates that 2 or more failed. Increasing that timeout should reduce these errors.
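Something in this direction (the value here is just a middle ground between the 100ms you tried and the 5s default, not a recommendation):

distributor:
  ingester_client:
    remote_timeout: 2s  # tune to your observed push latency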

Also, before getting OOM killed it seems the CPU usage reaches the limit:

If CPU is being throttled then I believe it is expected for a Go application to OOM. It's basically an indication that you are exceeding the resources given to it. I would be interested in this metric to see how hard the GC is working: rate(go_gc_duration_seconds_count{}[1m]). It's possible you are simply pushing your cluster to the edge of what it can do.

Also, with Prometheus metrics, most things with a _total suffix should be wrapped in rate(). So it would be more useful to look at rate(tempo_distributor_ingester_append_failures_total[1m]) than the raw metric. In this case I think we can see easily enough what's happening, though.

joe-elliott · Sep 15 '22 13:09

@joe-elliott What do you mean by RF3? Also, if the distributors fail writing to the ingesters, do they have a retry mechanism to avoid losing spans? Or how can I know why they failed writing to them?

Yeah, I think it does get to the edge this time (I ran 5 Tempo pods with 4 cores and 4GB of memory each). I got to 250k spans per second and then 1-2 pods started dying by reaching the CPU limit. Shouldn't all pods die together when the cluster limit is reached?
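For reference, this is roughly the per-pod limits block I described above (simplified; requests aren't shown because they weren't part of the discussion):

resources:
  limits:
    cpu: "4"
    memory: 4Gi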

These are the GC metrics: (screenshot)

If I want to support a larger number of spans, should I increase the number of pods or the resources per pod? Or is there no difference from Tempo's point of view?

LeonD9 · Sep 15 '22 13:09

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] · Nov 19 '22 00:11