
Threads being blocked in forward input plugin

Open Kamil10p opened this issue 2 years ago • 2 comments

Describe the bug

We have a configuration where fluent-bit + fluentd agents send logs to fluentd aggregators; the aggregators forward the logs to Loki or to fluentd-file, which saves the logs to files. We observed that the aggregators stop working after some time. We have tested both multi-threaded and single-threaded configurations and see the same behavior in both. A thread gets blocked after some time (CPU utilization is stuck at 100% for that thread and no new events appear in the metrics for it), and once all threads are blocked, the instance stops responding to any connection at all. We have generated a sigdump for this issue:

sigdump.txt sigdump_reproduced.txt

To Reproduce

We tried to reproduce this issue in another environment with exactly the same configuration and software stack, but most attempts were unsuccessful. We reproduced it only once, and only briefly, with the multi-threaded configuration.

Expected behavior

Threads do not get blocked over time.

Your Environment

- Fluentd version: 1.14.6
- TD Agent version: 3.8.1
- Operating system: Alpine Linux 3.13
- Kernel version: 4.15.0-176-generic

Your Configuration

Fluentd.conf:
<system>
  workers 5
</system>

# Input source
<source>
  @id input_source
  @type forward
  port 24224
  skip_invalid_event true
  <transport tcp>
    linger_timeout 1
  </transport>
</source>

<label @FLUENT_LOG>
  <match fluent.**>
    @type stdout
  </match>
</label>

# Monitoring
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
  interval 10
  <labels>
    host ${hostname}
  </labels>
</source>

<filter goto.** kube.** syslog.** audit.**>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
  </metric>
</filter>

<match kube.**>
  @type copy
  <store ignore_error>
    @id fluent-bit.kube.file
    @type forward
    send_timeout 3s
    connect_timeout 1s
    # primary host
    <server>
      host fluentd-file-0
      port 24224
    </server>
    

    # use memory buffer to buffer events.
    <buffer>
      @type memory
      flush_interval 10s
      flush_thread_count 8
      retry_max_interval 30
      retry_forever true
      chunk_limit_size 200MB
    </buffer>

    recover_wait 5s
    tls_insecure_mode true
    slow_flush_log_threshold 60.0
  </store>
  <store ignore_error>
    # grafana-loki
    @type loki
    url "http://#{ENV['LOKI_URL']}"
    extra_labels {"type":"kubernetes"}
    line_format json
    extract_kubernetes_labels true
    remove_keys kubernetes,stream,true,cluster,time,host
    <buffer>
      @type memory
      flush_interval 10s
      retry_max_interval 30
      retry_max_times 5
      retry_wait 1
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>
    <label>
      container_name $.kubernetes.container_name
      namespace_name $.kubernetes.namespace_name
      pod_name $.kubernetes.pod_name
      host $.host
      cluster $.cluster
    </label>
    slow_flush_log_threshold 60.0
  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
    </metric>
  </store>
</match>

<match syslog.**>
  @type copy
  <store ignore_error>
    @id fluent-bit.syslog.file
    @type forward
    # primary host
    <server>
      host fluentd-file-0
      port 24224
    </server>
    send_timeout 3s
    connect_timeout 1s
    

    # use memory buffer to buffer events.
    <buffer>
      @type memory
      flush_interval 10s
      flush_thread_interval 10s
      flush_thread_count 8
      retry_max_interval 30
      retry_forever true
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>

    recover_wait 5s
    tls_insecure_mode true
    slow_flush_log_threshold 60.0
  </store>
  <store ignore_error>
    # grafana-loki
    @type loki
    url "http://#{ENV['LOKI_URL']}"
    extra_labels {"type":"syslog"}
    line_format json
    extract_kubernetes_labels true
    remove_keys kubernetes,stream,true,cluster,time,host
    <buffer>
      @type memory
      flush_interval 10s
      retry_max_interval 30
      retry_max_times 5
      retry_wait 1
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>
    <label>
      host $.host
      cluster $.cluster
    </label>
    slow_flush_log_threshold 60.0
  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
    </metric>
  </store>
</match>

<match audit.**>
  @type copy
  <store ignore_error>
    @id fluent-bit.audit.file
    @type forward
    # primary host
    <server>
      host fluentd-file-0
      port 24224
    </server>
    send_timeout 3s
    connect_timeout 1s
    

    # use memory buffer to buffer events.
    <buffer>
      @type memory
      flush_interval 10s
      flush_thread_interval 10s
      flush_thread_count 8
      retry_max_interval 30
      retry_forever true
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>

    recover_wait 5s
    tls_insecure_mode true
    slow_flush_log_threshold 60.0
  </store>
  <store ignore_error>
    # grafana-loki
    @type loki
    url "http://#{ENV['LOKI_URL']}"
    extra_labels {"type":"audit"}
    line_format json
    extract_kubernetes_labels true
    remove_keys kubernetes,stream,true,cluster,time,host
    <buffer>
      @type memory
      flush_interval 10s
      retry_max_interval 30
      retry_max_times 5
      retry_wait 1
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>
    <label>
      host $.host
      cluster $.cluster
    </label>
    slow_flush_log_threshold 60.0
  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
    </metric>
  </store>
</match>


# Only file
<filter goto.{flat,json}.noaudit.file.**>
  @type record_transformer
  <record>
    cluster ${tag_parts[5]}
    type ${tag_parts[4]}
  </record>
</filter>

<match goto.{flat,json}.noaudit.file.kubernetes.**>
  @type copy
  <store ignore_error>
    @id noaudit.file.kube
    @type forward
    send_timeout 3s
    connect_timeout 1s
    # primary host
    <server>
      host fluentd-file-0
      port 24224
    </server>
    

    # use memory buffer to buffer events.
    <buffer>
      @type memory
      flush_interval 10s
      flush_thread_interval 10s
      flush_thread_count 8
      retry_max_interval 30
      retry_forever true
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>

    recover_wait 5s
    tls_insecure_mode true
    slow_flush_log_threshold 60.0
  </store>
  <store ignore_error>
    # grafana-loki
    @type loki
    url "http://#{ENV['LOKI_URL']}"
    line_format json
    remove_keys kubernetes,stream,true,time,host,tag,docker,cluster,enable_ruby
    <buffer>
      @type memory
      flush_interval 10s
      retry_max_interval 30
      retry_max_times 5
      retry_wait 1
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>
    <label>
      container_name $.kubernetes.container_name
      namespace_name $.kubernetes.namespace_name
      pod_name $.kubernetes.pod_name
      cluster $.cluster
      type $.type
    </label>
    slow_flush_log_threshold 60.0
  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
    </metric>
  </store>
</match>
# Only file
<match goto.{flat,json}.noaudit.file.**>
  @type copy
  <store ignore_error>
    @id noaudit.file.rest
    @type forward
    # primary host
    <server>
      host fluentd-file-0
      port 24224
    </server>
    send_timeout 3s
    connect_timeout 1s
    

    # use memory buffer to buffer events.
    <buffer>
      @type memory
      flush_interval 10s
      flush_thread_interval 10s
      flush_thread_count 8
      retry_max_interval 30
      retry_forever true
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>

    recover_wait 5s
    tls_insecure_mode true
    slow_flush_log_threshold 60.0
  </store>
  <store ignore_error>
    # grafana-loki
    @type loki
    url "http://#{ENV['LOKI_URL']}"
    line_format json
    remove_keys kubernetes,stream,true,time,host,tag,docker,cluster,enable_ruby
    <buffer>
      @type memory
      flush_interval 10s
      retry_max_interval 30
      retry_max_times 5
      retry_wait 1
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>
    <label>
      host $.docker.container_hostname
      container_name $.docker.name
      cluster $.cluster
      type $.type
    </label>
    slow_flush_log_threshold 60.0
  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
    </metric>
  </store>
</match>



# Only audit and file
<match goto.{flat,json}.audit.file.**>
  @type copy
  <store ignore_error>
    @id audit.file.audit
    @type forward
    # primary host
    <server>
      host fluentd-audit-0
      port 24224
    </server>
    send_timeout 3s
    connect_timeout 1s
    

    # use memory buffer to buffer events.
    <buffer>
      @type memory
      flush_interval 10s
      flush_thread_interval 10s
      flush_thread_count 8
      retry_max_interval 30
      retry_forever true
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>

    recover_wait 5s
    tls_insecure_mode true
    slow_flush_log_threshold 60.0

  </store>
  <store ignore_error>
    @id audit.file
    @type forward
    # primary host
    <server>
      host fluentd-file-0
      port 24224
    </server>
    send_timeout 3s
    connect_timeout 1s
    

    # use memory buffer to buffer events.
    <buffer>
      @type memory
      flush_interval 10s
      flush_thread_interval 10s
      flush_thread_count 8
      retry_max_interval 30
      retry_forever true
      chunk_limit_size 200MB
      flush_at_shutdown true
    </buffer>

    recover_wait 5s
    tls_insecure_mode true
    slow_flush_log_threshold 60.0

  </store>
  <store>
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
    </metric>
  </store>
</match>


# All other
<match **>
  @id default_match
  @type stdout
</match>
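
The comments in the configuration above speak of buffering events on disk, but every `<buffer>` section uses `@type memory`, which is lost on a crash and competes for RAM with the blocked threads. If disk buffering is actually intended, a file buffer sketch would look roughly like this (the `path` value is a hypothetical example and must be unique per output plugin):

```
<buffer>
  @type file
  # hypothetical path; each buffering plugin needs its own directory
  path /var/log/fluent/buffer/kube
  flush_interval 10s
  flush_thread_count 8
  retry_max_interval 30
  retry_forever true
  chunk_limit_size 200MB
  flush_at_shutdown true
</buffer>
```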

Your Error Log

SIGDUMP attached in description.

Additional context

No response

Kamil10p avatar May 11 '22 09:05 Kamil10p

Switching agents from fluentd to fluent-bit on the cluster with the highest load (8-10k events/s) helped. Fluentd agent config: fluentd_agent.txt
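
For reference, a minimal fluent-bit forward output pointing at an aggregator looks roughly like the sketch below (the host name is a placeholder standing in for the actual aggregator address):

```
[OUTPUT]
    # placeholder host; replace with the aggregator's address
    Name     forward
    Match    *
    Host     fluentd-aggregator
    Port     24224
```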

Kamil10p avatar May 13 '22 06:05 Kamil10p

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

github-actions[bot] avatar Aug 11 '22 10:08 github-actions[bot]

This issue was automatically closed because it remained stale for 30 days.

github-actions[bot] avatar Sep 11 '22 10:09 github-actions[bot]