fluentd
Threads being blocked in forward input plugin
Describe the bug
We have a configuration where fluent-bit and fluentd agents send logs to fluentd aggregators; the aggregators forward the logs either to Loki or to a fluentd-file instance, which saves them to disk. We observed that the aggregators stop working after some time. We have tested both multi-threaded and single-threaded configurations and see the same behavior in both. A thread becomes blocked after some time (CPU utilization is stuck at 100% for that thread and we no longer see any new events in its metrics), and once all threads are blocked, the instance stops responding to connections entirely. We have generated a sigdump for this issue:
sigdump.txt sigdump_reproduced.txt
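For reference, the "no new events in metrics for this thread" symptom can be detected automatically by comparing two scrapes of the per-worker record counter that the Prometheus plugins in the config below expose. A minimal sketch (the metric name matches our config; the exposition-format parsing is deliberately simplified, and the sample scrapes are synthetic):

```python
import re

# Parse fluentd_output_status_num_records_total samples out of a Prometheus
# text-format scrape, keyed by the label set (one series per worker).
def parse_counters(scrape: str) -> dict:
    counters = {}
    for line in scrape.splitlines():
        m = re.match(r'fluentd_output_status_num_records_total({[^}]*})?\s+(\S+)', line)
        if m:
            counters[m.group(1) or ""] = float(m.group(2))
    return counters

# A series whose counter did not move between two scrapes is a candidate
# for a blocked thread (combine with per-thread CPU usage to confirm).
def stalled_series(before: str, after: str) -> list:
    a, b = parse_counters(before), parse_counters(after)
    return [labels for labels, v in a.items() if b.get(labels, v) == v]

# Synthetic scrapes taken 10s apart: worker_id="1" has stopped counting.
t0 = '''fluentd_output_status_num_records_total{worker_id="0"} 100
fluentd_output_status_num_records_total{worker_id="1"} 200'''
t1 = '''fluentd_output_status_num_records_total{worker_id="0"} 150
fluentd_output_status_num_records_total{worker_id="1"} 200'''
print(stalled_series(t0, t1))  # only the worker_id="1" series is flagged
```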
To Reproduce
We tried to reproduce this issue in another environment with exactly the same configuration and software stack, but most attempts were unsuccessful. We managed to reproduce it only once, and only briefly, with the multi-threaded configuration.
Expected behavior
Threads should not become blocked over time.
Your Environment
- Fluentd version: 1.14.6
- TD Agent version: 3.8.1
- Operating system: Alpine Linux 3.13
- Kernel version: 4.15.0-176-generic
Your Configuration
Fluentd.conf:
<system>
workers 5
</system>
# Input source
<source>
@id input_source
@type forward
port 24224
skip_invalid_event true
<transport tcp>
linger_timeout 1
</transport>
</source>
<label @FLUENT_LOG>
<match fluent.**>
@type stdout
</match>
</label>
# Monitoring
<source>
@type prometheus
bind 0.0.0.0
port 24231
metrics_path /metrics
</source>
<source>
@type prometheus_output_monitor
interval 10
<labels>
host ${hostname}
</labels>
</source>
<filter goto.** kube.** syslog.** audit.**>
@type prometheus
<metric>
name fluentd_input_status_num_records_total
type counter
desc The total number of incoming records
</metric>
</filter>
<match kube.**>
@type copy
<store ignore_error>
@id fluent-bit.kube.file
@type forward
send_timeout 3s
connect_timeout 1s
# primary host
<server>
host fluentd-file-0
port 24224
</server>
# use file buffer to buffer events on disks.
<buffer>
@type memory
flush_interval 10s
flush_thread_count 8
retry_max_interval 30
retry_forever true
chunk_limit_size 200MB
</buffer>
recover_wait 5s
tls_insecure_mode true
slow_flush_log_threshold 60.0
</store>
<store ignore_error>
# grafana-loki
@type loki
url "http://#{ENV['LOKI_URL']}"
extra_labels {"type":"kubernetes"}
line_format json
extract_kubernetes_labels true
remove_keys kubernetes,stream,true,cluster,time,host
<buffer>
@type memory
flush_interval 10s
retry_max_interval 30
retry_max_times 5
retry_wait 1
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
<label>
container_name $.kubernetes.container_name
namespace_name $.kubernetes.namespace_name
pod_name $.kubernetes.pod_name
host $.host
cluster $.cluster
</label>
slow_flush_log_threshold 60.0
</store>
<store>
@type prometheus
<metric>
name fluentd_output_status_num_records_total
type counter
desc The total number of outgoing records
</metric>
</store>
</match>
<match syslog.**>
@type copy
<store ignore_error>
@id fluent-bit.syslog.file
@type forward
# primary host
<server>
host fluentd-file-0
port 24224
</server>
send_timeout 3s
connect_timeout 1s
# use file buffer to buffer events on disks.
<buffer>
@type memory
flush_interval 10s
flush_thread_interval 10s
flush_thread_count 8
retry_max_interval 30
retry_forever true
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
recover_wait 5s
tls_insecure_mode true
slow_flush_log_threshold 60.0
</store>
<store ignore_error>
# grafana-loki
@type loki
url "http://#{ENV['LOKI_URL']}"
extra_labels {"type":"syslog"}
line_format json
extract_kubernetes_labels true
remove_keys kubernetes,stream,true,cluster,time,host
<buffer>
@type memory
flush_interval 10s
retry_max_interval 30
retry_max_times 5
retry_wait 1
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
<label>
host $.host
cluster $.cluster
</label>
slow_flush_log_threshold 60.0
</store>
<store>
@type prometheus
<metric>
name fluentd_output_status_num_records_total
type counter
desc The total number of outgoing records
</metric>
</store>
</match>
<match audit.**>
@type copy
<store ignore_error>
@id fluent-bit.audit.file
@type forward
# primary host
<server>
host fluentd-file-0
port 24224
</server>
send_timeout 3s
connect_timeout 1s
# use file buffer to buffer events on disks.
<buffer>
@type memory
flush_interval 10s
flush_thread_interval 10s
flush_thread_count 8
retry_max_interval 30
retry_forever true
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
recover_wait 5s
tls_insecure_mode true
slow_flush_log_threshold 60.0
</store>
<store ignore_error>
# grafana-loki
@type loki
url "http://#{ENV['LOKI_URL']}"
extra_labels {"type":"audit"}
line_format json
extract_kubernetes_labels true
remove_keys kubernetes,stream,true,cluster,time,host
<buffer>
@type memory
flush_interval 10s
retry_max_interval 30
retry_max_times 5
retry_wait 1
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
<label>
host $.host
cluster $.cluster
</label>
slow_flush_log_threshold 60.0
</store>
<store>
@type prometheus
<metric>
name fluentd_output_status_num_records_total
type counter
desc The total number of outgoing records
</metric>
</store>
</match>
# Only file
<filter goto.{flat,json}.noaudit.file.**>
@type record_transformer
<record>
cluster ${tag_parts[5]}
type ${tag_parts[4]}
</record>
</filter>
<match goto.{flat,json}.noaudit.file.kubernetes.**>
@type copy
<store ignore_error>
@id noaudit.file.kube
@type forward
send_timeout 3s
connect_timeout 1s
# primary host
<server>
host fluentd-file-0
port 24224
</server>
# use file buffer to buffer events on disks.
<buffer>
@type memory
flush_interval 10s
flush_thread_interval 10s
flush_thread_count 8
retry_max_interval 30
retry_forever true
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
recover_wait 5s
tls_insecure_mode true
slow_flush_log_threshold 60.0
</store>
<store ignore_error>
# grafana-loki
@type loki
url "http://#{ENV['LOKI_URL']}"
line_format json
remove_keys kubernetes,stream,true,time,host,tag,docker,cluster,enable_ruby
<buffer>
@type memory
flush_interval 10s
retry_max_interval 30
retry_max_times 5
retry_wait 1
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
<label>
container_name $.kubernetes.container_name
namespace_name $.kubernetes.namespace_name
pod_name $.kubernetes.pod_name
cluster $.cluster
type $.type
</label>
slow_flush_log_threshold 60.0
</store>
<store>
@type prometheus
<metric>
name fluentd_output_status_num_records_total
type counter
desc The total number of outgoing records
</metric>
</store>
</match>
# Only file
<match goto.{flat,json}.noaudit.file.**>
@type copy
<store ignore_error>
@id noaudit.file.rest
@type forward
# primary host
<server>
host fluentd-file-0
port 24224
</server>
send_timeout 3s
connect_timeout 1s
# use file buffer to buffer events on disks.
<buffer>
@type memory
flush_interval 10s
flush_thread_interval 10s
flush_thread_count 8
retry_max_interval 30
retry_forever true
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
recover_wait 5s
tls_insecure_mode true
slow_flush_log_threshold 60.0
</store>
<store ignore_error>
# grafana-loki
@type loki
url "http://#{ENV['LOKI_URL']}"
line_format json
remove_keys kubernetes,stream,true,time,host,tag,docker,cluster,enable_ruby
<buffer>
@type memory
flush_interval 10s
retry_max_interval 30
retry_max_times 5
retry_wait 1
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
<label>
host $.docker.container_hostname
container_name $.docker.name
cluster $.cluster
type $.type
</label>
slow_flush_log_threshold 60.0
</store>
<store>
@type prometheus
<metric>
name fluentd_output_status_num_records_total
type counter
desc The total number of outgoing records
</metric>
</store>
</match>
# Only audit and file
<match goto.{flat,json}.audit.file.**>
@type copy
<store ignore_error>
@id audit.file.audit
@type forward
# primary host
<server>
host fluentd-audit-0
port 24224
</server>
send_timeout 3s
connect_timeout 1s
# use file buffer to buffer events on disks.
<buffer>
@type memory
flush_interval 10s
flush_thread_interval 10s
flush_thread_count 8
retry_max_interval 30
retry_forever true
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
recover_wait 5s
tls_insecure_mode true
slow_flush_log_threshold 60.0
</store>
<store ignore_error>
@id audit.file
@type forward
# primary host
<server>
host fluentd-file-0
port 24224
</server>
send_timeout 3s
connect_timeout 1s
# use file buffer to buffer events on disks.
<buffer>
@type memory
flush_interval 10s
flush_thread_interval 10s
flush_thread_count 8
retry_max_interval 30
retry_forever true
chunk_limit_size 200MB
flush_at_shutdown true
</buffer>
recover_wait 5s
tls_insecure_mode true
slow_flush_log_threshold 60.0
</store>
<store>
@type prometheus
<metric>
name fluentd_output_status_num_records_total
type counter
desc The total number of outgoing records
</metric>
</store>
</match>
# All other
<match **>
@id default_match
@type stdout
</match>
Your Error Log
SIGDUMP attached in the description.
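For anyone trying to reproduce this: fluentd depends on the sigdump gem, which installs a SIGCONT handler that writes a thread backtrace dump to `/tmp/sigdump-<pid>.log`. A rough sketch of how such dumps can be collected, assuming the worker processes match `pgrep -f fluentd`:

```shell
# Send SIGCONT to every fluentd process; sigdump (a fluentd dependency)
# writes a thread/backtrace dump to /tmp/sigdump-<pid>.log for each one.
# SIGCONT is harmless for processes without a sigdump handler installed.
for pid in $(pgrep -f fluentd || true); do
  kill -CONT "$pid" 2>/dev/null || true
done
```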
Additional context
No response
Switching agents from fluentd to fluent-bit on the cluster with the highest load (8-10k events/s) helped. Fluentd agent config: fluentd_agent.txt
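For context, the replacement fluent-bit agents only need a forward output pointing at the same aggregator port 24224 used above. A minimal sketch (the attached fluentd_agent.txt is not reproduced here; the host name is a placeholder):

```ini
# Minimal fluent-bit forward output (assumed shape of the replacement config)
[OUTPUT]
    Name    forward
    Match   *
    Host    fluentd-aggregator   # placeholder for the actual aggregator host
    Port    24224
```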
This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.
This issue was automatically closed after remaining stale for 30 days.