fluentd
Timeout flush: tail.containers.var.log.containers
Describe the bug
Dear friends, we have run into an interesting bug with Fluentd. We have a Kubernetes cluster with a Fluentd agent running on each node. Fluentd collects container logs by tailing the /var/log/containers directory with the tail plugin and then sends them to our Splunk system. When the system generates a huge amount of logs, Fluentd writes an error to its log: "Timeout flush: tail.containers.var.log.containers". At the same time, the Ruby process consumes about 100% of one CPU core, and logs stop being sent to our Splunk system. Can you fix this bug? Or do you have a guide on how to set up Fluentd in a heavy-load environment?
To Reproduce
Run fluentd with the tail plugin on a Kubernetes cluster and generate a huge amount of logs (about 20k lines per second).
Expected behavior
Fluentd should run smoothly under heavy workload and not stop sending logs.
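As a sketch of the reproduction step (the line format and helper name are assumptions, not taken from the original report), a generator producing docker-style JSON log lines like those written under /var/log/containers:

```python
# Minimal load-generator sketch for the reproduction step above.
# Assumptions: docker-style JSON log lines; 20k lines approximates
# one second of the reported load.
import json


def generate_lines(n, stream="stdout"):
    """Yield n docker-style JSON log lines, as written under /var/log/containers."""
    for i in range(n):
        yield json.dumps({
            "log": f"load-test line {i}\n",
            "stream": stream,
            "time": "2022-05-26T16:52:00.000000000Z",
        })


# Appending these lines to a tailed file in a loop approximates the reported load.
lines = list(generate_lines(20_000))
```

Repeating this once per second against a file that the tail plugin follows should reproduce roughly the described rate.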
Your Environment
- Fluentd version: fluentd 1.14.6
- TD Agent version: -
- Operating system: Red Hat Enterprise Linux release 8.5 (Ootpa)
- Kernel version: Linux 4.18.0-305.30.1.el8_4.x86_64
Your Configuration
#Events are emitted to the CONCAT label from the container, file and journald sources for multiline processing.
<label @CONCAT>
# = concat xml and partial logs =
<filter tail.containers.**>
@type concat
key log
partial_key logtag
partial_value P
separator ''
</filter>
<filter tail.containers.**>
@type concat
key log
multiline_start_regexp /^(\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{0,3})/
flush_interval 5s
separator ""
timeout_label @SPLUNK
</filter>
# = filters for container logs =
<filter tail.containers.var.log.containers.dns-controller*dns-controller*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-dns*sidecar*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-dns*dnsmasq*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-apiserver*kube-apiserver*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-controller-manager*kube-controller-manager*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-dns-autoscaler*autoscaler*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-proxy*kube-proxy*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-scheduler*kube-scheduler*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
<filter tail.containers.var.log.containers.kube-dns*kubedns*.log>
@type concat
key log
timeout_label @SPLUNK
stream_identity_key stream
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
separator ""
use_first_timestamp true
</filter>
# = filters for journald logs =
<filter journald.kube:kubelet>
@type concat
key log
timeout_label @SPLUNK
multiline_start_regexp /^\w[0-1]\d[0-3]\d/
flush_interval 5s
</filter>
# Events are relabeled then emitted to the SPLUNK label
<match **>
@type relabel
@label @SPLUNK
</match>
</label>
<label @SPLUNK>
# Enrich log with k8s metadata
<filter tail.containers.**>
@type kubernetes_metadata
annotation_match [ "^splunk\.com" ]
de_dot false
</filter>
<filter tail.containers.**>
@type record_transformer
enable_ruby
<record>
# set the sourcetype from splunk.com/sourcetype pod annotation or set it to kube:container:CONTAINER_NAME
sourcetype ${record.dig("kubernetes", "annotations", "splunk.com/sourcetype") ? "kube:"+record.dig("kubernetes", "annotations", "splunk.com/sourcetype") : "kube:container:"+record.dig("kubernetes","container_name")}
container_name ${record.dig("kubernetes","container_name")}
namespace ${record.dig("kubernetes","namespace_name")}
pod ${record.dig("kubernetes","pod_name")}
container_id ${record.dig("docker","container_id")}
pod_uid ${record.dig("kubernetes","pod_id")}
container_image ${record.dig("kubernetes","container_image")}
# set the cluster_name field to the configured value, or default to "cluster_name"
cluster_name "ocp-stage"
# set the index field to the value found in the pod splunk.com/index annotations. if not set, use namespace annotation, or default to the default_index
splunk_index ${record.dig("kubernetes", "annotations", "splunk.com/index") ? record.dig("kubernetes", "annotations", "splunk.com/index") : record.dig("kubernetes", "namespace_annotations", "splunk.com/index") ? (record["kubernetes"]["namespace_annotations"]["splunk.com/index"]) : ("ocp-stage_infra")}
label_app ${record.dig("kubernetes","labels","app")}
label_k8s-app ${record.dig("kubernetes","labels","k8s-app")}
label_release ${record.dig("kubernetes","labels","release")}
label_deploymentconfig ${record.dig("kubernetes","labels","deploymentconfig")}
blacklist ${record.dig("kubernetes", "annotations", "splunk.com/exclude") ? record.dig("kubernetes", "annotations", "splunk.com/exclude") : record.dig("kubernetes", "namespace_annotations", "splunk.com/exclude") ? (record["kubernetes"]["namespace_annotations"]["splunk.com/exclude"]) : ("false")}
</record>
</filter>
<filter tail.containers.**>
# Exclude all logs that are blacklisted
@type grep
<exclude>
key blacklist
pattern /^true$/
</exclude>
</filter>
# extract pod_uid and container_name for CRIO runtime
<filter tail.containers.var.log.pods.**>
@type jq_transformer
jq '.record | . + (.source | capture("/var/log/pods/(?<pod_uid>[^/]+)/(?<container_name>[^/]+)/(?<container_retry>[0-9]+).log")) | .sourcetype = ("kube:container:" + .container_name) | .splunk_index = "ocp-stage_infra"'
</filter>
# create source and sourcetype
<filter journald.**>
@type jq_transformer
jq '.record.source = "/var/log/journal/" + .record.source | .record.sourcetype = (.tag | ltrimstr("journald.")) | .record.cluster_name = "ocp-stage" | .record.splunk_index = "ocp-stage_infra" | .record'
</filter>
# = filters for non-container log files =
# = audit.log =
<filter tail.rhcos:auditd>
@type jq_transformer
jq '.record.sourcetype = "rhcos:auditd" | .record.cluster_name = "ocp-stage" | .record.splunk_index = "auditd_ocp_elec" | .record'
</filter>
# extract sourcetype
<filter tail.file.**>
@type jq_transformer
jq '.record.sourcetype = (.tag | ltrimstr("tail.file.")) | .record.cluster_name = "ocp-stage" | .record.splunk_index = "ocp-stage_infra" | .record'
</filter>
# = filters for monitor agent =
<filter monitor_agent>
@type jq_transformer
jq ".record.source = \"namespace:#{ENV['MY_NAMESPACE']}/pod:#{ENV['MY_POD_NAME']}\" | .record.sourcetype = \"fluentd:monitor-agent\" | .record.cluster_name = \"ocp-stage\" | .record.splunk_index = \"ocp-stage_infra\" | .record"
</filter>
# = custom filters specified by users =
<filter **>
@type record_modifier
# set UTF-8 encoding information to string.
char_encoding utf-8
</filter>
# = output =
<match **>
@type splunk_hec
protocol https
index_key splunk_index
hec_host "splunk-hec.gu.local"
hec_port 443
hec_token "#{ENV['SPLUNK_HEC_TOKEN']}"
host "#{ENV['NODE_NAME']}"
coerce_to_utf8 true
#non_utf8_replacement_string "NONEUTF8"
source_key source
sourcetype_key sourcetype
insecure_ssl true
<fields>
# currently CRI does not produce log paths with all the necessary
# metadata to parse out pod, namespace, container_name, container_id.
# this may be resolved in the future by this issue: https://github.com/kubernetes/kubernetes/issues/58638#issuecomment-385126031
container_retry
pod_uid
pod
container_name
namespace
container_id
cluster_name
label_app
label_deploymentconfig
label_k8s-app
label_release
</fields>
<buffer>
@type memory
chunk_limit_records 100000
chunk_limit_size 200m
flush_interval 5s
flush_thread_count 1
overflow_action block
retry_max_times 3
total_limit_size 600m
</buffer>
<format monitor_agent>
@type json
</format>
<format>
# we just want to keep the raw logs, not the structure created by docker or journald
@type single_value
message_key log
add_newline false
</format>
</match>
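One knob in the output section above that may matter under this load: the buffer uses a single flush thread (`flush_thread_count 1`), so all HEC delivery competes with everything else in one thread. A commonly suggested tuning, as a sketch only (the value is illustrative, not a verified fix for this report), is to raise the flush parallelism:

```
<buffer>
  @type memory
  chunk_limit_records 100000
  chunk_limit_size 200m
  flush_interval 5s
  flush_thread_count 4       # was 1; illustrative value
  overflow_action block
  retry_max_times 3
  total_limit_size 600m
</buffer>
```

Note that with `overflow_action block`, a buffer that cannot drain fast enough stalls the input side, which matches the observed "logs stop sending" symptom.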
Your Error Log
2022-05-26 16:51:53 +0000 [info]: #0 Timeout flush: journald.kube:kubelet:default
2022-05-26 16:52:00 +0000 [info]: #0 [containers.log] detected rotation of /var/log/containers/logs-loadtestv1-5b446f974f-ddsvv_logs-loadtest_centos79-tools-374d2ea5bde65b78795ab66db899187b8690702b777146d2340a9a7156b9dd88.log; waiting 5 seconds
2022-05-26 16:52:00 +0000 [info]: #0 [containers.log] following tail of /var/log/containers/logs-loadtestv1-5b446f974f-ddsvv_logs-loadtest_centos79-tools-374d2ea5bde65b78795ab66db899187b8690702b777146d2340a9a7156b9dd88.log
2022-05-26 16:52:04 +0000 [info]: #0 Timeout flush: tail.containers.var.log.containers.logs-loadtestv1-5b446f974f-8pv65_logs-loadtest_centos79-tools-2c7caf05f425f4109620561725917d4558e60901cb85a5e1b6ca1f2988e9390e.log:default
2022-05-26 16:52:05 +0000 [info]: #0 stats - namespace_cache_size: 6, pod_cache_size: 1000, pod_cache_watch_updates: 11269, pod_cache_watch_misses: 82514, pod_cache_watch_delete_ignored: 6603, pod_cache_watch_ignored: 6669, namespace_cache_watch_updates: 1189, namespace_cache_api_updates: 95, pod_cache_api_updates: 122, id_cache_miss: 95, namespace_cache_watch_misses: 2, namespace_cache_watch_ignored: 2, pod_cache_host_updates: 4671, pod_cache_miss: 3
2022-05-26 16:52:06 +0000 [info]: #0 [containers.log] detected rotation of /var/log/containers/logs-loadtestv1-5b446f974f-2skzp_logs-loadtest_centos79-tools-9e81eaa01c06fb1641054876af9658d0104bcb63c86a1f9228a0e487f6d93f22.log; waiting 5 seconds
2022-05-26 16:52:06 +0000 [info]: #0 [containers.log] following tail of /var/log/containers/logs-loadtestv1-5b446f974f-2skzp_logs-loadtest_centos79-tools-9e81eaa01c06fb1641054876af9658d0104bcb63c86a1f9228a0e487f6d93f22.log
2022-05-26 16:52:08 +0000 [info]: #0 Timeout flush: tail.containers.var.log.containers.logs-loadtestv1-5b446f974f-ddsvv_logs-loadtest_centos79-tools-374d2ea5bde65b78795ab66db899187b8690702b777146d2340a9a7156b9dd88.log:default
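For context, our reading of this log (hedged, not confirmed by the maintainers): the `Timeout flush: <tag>:<stream>` lines appear to be info-level messages from fluent-plugin-concat's timeout timer, not from in_tail itself. When a stream produces no further line within `flush_interval`, the buffered partial event is flushed to the label named in `timeout_label`. A sketch of lengthening that window for the broad container filter (the value is illustrative):

```
<filter tail.containers.**>
  @type concat
  key log
  multiline_start_regexp /^(\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{0,3})/
  flush_interval 30s   # illustrative; was 5s
  separator ""
  timeout_label @SPLUNK
</filter>
```

A longer window produces fewer timeout flushes per second, at the cost of higher latency for trailing multiline events.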
Additional context
No response
This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.
This issue was automatically closed because it remained stale for 30 days.