
Timeout flush: tail.containers.var.log.containers

fedextm opened this issue 2 years ago · 1 comment

Describe the bug

Hello. We have run into an interesting bug with Fluentd. We run a Kubernetes cluster with a Fluentd agent on each node. Fluentd collects container logs by tailing the /var/log/containers folder with the tail plugin and then forwards them to our Splunk system. When the system generates a huge amount of logs, the tail plugin writes an error to its log: "Timeout flush: tail.containers.var.log.containers". At the same time, the Ruby process consumes about 100% of one CPU core, and logs stop being sent to our Splunk system. Can you fix this bug? Or do you have a guide on how to set up Fluentd in a heavy-load environment?
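For reference, a commonly suggested mitigation for in_tail under heavy log volume is to throttle reads per cycle and spread work across worker processes. This is only a sketch of tuning options, not a confirmed fix; the paths, tag, and parameter values are illustrative assumptions:

```
<system>
  # run several worker processes so one busy Ruby process
  # does not saturate a single CPU core (requires plugins
  # that support multi-worker operation)
  workers 2
</system>

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag tail.containers.*
  # read at most this many lines per read cycle so a single
  # busy file does not starve the event loop (default 1000)
  read_lines_limit 500
  <parse>
    @type none
  </parse>
</source>
```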

To Reproduce

Run fluentd with the tail plugin on a Kubernetes cluster and generate a huge amount of logs (about 20k lines per second).
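Instead of a real workload, the log volume can be approximated with Fluentd's bundled sample input plugin (called `dummy` in older versions). A rough sketch; the tag and rate are assumptions chosen to match the report:

```
<source>
  @type sample
  # emit roughly 20k events per second to simulate the load
  rate 20000
  tag tail.containers.var.log.containers.loadtest.log
  sample {"log": "synthetic log line for load testing", "stream": "stdout"}
</source>
```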

Expected behavior

Fluentd runs smoothly under heavy workload and does not stop sending logs.

Your Environment

- Fluentd version: fluentd 1.14.6
- TD Agent version: - 
- Operating system: Red Hat Enterprise Linux release 8.5 (Ootpa)
- Kernel version: Linux 4.18.0-305.30.1.el8_4.x86_64

Your Configuration

# Events are emitted to the CONCAT label from the container, file and journald sources for multiline processing.
<label @CONCAT>
  # = concat xml and partial logs =
  <filter tail.containers.**>
    @type concat
    key log
    partial_key logtag
    partial_value P
    separator ''
  </filter>
  <filter tail.containers.**>
    @type concat
    key log
    multiline_start_regexp /^(\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{0,3})/
    flush_interval 5s
    separator ""
    timeout_label @SPLUNK
  </filter>
  # = filters for container logs =
  <filter tail.containers.var.log.containers.dns-controller*dns-controller*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-dns*sidecar*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-dns*dnsmasq*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-apiserver*kube-apiserver*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-controller-manager*kube-controller-manager*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-dns-autoscaler*autoscaler*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-proxy*kube-proxy*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-scheduler*kube-scheduler*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  <filter tail.containers.var.log.containers.kube-dns*kubedns*.log>
    @type concat
    key log
    timeout_label @SPLUNK
    stream_identity_key stream
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
    separator ""
    use_first_timestamp true
  </filter>
  # = filters for journald logs =
  <filter journald.kube:kubelet>
    @type concat
    key log
    timeout_label @SPLUNK
    multiline_start_regexp /^\w[0-1]\d[0-3]\d/
    flush_interval 5s
  </filter>

  # Events are relabeled then emitted to the SPLUNK label
  <match **>
    @type relabel
    @label @SPLUNK
  </match>
</label>
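For context, the "Timeout flush" message in the error log below appears to come from fluent-plugin-concat: when a stream receives no new line within `flush_interval`, the buffered partial event is flushed and routed to the `timeout_label`. A longer interval reduces how often this fires, at the cost of end-to-end latency; the value here is only an illustrative assumption:

```
<filter tail.containers.**>
  @type concat
  key log
  multiline_start_regexp /^(\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}\.\d{0,3})/
  # flush an incomplete multiline event if no continuation line
  # arrives within this window; the flushed event is emitted to
  # timeout_label instead of continuing down the normal pipeline
  flush_interval 30s
  separator ""
  timeout_label @SPLUNK
</filter>
```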

<label @SPLUNK>
  # Enrich log with k8s metadata
  <filter tail.containers.**>
    @type kubernetes_metadata
    annotation_match [ "^splunk\.com" ]
    de_dot false
  </filter>
  <filter tail.containers.**>
    @type record_transformer
    enable_ruby
    <record>
      # set the sourcetype from splunk.com/sourcetype pod annotation or set it to kube:container:CONTAINER_NAME
      sourcetype ${record.dig("kubernetes", "annotations", "splunk.com/sourcetype") ? "kube:"+record.dig("kubernetes", "annotations", "splunk.com/sourcetype") : "kube:container:"+record.dig("kubernetes","container_name")}
      container_name ${record.dig("kubernetes","container_name")}
      namespace ${record.dig("kubernetes","namespace_name")}
      pod ${record.dig("kubernetes","pod_name")}
      container_id ${record.dig("docker","container_id")}
      pod_uid ${record.dig("kubernetes","pod_id")}
      container_image ${record.dig("kubernetes","container_image")}
      # set the cluster_name field to the configured value, or default to "cluster_name"
      cluster_name "ocp-stage"
      # set the index field to the value found in the pod splunk.com/index annotations. if not set, use namespace annotation, or default to the default_index
      splunk_index ${record.dig("kubernetes", "annotations", "splunk.com/index") ? record.dig("kubernetes", "annotations", "splunk.com/index") : record.dig("kubernetes", "namespace_annotations", "splunk.com/index") ? (record["kubernetes"]["namespace_annotations"]["splunk.com/index"]) : ("ocp-stage_infra")}
      label_app ${record.dig("kubernetes","labels","app")}
      label_k8s-app ${record.dig("kubernetes","labels","k8s-app")}
      label_release ${record.dig("kubernetes","labels","release")}
      label_deploymentconfig ${record.dig("kubernetes","labels","deploymentconfig")}
      blacklist ${record.dig("kubernetes", "annotations", "splunk.com/exclude") ? record.dig("kubernetes", "annotations", "splunk.com/exclude") : record.dig("kubernetes", "namespace_annotations", "splunk.com/exclude") ? (record["kubernetes"]["namespace_annotations"]["splunk.com/exclude"]) : ("false")}
    </record>
  </filter>
  <filter tail.containers.**>
    # Exclude all logs that are blacklisted
    @type grep
    <exclude>
      key blacklist
      pattern /^true$/
    </exclude>
  </filter>
  # extract pod_uid and container_name for CRIO runtime
  <filter tail.containers.var.log.pods.**>
    @type jq_transformer
    jq '.record | . + (.source | capture("/var/log/pods/(?<pod_uid>[^/]+)/(?<container_name>[^/]+)/(?<container_retry>[0-9]+).log")) | .sourcetype = ("kube:container:" + .container_name) | .splunk_index = "ocp-stage_infra"'
  </filter>
  # create source and sourcetype
  <filter journald.**>
    @type jq_transformer
    jq '.record.source = "/var/log/journal/" + .record.source | .record.sourcetype = (.tag | ltrimstr("journald.")) | .record.cluster_name = "ocp-stage" | .record.splunk_index = "ocp-stage_infra" | .record'
  </filter>
  # = filters for non-container log files =
  # = audit.log =
  <filter tail.rhcos:auditd>
    @type jq_transformer
    jq '.record.sourcetype = "rhcos:auditd" | .record.cluster_name = "ocp-stage" | .record.splunk_index = "auditd_ocp_elec" | .record'
  </filter>
  # extract sourcetype
  <filter tail.file.**>
    @type jq_transformer
    jq '.record.sourcetype = (.tag | ltrimstr("tail.file.")) | .record.cluster_name = "ocp-stage" | .record.splunk_index = "ocp-stage_infra" | .record'
  </filter>

  # = filters for monitor agent =
  <filter monitor_agent>
    @type jq_transformer
    jq ".record.source = \"namespace:#{ENV['MY_NAMESPACE']}/pod:#{ENV['MY_POD_NAME']}\" | .record.sourcetype = \"fluentd:monitor-agent\" | .record.cluster_name = \"ocp-stage\" | .record.splunk_index = \"ocp-stage_infra\" | .record"
  </filter>

  # = custom filters specified by users =
  <filter **>
    @type record_modifier
    # set UTF-8 encoding information to string
    char_encoding utf-8
  </filter>
  # = output =
  <match **>
    @type splunk_hec
    protocol https

    index_key splunk_index
    hec_host "splunk-hec.gu.local"
    hec_port 443
    hec_token "#{ENV['SPLUNK_HEC_TOKEN']}"
    host "#{ENV['NODE_NAME']}"
    coerce_to_utf8 true
    #non_utf8_replacement_string "NONEUTF8"
    source_key source
    sourcetype_key sourcetype
    insecure_ssl true
    <fields>
      # currently CRI does not produce log paths with all the necessary
      # metadata to parse out pod, namespace, container_name, container_id.
      # this may be resolved in the future by this issue: https://github.com/kubernetes/kubernetes/issues/58638#issuecomment-385126031
      container_retry
      pod_uid
      pod
      container_name
      namespace
      container_id
      cluster_name
      label_app
      label_deploymentconfig
      label_k8s-app
      label_release
    </fields>
    <buffer>
      @type memory
      chunk_limit_records 100000
      chunk_limit_size 200m
      flush_interval 5s
      flush_thread_count 1
      overflow_action block
      retry_max_times 3
      total_limit_size 600m
    </buffer>
    <format monitor_agent>
      @type json
    </format>
    <format>
      # we just want to keep the raw logs, not the structure created by docker or journald
      @type single_value
      message_key log
      add_newline false
    </format>
  </match>
</label>
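For heavy sustained load, the output buffer above (in-memory, a single flush thread, `retry_max_times 3`) is a likely bottleneck: one flush thread at ~20k lines/s can fall behind, and with `overflow_action block` the input side then stalls, which matches the observed symptom of logs no longer reaching Splunk. A hedged sketch of a more forgiving buffer; the path and exact values are assumptions that would need tuning:

```
<buffer>
  # a file buffer survives restarts and allows a larger
  # backlog than the 600m in-memory limit
  @type file
  path /var/log/fluentd-buffers/splunk.buffer
  chunk_limit_size 8m
  flush_interval 5s
  # parallel flush threads so HEC round-trip latency does
  # not cap throughput at one chunk at a time
  flush_thread_count 4
  overflow_action block
  total_limit_size 2g
</buffer>
```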

Your Error Log

2022-05-26 16:51:53 +0000 [info]: #0 Timeout flush: journald.kube:kubelet:default
2022-05-26 16:52:00 +0000 [info]: #0 [containers.log] detected rotation of /var/log/containers/logs-loadtestv1-5b446f974f-ddsvv_logs-loadtest_centos79-tools-374d2ea5bde65b78795ab66db899187b8690702b777146d2340a9a7156b9dd88.log; waiting 5 seconds
2022-05-26 16:52:00 +0000 [info]: #0 [containers.log] following tail of /var/log/containers/logs-loadtestv1-5b446f974f-ddsvv_logs-loadtest_centos79-tools-374d2ea5bde65b78795ab66db899187b8690702b777146d2340a9a7156b9dd88.log
2022-05-26 16:52:04 +0000 [info]: #0 Timeout flush: tail.containers.var.log.containers.logs-loadtestv1-5b446f974f-8pv65_logs-loadtest_centos79-tools-2c7caf05f425f4109620561725917d4558e60901cb85a5e1b6ca1f2988e9390e.log:default
2022-05-26 16:52:05 +0000 [info]: #0 stats - namespace_cache_size: 6, pod_cache_size: 1000, pod_cache_watch_updates: 11269, pod_cache_watch_misses: 82514, pod_cache_watch_delete_ignored: 6603, pod_cache_watch_ignored: 6669, namespace_cache_watch_updates: 1189, namespace_cache_api_updates: 95, pod_cache_api_updates: 122, id_cache_miss: 95, namespace_cache_watch_misses: 2, namespace_cache_watch_ignored: 2, pod_cache_host_updates: 4671, pod_cache_miss: 3
2022-05-26 16:52:06 +0000 [info]: #0 [containers.log] detected rotation of /var/log/containers/logs-loadtestv1-5b446f974f-2skzp_logs-loadtest_centos79-tools-9e81eaa01c06fb1641054876af9658d0104bcb63c86a1f9228a0e487f6d93f22.log; waiting 5 seconds
2022-05-26 16:52:06 +0000 [info]: #0 [containers.log] following tail of /var/log/containers/logs-loadtestv1-5b446f974f-2skzp_logs-loadtest_centos79-tools-9e81eaa01c06fb1641054876af9658d0104bcb63c86a1f9228a0e487f6d93f22.log
2022-05-26 16:52:08 +0000 [info]: #0 Timeout flush: tail.containers.var.log.containers.logs-loadtestv1-5b446f974f-ddsvv_logs-loadtest_centos79-tools-374d2ea5bde65b78795ab66db899187b8690702b777146d2340a9a7156b9dd88.log:default

Additional context

No response

fedextm · May 26 '22 17:05

This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or add a comment, or this issue will be closed in 30 days.

github-actions[bot] · Aug 25 '22 10:08

This issue was automatically closed after being stale for 30 days.

github-actions[bot] · Sep 24 '22 10:09