v1.2-debian-cloudwatch crashes where v0.12 works (k8s 1.8.13, CoreOS 1745.5.0)
Deploying for Kubernetes 1.8.13 on CoreOS 1745.5.0 using fluent/fluentd-kubernetes-daemonset.
Deploying with v0.12-debian-cloudwatch works great, as it has in the past. However, after switching to v1.2-debian-cloudwatch, every Pod on every node crashes after about one minute of run time. Occasionally they manage to create a log stream and even ship some entries first, but they always crash. They keep getting restarted, and they just crash again. They also stay in step with each other: after a while they all show exactly the same restart count (e.g. 12), so I'm guessing each run lasts the same amount of time before crashing.
Everything else about the config is unchanged. I wondered whether the Debian image needed more memory, so I removed the memory limit, but on every node in the cluster the container would still run for maybe a minute and then crash.
2018-06-15 21:38:39 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2018-06-15 21:38:46 +0000 [info]: using configuration file: <ROOT>
<match fluent.**>
@type null
</match>
<source>
@type tail
path "/var/log/containers/*.log"
pos_file "/var/log/fluentd-containers.log.pos"
time_format %Y-%m-%dT%H:%M:%S.%NZ
tag "kubernetes.*"
format json
read_from_head true
<parse>
time_format %Y-%m-%dT%H:%M:%S.%NZ
@type json
time_type string
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
<filter kubernetes.**>
@type record_transformer
enable_ruby true
<record>
kubehost ${record.fetch("kubernetes", Hash.new).fetch("host", "unknown_host")}
</record>
</filter>
<match kubernetes.**>
@type cloudwatch_logs
log_group_name "anthill-cluster-containers"
log_stream_name_key "kubehost"
remove_log_group_name_key true
auto_create_stream true
put_log_events_retry_limit 20
</match>
</ROOT>
2018-06-15 21:38:46 +0000 [info]: starting fluentd-1.2.2 pid=5 ruby="2.3.3"
2018-06-15 21:38:46 +0000 [info]: spawn command to main: cmdline=["/usr/bin/ruby2.3", "-Eascii-8bit:ascii-8bit", "/fluentd/vendor/bundle/ruby/2.3.0/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--gemfile", "/fluentd/Gemfile", "--under-supervisor"]
2018-06-15 21:38:50 +0000 [info]: gem 'fluent-plugin-cloudwatch-logs' version '0.5.0'
2018-06-15 21:38:50 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.1.2'
2018-06-15 21:38:50 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.1'
2018-06-15 21:38:50 +0000 [info]: gem 'fluentd' version '1.2.2'
2018-06-15 21:38:50 +0000 [info]: adding match pattern="fluent.**" type="null"
2018-06-15 21:38:50 +0000 [info]: adding filter pattern="kubernetes.**" type="kubernetes_metadata"
2018-06-15 21:38:54 +0000 [info]: adding filter pattern="kubernetes.**" type="record_transformer"
2018-06-15 21:38:54 +0000 [info]: adding match pattern="kubernetes.**" type="cloudwatch_logs"
2018-06-15 21:38:57 +0000 [info]: adding source type="tail"
2018-06-15 21:38:57 +0000 [info]: #0 starting fluentd worker pid=16 ppid=5 worker=0
2018-06-15 21:38:57 +0000 [info]: #0 following tail of /var/log/containers/kube-prometheus-exporter-node-fwnkt_prometheus_node-exporter-1412af047f962327fb4e3f7949fac5028ae156606e68d064240a78d37fd8af65.log
2018-06-15 21:38:57 +0000 [info]: #0 following tail of /var/log/containers/kube-node-drainer-ds-bghgj_kube-system_main-7a733ef08fe677ea9c3998026c6e3149b30ffbf031c9ddfba8450dcb9ce8dae6.log
2018-06-15 21:38:57 +0000 [info]: #0 disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::RecordTransformerFilter] uses `#filter_stream` method.
My config:
<match fluent.**>
@type null
</match>
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
time_format %Y-%m-%dT%H:%M:%S.%NZ
tag kubernetes.*
format json
read_from_head true
</source>
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
<filter kubernetes.**>
@type record_transformer
enable_ruby true
<record>
kubehost ${record.fetch("kubernetes", Hash.new).fetch("host", "unknown_host")}
</record>
</filter>
<match kubernetes.**>
@type cloudwatch_logs
log_group_name "#{ENV['LOG_GROUP_NAME']}"
log_stream_name_key kubehost
remove_log_group_name_key true
auto_create_stream true
put_log_events_retry_limit 20
</match>
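As a side note on the record_transformer filter: the enable_ruby expression is meant to fall back to a placeholder stream name whenever the kubernetes_metadata filter hasn't enriched a record. A minimal standalone sketch of that expression (the sample records below are made up, not real log records):

```ruby
# The same fetch chain the record_transformer filter evaluates per record:
# missing "kubernetes" key -> empty hash -> missing "host" -> "unknown_host".
def kubehost(record)
  record.fetch("kubernetes", Hash.new).fetch("host", "unknown_host")
end

enriched = { "log" => "...", "kubernetes" => { "host" => "ip-10-0-0-1" } } # hypothetical enriched record
bare     = { "log" => "..." }                                             # record with no metadata attached

puts kubehost(enriched) # => ip-10-0-0-1
puts kubehost(bare)     # => unknown_host
```

So even records that miss metadata enrichment should still get routed to an "unknown_host" stream rather than raising, which is why I don't suspect this filter itself.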
Does anyone have a good idea about this issue? Vanilla fluentd v1.2 doesn't have this problem, so we'd like to understand what is going wrong here.