fluentd File watchers might not be handled properly causing gradual increase in CPU/Memory usage

Describe the bug

The Fluentd tail plugin was logging `If you keep getting this message, please restart Fluentd`. After coming across https://github.com/fluent/fluentd/issues/3614, we implemented the workaround suggested there:

- changed `follow_inodes` to `true`
- set `rotate_wait` to `0`

Since then we no longer see the original `If you keep getting this message, please restart Fluentd` message, but we still see many `Skip update_watcher because watcher has been already updated by other inotify event` messages. These coincide with a memory leak and a gradual increase in CPU usage that continue until a restart occurs.

To mitigate this I added `pos_file_compaction_interval 20m`, as suggested here, but this had no effect on resource usage.
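One way to check whether compaction is actually trimming the position file is to count its entries over time. The sketch below is a minimal, unofficial helper: it assumes each pos_file line has the form path, TAB, 16-hex-digit offset, TAB, 16-hex-digit inode, and that an offset of `ffffffffffffffff` marks an entry that is no longer watched. Both assumptions should be verified against the in_tail version in use. The function name `summarize_pos_file` is hypothetical.

```python
import re
from collections import Counter

# Assumed entry layout: path, TAB, 16-hex-digit offset, TAB, 16-hex-digit inode.
ENTRY = re.compile(r"^([^\t]+)\t([0-9a-fA-F]{16})\t([0-9a-fA-F]{16})$")
# Assumed marker for entries that in_tail no longer watches.
UNWATCHED = 0xFFFFFFFFFFFFFFFF


def summarize_pos_file(text):
    """Count total, unwatched, and duplicate-path entries in pos_file text."""
    paths = Counter()
    unwatched = 0
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        path, pos, _inode = match.groups()
        paths[path] += 1
        if int(pos, 16) == UNWATCHED:
            unwatched += 1
    return {
        "entries": sum(paths.values()),
        "unwatched": unwatched,
        "duplicate_paths": sum(1 for n in paths.values() if n > 1),
    }
```

Reading `/var/log/fluentd-containers.log.pos` (the `pos_file` from the configuration in this report) and passing its contents to `summarize_pos_file` every few minutes would show whether the entry count keeps growing. With `follow_inodes true`, the same path can legitimately appear more than once with different inodes, so a nonzero `duplicate_paths` is not an error by itself.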

Related to https://github.com/fluent/fluentd/issues/3614. More specifically https://github.com/fluent/fluentd/issues/3614#issuecomment-1871484810

The suspicion is that some watchers are not handled properly and therefore leak, increasing CPU/memory consumption until the next restart.
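A rough way to test this suspicion on a node is to count the inotify watches held by the Fluentd worker process. The sketch below assumes Linux and assumes that in_tail's watchers are backed by inotify in this setup (the log message mentions inotify events, but this is not confirmed here). It relies on the kernel listing one `inotify wd:` line per watch in `/proc/<pid>/fdinfo/<fd>`. The function name and the `proc_root` parameter are hypothetical, the latter added only to make the function testable.

```python
import os


def count_inotify_watches(pid, proc_root="/proc"):
    """Return the number of inotify watches held by process `pid`.

    On Linux, each inotify file descriptor lists one "inotify wd:..." line
    per watch in <proc_root>/<pid>/fdinfo/<fd>.
    """
    fdinfo_dir = os.path.join(proc_root, str(pid), "fdinfo")
    total = 0
    for name in os.listdir(fdinfo_dir):
        try:
            with open(os.path.join(fdinfo_dir, name)) as f:
                total += sum(1 for line in f if line.startswith("inotify wd:"))
        except OSError:
            continue  # the descriptor was closed while we were scanning
    return total
```

If the count for the worker PID keeps climbing while the number of files matching `path` stays roughly constant, that would support the leak hypothesis. Reading another process's `fdinfo` usually requires running as the same user or as root.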

To Reproduce

Deploy fluentd (version v1.16.3-debian-forward-1.0) as a daemonset in a dynamic kubernetes cluster. The cluster consists of 50-100 nodes. The fluentd config is listed under Your Configuration.

Expected behavior

CPU / Memory should stay stable.
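To make "stable" checkable, resident memory can be sampled periodically and tested for steady growth. The sketch below parses the `VmRSS` field from the text of `/proc/<pid>/status` on Linux. The growth check is a simple heuristic, and its 20% threshold is an arbitrary assumption rather than anything taken from Fluentd. Both function names are hypothetical.

```python
import re


def parse_vm_rss_kib(status_text):
    """Extract VmRSS (resident memory, in KiB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_steadily_growing(samples, min_growth_ratio=1.2):
    """Heuristic: no sample is lower than the previous one, and the last
    sample exceeds the first by at least min_growth_ratio (assumed threshold)."""
    if len(samples) < 2:
        return False
    monotonic = all(b >= a for a, b in zip(samples, samples[1:]))
    return monotonic and samples[-1] >= samples[0] * min_growth_ratio
```

Collecting one `parse_vm_rss_kib` value per interval for the Fluentd worker and passing the list to `is_steadily_growing` gives a crude signal. A real deployment would more likely rely on the cluster's existing metrics.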

Your Environment

- Fluentd version: [v1.16.3-debian-forward-1.0](https://github.com/fluent/fluentd-kubernetes-daemonset#:~:text=debian%2Dcloudwatch%2D1-,Forward,-docker%20pull%20fluent)

Your Configuration

<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true
  rotate_wait 0
  exclude_path ["/var/log/containers/fluentd*.log", "/var/log/containers/*kube-system*.log", "/var/log/containers/*calico-system*.log", "/var/log/containers/prometheus-node-exporter*.log", "/var/log/containers/opentelemetry-agent*.log"]
  pos_file_compaction_interval 20m
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_type string
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      keep_time_key true
    </pattern>
    <pattern>
      format /^(?<time>.+?) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.+)$/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </pattern>
  </parse>
  emit_unmatched_lines true
</source>



### Your Error Log

```shell
Skip update_watcher because watcher has been already updated by other inotify event
```

Additional context

https://github.com/fluent/fluentd/issues/3614

Jan 15 '24 09:01 uristernik

Thanks for your report!

> Fluentd tail plugin was outputting If you keep getting this message, please restart Fluentd. After coming across https://github.com/fluent/fluentd/issues/3614, we implemented the workaround suggested there.
>
> changed follow_inodes to true
>
> set rotate_wait to 0

So `follow_inodes false` has a similar issue. Could you please report the `follow_inodes false` problem in a new issue?

Jan 16 '24 09:01 daipom

@daipom In this case I had follow_inodes true

Do you want me to open a new issue just for tracking?

Jan 23 '24 09:01 uristernik

@uristernik Wasn't there a problem with `follow_inodes false` as well? I'd like to sort out the `follow_inodes false` problem and the `follow_inodes true` problem separately.

I'd like to know if there is any difference between `follow_inodes false` and `follow_inodes true`. For example, whether the same resource leak occurs with `follow_inodes false`.

If there is no particular difference, we are fine with this for now. Thanks!

Jan 23 '24 10:01 daipom
