
Fluentd sends duplicated logs with wildcard in path and follow_inodes enabled

Open · bloodguard opened this issue 1 year ago · 0 comments

Describe the bug

Fluentd re-sends logs after a log rotation has happened, even though it should be tracking inodes.

To Reproduce

  • Write logs with any log agent that handles log rotation, configured to keep a maximum of 2 files. (In my case I was using telegraf 1.22, writing to a file with rotation_max_archives = 1; the configuration is below. It can be reproduced with other agents such as metricbeat, or with the rotation sketch after this list.)
  • Configure fluentd to tail those logs.
  • Wait for the rotation to happen.
  • Fluentd will re-read the newly rotated log file from the start, even though it just read it.
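
If you do not want to install telegraf, the rotation pattern itself is easy to fake. Below is a minimal Python sketch (the directory, sizes, interval, and archive naming are my own illustrative choices, not taken from telegraf) that appends JSON lines to an active file and rotates it by size while keeping a single archive; pointing the fluentd `path` at the directory with a wildcard should be enough to trigger the behaviour after the first rotation.

```python
#!/usr/bin/env python3
"""Simulate telegraf-style size-based rotation with a single archive (hypothetical repro helper)."""
import glob
import json
import os
import time

LOG_DIR = "/tmp/rotation-test"                    # point fluentd's `path` at /tmp/rotation-test/*.out
ACTIVE = os.path.join(LOG_DIR, "metrics_json.out")
MAX_SIZE = 50 * 1024                              # mirrors rotation_max_size = "50KB"
MAX_ARCHIVES = 1                                  # mirrors rotation_max_archives = 1

def rotate():
    """Rename the active file to a timestamped archive and prune older archives."""
    archive = os.path.join(LOG_DIR, f"metrics_json.{int(time.time())}.out")
    os.rename(ACTIVE, archive)
    archives = sorted(glob.glob(os.path.join(LOG_DIR, "metrics_json.*.out")))
    for old in archives[:-MAX_ARCHIVES]:
        os.remove(old)

def main():
    os.makedirs(LOG_DIR, exist_ok=True)
    while True:
        with open(ACTIVE, "a") as f:
            f.write(json.dumps({"metric": "cpu", "value": 0.42, "ts": time.time()}) + "\n")
        if os.path.getsize(ACTIVE) >= MAX_SIZE:
            rotate()
        time.sleep(0.1)

if __name__ == "__main__":
    main()
```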

Expected behavior

Fluentd should be aware that the recently rotated log was already read; it should not re-read log files that were rotated.

Your Environment

- Fluentd version: 1.14.6
- Operating system: Amazon Linux 2
- telegraf: 1.22

Your Configuration

Fluentd configuration
```
<source>
  @type tail
  @log_level debug
  tag metrics-default-statsd-telegraf.*
  path /var/log/telegraf/*.out
  pos_file /tmp/telegraf/metrics-default-statsd-telegraf.log.pos
  pos_file_compaction_interval 72h
  path_key source
  emit_unmatched_lines true
  read_from_head true
  follow_inodes true
  <parse>
    @type json
    time_key nil
  </parse>
</source>

<match metrics-**>
  @type forward
  # Sending logs somewhere
</match>
```

Telegraf file output configuration:
```toml
[[outputs.file]]
#   ## Files to write to, "stdout" is a specially handled file.
  files = ["/var/log/telegraf/metrics_json.out"]
#
#   ## Use batch serialization format instead of line based delimiting.  The
#   ## batch format allows for the production of non line based output formats and
#   ## may more efficiently encode metric groups.
#   # use_batch_format = false
#
#   ## The file will be rotated after the time interval specified.  When set
#   ## to 0 no time based rotation is performed.
  rotation_interval = "24h"
#
#   ## The logfile will be rotated when it becomes larger than the specified
#   ## size.  When set to 0 no size based rotation is performed.
  rotation_max_size = "50KB"
#
#   ## Maximum number of rotated archives to keep, any older logs are deleted.
#   ## If set to -1, no archives are removed.
  rotation_max_archives = 1
#
#   ## Data format to output.
#   ## Each data format has its own unique set of configuration options, read
#   ## more about them here:
#   ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "json"
#
#   ## The default timestamp format is Unix epoch time, subject to the
#   ## resolution configured in json_timestamp_units.
#   ## Other timestamp layout can be configured using the Go language time
#   ## layout specification from https://golang.org/pkg/time/#Time.Format
#   ## e.g.: json_timestamp_format = "2006-01-02T15:04:05Z07:00"
  json_timestamp_format = "2006-01-02T15:04:05Z"
```

Your Error Log

Duplicated logs.

Additional context

I saw that this issue has also been reported by multiple users. I took a look at the logs and checked the code to understand what may be happening. Here is my theory:

Fluentd starts
-----------------------refresh_watchers------------------------
target_paths_hash [12593339,12593323]
existence_paths_hash []
hash[12593339] = metrics_json.out
hash[12593323] = metrics_json.2022-08-22-1661183690.out
===================================30 seconds later==============================
+++++++++++++++++++++++telegraf rotates+++++++++++++++++++++++++
hash[12593342] = metrics_json.out
hash[12593339] = metrics_json.2022-08-22-1661183820.out
========================fluentd follows the old inode for 5 seconds (rotate_wait), then deletes the watcher====================
will follow 12593339 then delete it
===================================25 seconds later==============================
-----------------------------refresh_watchers-------------------------
target_paths_hash [12593342,12593339]
existence_paths_hash [12593342]
hash[12593342] = metrics_json.out
hash[12593339] = metrics_json.2022-08-22-1661183820.out

-------Fluentd reads metrics_json.2022-08-22-1661183820.out, which ends up with duplicated logs---------
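
To make that theory concrete, here is a simplified Python model of how I understand the inode bookkeeping; the class, method names, and data structures (`TailModel`, `refresh`, `detach_after_rotate_wait`, the `watchers` dict) are illustrative stand-ins of my own, not the real in_tail code.

```python
# Simplified model of my understanding of refresh_watchers with follow_inodes enabled.
import glob
import os

class TailModel:
    def __init__(self, pattern, read_from_head=True):
        self.pattern = pattern
        self.read_from_head = read_from_head
        self.watchers = {}  # inode -> current read offset ("position")

    def refresh(self):
        """Periodic refresh: glob the configured path and key the results by inode."""
        targets = {os.stat(p).st_ino: p for p in glob.glob(self.pattern)}
        for ino, path in targets.items():
            if ino not in self.watchers:
                # An inode we are not (or no longer) tracking looks brand new,
                # so with read_from_head we start again from offset 0.
                offset = 0 if self.read_from_head else os.path.getsize(path)
                self.watchers[ino] = offset
                print(f"watch inode {ino} ({path}) from offset {offset}")

    def detach_after_rotate_wait(self, ino):
        """My theory: after rotate_wait the watcher and its position are dropped,
        so the archive's inode is forgotten even though the file was fully read."""
        self.watchers.pop(ino, None)
```

In this model, because the wildcard still matches the rotated archive and the detach dropped both the watcher and its position, the next refresh treats the archive's inode as brand new and reads it from offset 0, which is exactly the duplication described above.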

This seems to happen because fluentd deletes the old entry after the rotate wait is done. The obvious reflex would be to narrow the tail path to /var/log/telegraf/metrics_json.out, but if fluentd is stopped, or hits a network issue, the log files keep rotating; during such downtime, tailing only metrics_json.out means some logs/metrics would be missed, which is a bit problematic. We also cannot change the log-rotation logic of these agents.

I believe fluentd should handle such a case. I don't know exactly how it should do so, but it should. Maybe after a rotation it could check whether the rotated file still matches the path in the source and, if it does, not apply detach_watcher_after_rotate_wait? A hypothetical sketch of that check follows.
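
Purely as an illustration of the idea (this is not a patch against in_tail, just a hypothetical helper written in Python):

```python
import fnmatch

def should_detach_after_rotate(rotated_path: str, source_pattern: str) -> bool:
    """Hypothetical check: only detach the watcher of a rotated file when the
    rotated file no longer matches the configured <source> path pattern.
    If it still matches, keeping the watcher (and its position) would avoid
    re-reading a file that was already read."""
    return not fnmatch.fnmatch(rotated_path, source_pattern)

# The rotated archive still matches /var/log/telegraf/*.out, so the detach would be skipped:
print(should_detach_after_rotate(
    "/var/log/telegraf/metrics_json.2022-08-22-1661183820.out",
    "/var/log/telegraf/*.out"))  # -> False
```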

bloodguard · Aug 23 '22 13:08