fluentd
fluentd copied to clipboard
Fluentd sends duplicated logs with wildcard in path and follow_inodes enabled
Describe the bug
Fluentd send logs again after log rotation was made even though it should track inodes
To Reproduce
- Write logs with any logs agent that handles log rotation. Configure a max of 2 files. (in my case, was using telegraf 1.22 and writting in a file with max_archives of 1. The configuration is below. Can be reproduce with other agent like metricbeat)
- configure fluentd to tail to those logs
- wait for the rotation to happen
- Fluentd will re read from start the newly rotated log file even though it just read it.
Expected behavior
Fluentd should be aware that the recently rotated log was already read. It should not reread log file that were rotated
Your Environment
- Fluentd version:1.14.6
- Operating system: Amazon linux 2
- telegraf: 1.22
Your Configuration
Fluentd configuration
`<source>
@type tail
@log_level debug
tag metrics-default-statsd-telegraf.*
path /var/log/telegraf/*.out
pos_file /tmp/telegraf/metrics-default-statsd-telegraf.log.pos
pos_file_compaction_interval 72h
path_key source
emit_unmatched_lines true
read_from_head true
follow_inodes true
<parse>
@type json
time_key nil
</parse>
</source>
<match metrics-**>
@type forward
# Sending logs somewhere
</match>
`
Telegraf writting configuration:
`[[outputs.file]]
# ## Files to write to, "stdout" is a specially handled file.
files = ["/var/log/telegraf/metrics_json.out"]
#
# ## Use batch serialization format instead of line based delimiting. The
# ## batch format allows for the production of non line based output formats and
# ## may more effiently encode metric groups.
# # use_batch_format = false
#
# ## The file will be rotated after the time interval specified. When set
# ## to 0 no time based rotation is performed.
rotation_interval = "24h"
#
# ## The logfile will be rotated when it becomes larger than the specified
# ## size. When set to 0 no size based rotation is performed.
rotation_max_size = "50KB"
#
# ## Maximum number of rotated archives to keep, any older logs are deleted.
# ## If set to -1, no archives are removed.
rotation_max_archives = 1
#
# ## Data format to output.
# ## Each data format has its own unique set of configuration options, read
# ## more about them here:
# ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
data_format = "json"
#
# ## The default timestamp format is Unix epoch time, subject to the
# ## resolution configured in json_timestamp_units.
# ## Other timestamp layout can be configured using the Go language time
# ## layout specification from https://golang.org/pkg/time/#Time.Format
# ## e.g.: json_timestamp_format = "2006-01-02T15:04:05Z07:00"
json_timestamp_format = "2006-01-02T15:04:05Z"`
Your Error Log
Duplicated logs.
Additional context
I saw that the issue was also reported by multiple users. I did take a look at the log and checked the code to understand what may happen. Here is my theory:
Fluentd starts
-----------------------refresh_watchers------------------------
target_paths_hash [12593339,12593323]
existence_paths_hash []
hash[12593339] = metrics_json.out
hash[12593323] = metrics_json.2022-08-22-1661183690.out
===================================30 seconds later==============================
+++++++++++++++++++++++telegraf rotates+++++++++++++++++++++++++
hash[12593342] = metrics_json.out
hash[12593339] = metrics_json.2022-08-22-1661183820.out
========================fluentd follow for 5 seconds then delete====================
will follow 12593339 then delete
===================================25 seconds later==============================
-----------------------------refresh_watchers-------------------------
target_paths_hash [12593342,12593339]
existence_paths_hash [12593342]
hash[12593342] = metrics_json.out
hash[12593339] = metrics_json.2022-08-22-1661183820.out
-------Fluentd reads metrics_json.2022-08-22-1661183820.out, which ends up with duplicated logs---------
Seems because fluentd deletes the old entry after the wait is done. Now, the reflex would be to change the path of the tail to:
/var/log/telegraf/metrics_json.out
problem is, if we were to stop fluentd , or that fluentd has some network issue, the logs files keep rotating. That means that during downtimes, using only metrics_json.out, some logs/metrics will be miss...which is a bit problematic... We can also not change the log rotation logique of some of these agents. I believe fluentd should be handle such a case. I dont know how fluentd should be able to handle that, but it should. Maybe after a rotation, it checks if the path in source matches the newly rotated file and if yes, do no apply the function detach_watcher_after_rotate_wait
?