Fluentd in_tail "unreadable" file causes "following tail of <file>" to stop and no logs pushed
Describe the bug
After a warning that a file is "unreadable" (likely due to rotation), no more logs are pushed for it (in_tail + pos_file). Reloading the config or restarting fluentd resolves the issue. All other files being tracked continued to work as expected.
To Reproduce
Not able to reproduce at will.
Expected behavior
Logs should continue to be pushed as usual after file rotation, once fluentd recovers from the temporarily "unreadable" file.
Your Environment
- Fluentd version: 1.14.4
- TD Agent version: 4.2.0
- Operating system: Ubuntu 20.04.3 LTS
- Kernel version: 5.11.0-1022-aws
Your Configuration
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/es-containers.log.pos
tag kubernetes.*
<parse>
@type regexp
expression /^(?<time>[^ ]+) (?<stream>[^ ]+) (?<logtag>[^ ]+) (?<log>.+)$/
time_key time
time_format %Y-%m-%dT%H:%M:%S.%N%z
</parse>
read_from_head true
</source>
Your Error Log
Relevant log entries with some context. When "detected rotation of ..." isn't followed by a "following tail of ...", the log file's contents aren't being processed/pushed:
2022-02-01 01:26:33 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:26:33 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 01:32:53 +0000 [warn]: #0 /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log unreadable. It is excluded and would be examined next time.
2022-02-01 01:32:54 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:32:54 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 01:38:04 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:44:44 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:53:15 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
[...]
---- after issuing a config reload (kill -SIGUSR2 <pid>) it starts to work fine again, i.e. "following tail of ..."
2022-02-01 11:36:19 +0000 [info]: Reloading new config
2022-02-01 11:36:19 +0000 [info]: using configuration file: <ROOT>
[...]
2022-02-01 11:36:19 +0000 [info]: shutting down input plugin type=:tail plugin_id="object:c6c0"
[...]
2022-02-01 11:36:19 +0000 [info]: adding source type="tail"
2022-02-01 11:36:19 +0000 [info]: #0 shutting down fluentd worker worker=0
2022-02-01 11:36:19 +0000 [info]: #0 shutting down input plugin type=:tail plugin_id="object:c6c0"
[...]
2022-02-01 11:36:20 +0000 [info]: #0 restart fluentd worker worker=0
---- the entry below seems to be related to the actual underlying issue... the Ruby object which stopped pushing logs has now been terminated as a new one was created
2022-02-01 11:36:20 +0000 [warn]: #0 /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log already exists. use latest one: deleted #<struct Fluent::Plugin::TailInput::Entry path="/var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log", pos=10740032, ino=1797715, seek=1530>
2022-02-01 11:36:20 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 11:37:30 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 11:37:30 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 11:43:20 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 11:43:20 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
Additional context
This issue seems to be related to #3586, but unfortunately I didn't check the pos file while the issue was happening, so I can't tell if it contained unexpected values for the failing file.
We have encountered the same problem twice. Last month we upgraded fluentd from 1.13.3 (td-agent 4.2.0) to 1.14.3 (td-agent 4.3.0), and since then we have hit the problem.
In our case, we are tracking more than 3 log files with the tail plugin, and only the file with the highest log flow has this problem. From the metrics produced by the prometheus plugin, the file position always advances by at least 400000/sec and rotation happens at least once every 10 minutes.
The input source is set as follows:
<source>
@type tail
format ...
path /var/log/server.log
pos_file /var/log/td-agent/server.log.pos
tag input.server.log
</source>
I will also share the record of our investigation while the problem was occurring. First, fluentd suddenly, without warning, stops tracking newly rotated files and only logs the following to stdout:
detected rotation of /var/log/server.log; waiting 5 seconds
Normally, it would have been followed by an announcement that a new file was being tracked, as follows:
following tail of /var/log/server.log
However, we have not seen this announcement at all since the problem started. Also, when I view the file descriptors that fluentd holds, I see the following:
$ sudo ls "/proc/$(pgrep -f 'td-agent.*--under-supervisor')/fd" -l | grep server
lr-x------ 1 td-agent td-agent 64 Feb 5 01:41 49 -> /var/log/server.3.log
This is the rotated file, and we see that fluentd is still tracking the old file, not the new one. Also, looking at the position file, we see the following:
$ sudo cat /var/log/td-agent/server.log.pos
/var/log/server.log 000000002001f2d0 000000003a000042
/var/log/server.log 0000000000000000 0000000000000000
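For reference, each line in this pos file is a path, a read offset in hex, and an inode number in hex. Below is a minimal decoding sketch (plain Ruby, not fluentd code), assuming the pos file path shown above; the all-zero second line decodes to offset 0 and inode 0.

```ruby
# Minimal decoder for pos file lines: <path> <offset hex> <inode hex>.
# fluentd writes 0xffffffffffffffff as the offset for entries it has unwatched;
# an all-zero entry means "offset 0, inode 0".
File.foreach("/var/log/td-agent/server.log.pos") do |line|
  path, pos_hex, inode_hex = line.split
  next unless path && pos_hex && inode_hex
  printf("%s offset=%d inode=%d\n", path, pos_hex.to_i(16), inode_hex.to_i(16))
end
```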
From what we can see, something triggers fluentd to stop tracking new files altogether during rotation. We have about 300 servers running 24/7 and have only hit the problem twice in the past month, so it seems to be pretty rare. However, since we hadn't had any problems with fluentd 1.13.3 for 6 months, it seems likely that there was some kind of regression in fluentd 1.13.3 -> 1.14.3.
We are planning to downgrade fluentd.
Memo: from v1.14.0 to v1.14.3, the following changes were made to in_tail:
- #3467
- #3478
- #3500
- #3504
- #3541
- #3542
We downgraded td-agent 4.3.0 (fluentd v1.14.3) to td-agent 4.2.0 (fluentd v1.13.3) and still have the problem. As a result of this issue we broadened our monitoring targets from 200 servers (approx 400 files) to 500 servers (approx 1500 files). A week after the downgrade we found 2 servers (2 files) with this issue. So upgrading fluentd 1.13.3 -> 1.14.3 does not seem to be the cause, sorry.
It seemed that some of our log files were affected and some were not. The differences are as follows.
- rotation policy; the unaffected files were using a time based rotation policy (e.g. *.log-%Y%m%d-%s), while the affected files were using a fixed window policy (e.g. *.%i.log). The fixed window policy takes longer to rotate than the time based rotation policy (because it requires renaming all the previous files).
- rotation period; unaffected files are rotated every 30 minutes, while affected files are rotated about every 7 minutes. However, it may be just a matter of probability, since the more rotations we do, the higher the probability that we will hit this problem.
We are planning to change the rotation policy and retry upgrading td-agent, and we will post updates if we have any.
We enabled debug logging with fluentd 1.13.3 and encountered this issue. The state was the same:
- fluentd had been reporting "detected rotation of /var/log/server.log; waiting 5 seconds" without "following tail of /var/log/server.log".
- fluentd kept tracking the old file descriptor.
- the position file had two entries: a correct position and "/var/log/server.log 0000000000000000 0000000000000000".
- the fluentd_tail_file_inode metric had been reporting the same inode since this issue occurred.
In addition, fluentd reported "Skip update_watcher because watcher has been already updated by other inotify event" following "detected rotation of /var/log/server.log; waiting 5 seconds". The message seems to be reported by https://github.com/fluent/fluentd/blob/v1.13.3/lib/fluent/plugin/in_tail.rb#L482. It seems to be strongly related to the position file having two entries.
We searched for other debug logs related to this issue but found none; fluentd reported nothing else at debug level, only pairs of "detected rotation of /var/log/server.log; waiting 5 seconds" and "following tail of /var/log/server.log".
Thanks for your report!
> In addition, fluentd reported "Skip update_watcher because watcher has been already updated by other inotify event" following "detected rotation of /var/log/server.log; waiting 5 seconds". The message seems to be reported by https://github.com/fluent/fluentd/blob/v1.13.3/lib/fluent/plugin/in_tail.rb#L482. It seems to be strongly related to the position file having two entries.

To tell the truth, I was suspecting it, but I couldn't confirm it because I can't reproduce it yet. Your report is very helpful for me.
Although I'm not yet sure of the mechanism of this issue, you might be able to avoid it by disabling the stat watcher (enable_stat_watcher false).
Similar reports:
| issue | date | user | fluentd version | read_from_head | follow_inodes | workaround |
|---|---|---|---|---|---|---|
| #3586 | 2021-12-28 | @jasminehung | v1.14.3 | true | true | pos_file_compaction_interval 3m |
| #3614 | 2022-02-02 | @salavessa | v1.14.4 | true | false | enable_stat_watcher false |
| | 2022-02-14 | @mizunashi-mana | 1.14.3, v1.13.3 | false | false | |
| #3661 | 2022-03-08 | @shenmuxiaosen | v1.13.3 | false | false | v1.12.4 was stable |
| | 2022-07-08 | @ljuaneda | ? | ? | ? | v1.11.5 was stable |
| #3800 | 2022-06-27 | @bjg2 | v1.13.3 | true | false | |
| #3838 | 2022-08-01 | @bertpassek | v1.15.1 | ? | ? | |
From #3586
Your Error Log
There's no error log when this happens, but there were some info logs with special behavior:
2021-12-27 07:37:52 +0000 [info]: #0 detected rotation of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:37:52 +0000 [info]: #0 following tail of /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log
---- at this moment the myapp log stops being sent to elasticsearch (no more "following tail" info log for /var/log/containers/myapp) ----
2021-12-27 07:38:43 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:39:47 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
2021-12-27 07:40:51 +0000 [info]: #0 detected rotation of /var/log/containers/simm-natseventhandler-p2-765f4884fb-qfzhw_simm-a_simm-natseventhandler-p2-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log; waiting 5 seconds
In the pos file we found the records for myapp are strange:
1. There are duplicated rows tracking the myapp container (will fluentd use the last or the first line? or both? or none?). For the other hundreds of normal pods, there is only 1 row per container.
2. A record with position "ffffffffffffffff 0000000000000000" keeps appearing. Does this mean pos_file is not able to track the container?
Results of querying "myapp" from the pos file:
Line 241: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bb1725 00000000000a0142
Line 260: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 292: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000bf8c99 00000000000a0027
Line 302: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 561: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 0000000000afaeea 00000000000a001c
Line 785: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log ffffffffffffffff 0000000000000000
Line 804: /var/log/containers/myapp-765f4884fb-qfzhw_myapp-07e1fdd304cd14fa49ad23e034ef806d16c9b5b08c84fa4f269594fc176464a8.log 000000000016e1a8 00000000000a0019
Additional context
Adding pos_file_compaction_interval and setting it to 3m seems to fix the missing-log problem, but it still doesn't explain the root cause of the case.
There are several similar reports and most of them said:
When tailing is stopped,
202x–xx–xx xx:xx:xx +0000 [info]: #0 detected rotation of /var/log/containersxxx.log; waiting 5 seconds
was observed but
202x–xx–xx xx:xx:xx +0000 [info]: #0 following tail of /var/log/containers/xxx.log
wasn't observed.
In addition, a user reports that the following debug message is observed when the log level is debug:
202x–xx–xx xx:xx:xx +0000 [debug]: #0 Skip update_watcher because watcher has been already updated by other inotify event
So, it seems to be caused by the following code: https://github.com/fluent/fluentd/blob/438a82aead488a86180cd484bbc4e7e344a9032b/lib/fluent/plugin/in_tail.rb#L489-L498
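For readers who don't want to open the link, the sketch below is a self-contained illustration of the guard described in this thread, not a copy of fluentd's source: the rotation handler only proceeds when the inode it remembers for a path still matches the inode currently recorded in the position file, otherwise it skips with the "Skip update_watcher ..." message.

```ruby
# Standalone illustration (names and structure simplified, not fluentd code).
PosEntry = Struct.new(:path, :inode)

def update_watcher(remembered, pos_file)
  current = pos_file[remembered.path]
  unless current && current.inode == remembered.inode
    puts "Skip update_watcher because watcher has been already updated by other inotify event"
    return :skipped
  end
  :updated
end

# The position file holds an all-zero (inode 0) entry for the path, while the
# watcher still remembers the old inode -> the update is skipped, matching the
# inode_in_pos_file=0 warnings reported later in this thread.
pos_file = { "/var/log/server.log" => PosEntry.new("/var/log/server.log", 0) }
remembered = PosEntry.new("/var/log/server.log", 1797715)

p update_watcher(remembered, pos_file) # => :skipped
```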
For what it's worth: after adding enable_stat_watcher false back in March (as mentioned by @ashie here), none of my 120+ servers has experienced this issue again.
@ashie Is this one fixed?
At v1.15.1 we added a log message like the following to investigate this issue:
2022-08-01 17:57:19 +0900 [warn]: #0 Skip update_watcher because watcher has been already updated by other inotify event path="logs/hige.log" inode=59511737 inode_in_pos_file=0
If you see such log message, please report it.
Hello, @ashie! We have something similar: "no more logs pushed", and found duplication in the pos file.
Like this:
/var/log/pods/pod-l4wrv/service/0.log ffffffffffffffff 000000008b00ad36
/var/log/pods/pod-5bmg8/service/0.log 000000000333fd2e 000000005a00673b
/var/log/pods/pod-5bmg8/service/0.log 0000000004b2a353 000000005a00673e
/var/log/pods/pod-5bmg8/service/0.log 0000000003666a64 000000005a00673c
/var/log/pods/pod-5bmg8/service/0.log 0000000006a53eab 000000005a00673e
/var/log/pods/pod-5bmg8/service/0.log 0000000004a21b23 000000005a01e248
/var/log/pods/pod-5bmg8/service/0.log 0000000000000000 0000000000000000
Maybe this can help you.
@vparfonov Could you provide your in_tail config? Do you use follow_inodes true?
@ashie, looks like not, we use the default setting for follow_inodes:
<source>
@type tail
@id container-input
path "/var/log/pods/**/*.log"
exclude_path ["/var/log/pods/openshift-logging_collector-*/*/*.log", "/var/log/pods/openshift-logging_elasticsearch-*/*/*.log", "/var/log/pods/openshift-logging_kibana-*/*/*.log"]
pos_file "/var/lib/fluentd/pos/es-containers.log.pos"
refresh_interval 5
rotate_wait 5
tag kubernetes.*
read_from_head "true"
skip_refresh_on_startup true
@label @MEASURE
<parse>
@type multi_format
<pattern>
format json
time_format '%Y-%m-%dT%H:%M:%S.%N%Z'
keep_time_key true
</pattern>
<pattern>
format regexp
expression /^(?<time>[^\s]+) (?<stream>stdout|stderr)( (?<logtag>.))? (?<log>.*)$/
time_format '%Y-%m-%dT%H:%M:%S.%N%:z'
keep_time_key true
</pattern>
</parse>
</source>
I looked a bit at this now. What seems to happen is that in_tail.rb checks File.readable?, which is false in the case of a broken symlink.
So in the case of log rotation in Kubernetes environments, the symlink to the container log file will be broken for a while.
If there are no log lines output within the timeout after the detected log rotation, in_tail will assume the file to be readable, but no file has actually been created yet.
This could easily be fixed by polling the symlink until it's no longer broken (or until some kind of max_retry is hit).
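A rough sketch of that polling idea (a hypothetical helper, not part of fluentd's API; the path and retry parameters are illustrative):

```ruby
# Hypothetical helper: wait until the symlinked container log is readable
# again (i.e. the symlink target exists after rotation), or give up after
# max_retries attempts.
def wait_until_readable(path, max_retries: 10, interval: 0.5)
  max_retries.times do
    return true if File.readable?(path) # follows the symlink to its target
    sleep interval
  end
  false
end

# Example: wait up to ~5 seconds for a rotated container log to reappear.
# wait_until_readable("/var/log/containers/myapp.log")
```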
I also think it should be documented somewhere that one should install capng to get the capabilities lib working properly. It was not, for us, and we're using the official docker images for running fluentd.
EDIT: This should be reproducible in a local k8s environment by creating a tiny app that outputs logs every 10-30 seconds, and running fluentd with in_tail on the log files there.
It is really weird, however, that the re-examination done on the next refresh does not always pick up that the symlinks are no longer broken and that the file should now be readable. I have also found that in most of the cases where we get no more logs after a detected log rotation, the "detected rotation of ..." log entry was emitted twice for the same file on the same node, while in the cases where fluentd is able to pick up tailing the file again later, the log line appears only once.
I created a test case in test/plugin/test_in_tail.rb just for playing around with this exact event:
sub_test_case "multiple log rotations" do
  data(
    "128" => ["128-0.log", 128, "msg"],
    "256" => ["256-0.log", 256, "msg"],
    "512" => ["512-0.log", 512, "msg"],
    "1024" => ["1024-0.log", 1024, "msg"],
    "2048" => ["2048-0.log", 2048, "msg"],
    "4096" => ["4096-0.log", 4096, "msg"],
    "8192" => ["8192-0.log", 8192, "msg"],
  )
  def test_reproduce_err_after_rotations(data)
    file, num_lines, msg = data
    # seed the initial log file
    File.open("#{@tmp_dir}/#{file}", 'wb') do |f|
      num_lines.times do
        f.puts "#{msg}\n"
      end
    end
    conf = config_element("", "", {
      "path" => "#{@tmp_dir}/*.log.link",
      "pos_file" => "#{@tmp_dir}/tail.pos",
      "refresh_interval" => "1s",
      "read_from_head" => "true",
      "format" => "none",
      "rotate_wait" => "1s",
      "follow_inodes" => "true",
      "tag" => "t1",
    })
    link_name = "#{@tmp_dir}/#{num_lines}.log.link"
    File.symlink(file, link_name)
    dl_opts = {log_level: Fluent::Log::LEVEL_DEBUG}
    logdev = $stdout
    logger = ServerEngine::DaemonLogger.new(logdev, dl_opts)
    log_instance = Fluent::Log.new(logger)
    rotations = 5
    rot_now = 1
    d = create_driver(conf, false)
    d.instance.log = log_instance
    d.run(timeout: 30) do
      sleep 1
      assert_equal(num_lines, d.record_count)
      # rotate logs the way kubelet does: re-point the symlink, then create the new file
      while rot_now <= rotations do
        sleep 2
        puts "unlink #{link_name}"
        File.unlink(link_name)
        puts "symlink #{num_lines}-#{rot_now}.log #{link_name}"
        File.symlink("#{num_lines}-#{rot_now}.log", link_name)
        sleep 1
        File.open("#{@tmp_dir}/#{num_lines}-#{rot_now}.log", 'wb') do |f|
          num_lines.times do
            f.puts "#{msg}\n"
          end
        end
        assert_equal(num_lines*rot_now, d.record_count)
        rot_now = rot_now + 1
      end
    end
  end
end
In this case, it seems to be working properly. But maybe we can help each other in reproducing the error?
UPDATE: Changed to multiple files of different sizes, and changed log rotation to how kubelet actually does it.
UPDATE 2: Code updated! Now it seems I've got something that somehow reproduces the error. In this exact state the tests work fine, but if you comment out the "follow_inodes" => "true" line, the error comes up and the log line "Skip update_watcher because watcher has been..." mentioned by @ashie above is logged all the time. So I think the follow_inodes option is not only important for preventing duplicate log messages, but also for tailing on wildcards and symlinks to work properly!
Hi all,
We are facing a similar kind of issue where the td-agent.log and .pos files are not updated properly. In some cases it fails after log rotation: the new file gets created but nothing is written to it, as shown below. Even if we restart td-agent, the problem is not fixed.
As per some previous comments, I have tried modifying the config with the parameters below, but nothing changes:
enable_stat_watcher false
read_from_head true
follow_inodes true
refresh_interval 5
rotate_wait 5
skip_refresh_on_startup true
pos_file_compaction_interval 30m
As we have systems on RHEL 7/6, we are using fluentd 1.11.5 with td-agent 3.8.1/3.8.0 in our environment. Here is the config we are using, please give us your input.
[root]:/var/log/td-agent> ls -ltr
-rw-r--r-- 1 td-agent td-agent 68 Aug 29 20:00 s2.pos
-rw-r--r-- 1 td-agent td-agent 384 Aug 29 20:00 b1.pos
-rw-r--r-- 1 td-agent td-agent 37509 Aug 29 20:00 a3.pos
-rw-r--r-- 1 td-agent td-agent 41629 Aug 29 20:00 xx.pos
-rw-r----- 1 td-agent td-agent 18562883 Aug 29 20:38 td-agent.log-20220830
-rw-r----- 1 td-agent td-agent        0 Aug 30 03:50 td-agent.log
Config:
<source>
@type tail
path /home/server/xx*.log
exclude_path ["/home/server/xxyy*.log"]
pos_file /var/log/td-agent/xxlog.pos
tag xx.log
enable_stat_watcher false
read_from_head true
follow_inodes true
refresh_interval 5
rotate_wait 5
skip_refresh_on_startup true
pos_file_compaction_interval 30m
<parse>
@type multiline
format_firstline /<20/
format1 /<(?<logDate>.*?)>-<(?<logThread>.*?)>-<(?<level>.*?)>-<(?<logClass>.*?)>-(?<logMessage>.*)/
time_key logDate
time_format %Y/%m/%d-%H:%M:%S.%N
</parse>
</source>
PS. the same config is working fine in a few other instances. I am not sure what the problem is here.
Appreciate your help in advance.
We just ran into the same problem again, without the log line "Skip update_watcher because watcher has been..." showing up.
Now we only get these:
2022-09-13 06:36:37 +0000 [warn]: #0 [in_tail_container_logs] got incomplete line at shutdown from /var/log/containers/redacted-service-name-6b6754b75c-swn6z_backend_redacted-service-name-13f36bd8f73334e603ee6fe80642d971f6ea0b25ce966c6ec494c5b226e4d93c.log: ""
Is it wrong that we are running this in a k8s DaemonSet? Should we install FluentD/FluentBit directly on the VMs instead? Are there any special capabilities needed for the Pods created by the DaemonSet? I'm starting to feel a bit lost in this problem...
> We just ran into the same problem again, without the log line "Skip update_watcher because watcher has been..." showing up. Now we only get these:

Are you using fluentd v1.15.1 or later?
Hi, I came looking for a solution to https://github.com/fluent/fluentd/issues/3661
I tried disabling enable_stat_watcher and reducing pos_file_compaction_interval, but I'm still facing fluentd failing to read logs when rotation happens.
I am seeing this in the other order as well, and some pos files have duplicate entries:
/var/log/containers/pod1-1-6b755c6694-flbxr_namespace1_pod1-1-f904346d3f6c060b78a594cef2e3161abd1eb10751b13b548ebe28e85c58fd51.log ffffffffffffffff 000000000062ad18
/var/log/containers/pod1-1-6b755c6694-flbxr_namespace1_pod1-1-f904346d3f6c060b78a594cef2e3161abd1eb10751b13b548ebe28e85c58fd51.log ffffffffffffffff 000000000062ac7f
/var/log/containers/pod1-1-6b755c6694-flbxr_namespace1_pod1-1-f904346d3f6c060b78a594cef2e3161abd1eb10751b13b548ebe28e85c58fd51.log ffffffffffffffff 000000000062ad18
/var/log/containers/pod1-1-6b755c6694-flbxr_namespace1_pod1-1-f904346d3f6c060b78a594cef2e3161abd1eb10751b13b548ebe28e85c58fd51.log ffffffffffffffff 000000000062ac7f
/var/log/containers/pod1-1-6b755c6694-flbxr_namespace1_pod1-1-f904346d3f6c060b78a594cef2e3161abd1eb10751b13b548ebe28e85c58fd51.log ffffffffffffffff 000000000062ad18
/var/log/containers/pod1-1-6b755c6694-flbxr_namespace1_pod1-1-f904346d3f6c060b78a594cef2e3161abd1eb10751b13b548ebe28e85c58fd51.log ffffffffffffffff 000000000062ac7f
Do we have any updates on this @ashie?
Are there any workarounds for this, that anyone has been able to detect?
My current thoughts are:
- Run FluentD on VM directly, instead of running it as a DaemonSet in K8S (w. mounted /var/log)
- Create our own tail plugin, that might be able to do a better job with tailing and log rotations.
- E.g. Not only watch the symlinks, but also watch the files symlinks point to, to be able to get an event-driven async way of detecting log rotation, and when new log files are created.
- Alternatively create some middle-man app that will tail and detect log rotations properly, and send logs to FluentD via e.g. http
Please contribute with whatever solutions you've set up to get this working as expected. In our company, we must follow government laws about audit logging, so we cannot afford to lose logs due to this bug.
> We just ran into the same problem again, without the log line "Skip update_watcher because watcher has been..." showing up. Now we only get these:
>
> Are you using fluentd v1.15.1 or later?

Yes. We are currently using the Docker image fluent/fluentd:v1.15.1-debian-1.1
Fluentd v1.15.1 with enable_stat_watcher false, pos_file_compaction_interval 3m and follow_inodes true still drops 50~70% of logs for us. The load is around 20k logs/s.
We see the pos file has duplicate entries and all-zero inode numbers.
Our workaround: Using fluentbit in_tail plugin, then out_forward to fluentd using existing rules.
This bug happens in our environment on version 1.15.1 with all the workarounds above. It drops almost 70% of logs at 20k logs/s.
We see exactly the same pos file corruption mentioned in this ticket.
Our workaround: change to the fluentbit tail plugin, with the out_forward plugin to fluentd to keep the existing rules.
> At v1.15.1 we added a log message like the following to investigate this issue:
> 2022-08-01 17:57:19 +0900 [warn]: #0 Skip update_watcher because watcher has been already updated by other inotify event path="logs/hige.log" inode=59511737 inode_in_pos_file=0
> If you see such log message, please report it.

Getting a similar log:
2022-11-14 23:14:57 +0000 [info]: #0 detected rotation of /var/log/pods/processor-0-65dc565cc8-hq469_0f06669b-4521-4297-80db-ead8aaecde4e/core-ford-processor-0/1.log; waiting 5 seconds
2022-11-14 23:14:57 +0000 [warn]: #0 Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/pods/processor-0-65dc565cc8-hq469_0f06669b-4521-4297-80db-ead8aaecde4e/core-ford-processor-0/1.log" inode=4397239 inode_in_pos_file=0
fluentd v1.15-debian-opensearch-1 -> gem 'fluentd' version '1.15.3'
<source>
@type tail
@id in_tail_container_logs
path /var/log/containers/*.log
follow_inodes true
refresh_interval 10
rotate_wait 1
enable_stat_watcher false
pos_file /var/log/fluentd-containers.log.pos
pos_file_compaction_interval 12h
tag raw.kubernetes.*
read_from_head true
read_bytes_limit_per_second 1048576000
max_line_size 2MB
<parse>
@type multi_format
<pattern>
format json
time_key time
time_type string
time_format "%Y-%m-%dT%H:%M:%S.%NZ"
keep_time_key false
</pattern>
<pattern>
format regexp
expression /^(?<time>.+) (?<stream>stdout|stderr)( (.))? (?<log>.*)$/
time_format '%Y-%m-%dT%H:%M:%S.%NZ'
keep_time_key false
</pattern>
</parse>
emit_unmatched_lines true
</source>
Last logs of fluentd before pushing stops:
2023-01-25 17:55:53 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/pod-68f6cb4bb4-vlfcq_default_gateway-rest-23c1ac09792e85465483b2d641e58fa6d5d8181924bc7b08f739779b2bf1d02f.log; waiting 1.0 seconds
2023-01-25 17:55:53 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/pod-68f6cb4bb4-vlfcq_default_gateway-rest-23c1ac09792e85465483b2d641e58fa6d5d8181924bc7b08f739779b2bf1d02f.log
When I restart fluentd, it starts to push all the logs that were missing, if they are still on the node.
We configured docker to produce log files of 500gb, and we keep 1 rotated file.
We encountered the same error. I've been trying to reproduce without success so far. In our case we did a large upgrade from version 1.6.3 to 1.15.3.
Our failing deployment is td-agent with file tailing running as a Kubernetes daemonset.
Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/k8s-...-controller.log" inode=2883733 inode_in_pos_file=2883972
@ashie we have a customer who claims they applied the following patch and the issue goes away. I do not have a reliable way to reproduce the problem. We are using v1.14.6
--- in_tail.rb.org 2023-02-13 17:04:42.353997075 +0900
+++ in_tail.rb 2023-02-13 17:04:39.030016491 +0900
@@ -354,13 +354,15 @@
def existence_path
hash = {}
- @tails.each_key {|target_info|
- if @follow_inodes
- hash[target_info.ino] = target_info
- else
- hash[target_info.path] = target_info
- end
- }
+ if @follow_inodes
+ @tails.each {|ino, tw|
+ hash[tw.ino] = TargetInfo.new(tw.path, tw.ino)
+ }
+ else
+ @tails.each {|path, tw|
+ hash[tw.path] = TargetInfo.new(tw.path, tw.ino)
+ }
+ end
hash
end
@@ -441,8 +443,11 @@
begin
target_info = TargetInfo.new(target_info.path, Fluent::FileWrapper.stat(target_info.path).ino)
- @tails.delete(target_info)
- @tails[target_info] = tw
+ if @follow_inodes
+ @tails[target_info.ino] = tw
+ else
+ @tails[target_info.path] = tw
+ end
tw.on_notify
rescue Errno::ENOENT, Errno::EACCES => e
$log.warn "stat() for #{target_info.path} failed with #{e.class.name}. Drop tail watcher for now."
@@ -462,9 +467,17 @@
def stop_watchers(targets_info, immediate: false, unwatched: false, remove_watcher: true)
targets_info.each_value { |target_info|
if remove_watcher
- tw = @tails.delete(target_info)
+ if @follow_inodes
+ tw = @tails.delete(target_info.ino)
+ else
+ tw = @tails.delete(target_info.path)
+ end
else
- tw = @tails[target_info]
+ if @follow_inodes
+ tw = @tails[target_info.ino]
+ else
+ tw = @tails[target_info.path]
+ end
end
if tw
tw.unwatched = unwatched
@@ -478,10 +491,19 @@
end
def close_watcher_handles
- @tails.keys.each do |target_info|
- tw = @tails.delete(target_info)
- if tw
- tw.close
+ if @follow_inodes
+ @tails.keys.each do |ino|
+ tw = @tails.delete(ino)
+ if tw
+ tw.close
+ end
+ end
+ else
+ @tails.keys.each do |path|
+ tw = @tails.delete(path)
+ if tw
+ tw.close
+ end
end
end
end
@@ -500,26 +522,25 @@
end
rotated_target_info = TargetInfo.new(target_info.path, pe.read_inode)
- rotated_tw = @tails[rotated_target_info]
- new_target_info = target_info.dup
if @follow_inodes
+ rotated_tw = @tails[rotated_target_info.ino]
+ new_target_info = target_info.dup
new_position_entry = @pf[target_info]
if new_position_entry.read_inode == 0
# When follow_inodes is true, it's not cleaned up by refresh_watcher.
# So it should be unwatched here explicitly.
rotated_tw.unwatched = true
- # Make sure to delete old key, it has a different ino while the hash key is same.
- @tails.delete(rotated_target_info)
- @tails[new_target_info] = setup_watcher(new_target_info, new_position_entry)
- @tails[new_target_info].on_notify
+ @tails[new_target_info.ino] = setup_watcher(new_target_info, new_position_entry)
+ @tails[new_target_info.ino].on_notify
end
else
- # Make sure to delete old key, it has a different ino while the hash key is same.
- @tails.delete(rotated_target_info)
- @tails[new_target_info] = setup_watcher(new_target_info, pe)
- @tails[new_target_info].on_notify
+ rotated_tw = @tails[rotated_target_info.path]
+ new_target_info = target_info.dup
+
+ @tails[new_target_info.path] = setup_watcher(new_target_info, pe)
+ @tails[new_target_info.path].on_notify
end
detach_watcher_after_rotate_wait(rotated_tw, pe.read_inode) if rotated_tw
end
--- position_file.rb.org 2023-02-13 15:02:14.671608849 +0900
+++ position_file.rb 2023-02-13 15:06:09.098223852 +0900
@@ -250,20 +250,6 @@
end
end
- TargetInfo = Struct.new(:path, :ino) do
- def ==(other)
- return false unless other.is_a?(TargetInfo)
- self.path == other.path
- end
-
- def hash
- self.path.hash
- end
-
- def eql?(other)
- return false unless other.is_a?(TargetInfo)
- self.path == other.path
- end
- end
+ TargetInfo = Struct.new(:path, :ino)
end
end
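For context on why the patch drops TargetInfo's custom path-only equality: with hash/eql? defined on the path alone, a Ruby Hash treats two entries for the same path but different inodes as the same key, and the Hash keeps the old key object (with its stale inode) while replacing the value. The standalone sketch below illustrates that behavior; it is not fluentd code, just the same Struct idiom in isolation.

```ruby
# Illustration: a Struct keyed only by :path in hash/eql? makes a Hash collapse
# two different inodes for the same path into one entry, and the Hash keeps the
# OLD key object (old inode) while storing the NEW value.
TargetInfo = Struct.new(:path, :ino) do
  def hash; path.hash; end
  def eql?(other); other.is_a?(TargetInfo) && path == other.path; end
  def ==(other); eql?(other); end
end

tails = {}
tails[TargetInfo.new("/var/log/server.log", 111)] = :old_watcher
tails[TargetInfo.new("/var/log/server.log", 222)] = :new_watcher

p tails.size            # => 1
p tails.keys.first.ino  # => 111  (stale inode kept as the key)
p tails.values.first    # => :new_watcher
```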
our typical config is:
<source>
@type tail
@id container-input
path "/var/log/pods/*/*/*.log"
exclude_path ["/var/log/pods/openshift-logging_collector-*/*/*.log", "/var/log/pods/openshift-logging_elasticsearch-*/*/*.log", "/var/log/pods/openshift-logging_kibana-*/*/*.log", "/var/log/pods/openshift-logging_*/loki*/*.log", "/var/log/pods/*/*/*.gz", "/var/log/pods/*/*/*.tmp"]
pos_file "/var/lib/fluentd/pos/es-containers.log.pos"
follow_inodes true
refresh_interval 5
rotate_wait 5
tag kubernetes.*
read_from_head "true"
skip_refresh_on_startup true
@label @CONCAT
<parse>
@type regexp
expression /^(?<@timestamp>[^\s]+) (?<stream>stdout|stderr) (?<logtag>[F|P]) (?<message>.*)$/
time_key '@timestamp'
keep_time_key true
</parse>
</source>
One significant difference I do see between the original report of the issue and how we collect logs is the symlink. Collecting from /var/log/pods, which is the preferred kubernetes location AFAIK, does not utilize symlinks as was done with /var/log/containers.
@ashie I believe the issue is because of filesystem reuse of inodes and fluentd using the inode as the "key" for maintaining watches. Our customer provided the following, where the position entry is converted to decimal:
-- Pos file
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 18446744073709551616 195064732
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 18446744073709551616 195064733
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 58444023 195064732
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 35228450 195064763
Note there are multiple entries with the same inode where one is identified as "unwatched" and the other is still open. The logic then closes the watch on the inode even though it is still in use.
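For anyone else debugging this, a small standalone script along the lines below (not fluentd code; the pos file path is the one from the config above, adjust to yours) can surface entries whose recorded inode no longer matches what is currently on disk:

```ruby
# Diagnostic sketch: flag pos file entries whose recorded inode no longer
# matches the file currently at that path. ffffffffffffffff offsets mark
# entries fluentd has unwatched; all-zero entries are placeholders.
POS_FILE = "/var/lib/fluentd/pos/es-containers.log.pos" # adjust to your pos_file

File.foreach(POS_FILE) do |line|
  path, pos_hex, inode_hex = line.split
  next unless path && pos_hex && inode_hex
  recorded = inode_hex.to_i(16)
  actual = File.exist?(path) ? File.stat(path).ino : nil
  status = if actual.nil? then "missing on disk"
           elsif actual == recorded then "ok"
           else "MISMATCH (on-disk inode #{actual})"
           end
  puts "#{path} offset=#{pos_hex.to_i(16)} inode=#{recorded} -> #{status}"
end
```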