in_tail: Fluentd process goes unresponsive/hung and not tailing any of the log files
Describe the bug
As commented earlier in https://github.com/fluent/fluentd/issues/3614#issuecomment-1231158274, Fluentd goes unresponsive or hangs.
We tried upgrading Fluentd from v1.11.5 to v1.15.2 on one of the servers where we had the issue, but we still face it with the latest build.
[root@aau-lgiu06fms1 06:36:28]:/var/log/td-agent> ls -ltr
total 10024
drwxr-xr-x 2 td-agent td-agent       6 Aug 22 23:54 buffer
-rw-r--r-- 1 td-agent td-agent   59796 Aug 25 03:17 td-agent.log-20220825.gz
-rw-r----- 1 td-agent td-agent   35586 Aug 25 11:00 td-agent.log-20220826.gz
-rw-r----- 1 td-agent td-agent  129378 Aug 26 19:33 td-agent.log-20220827.gz
-rw-r----- 1 td-agent td-agent  878069 Aug 29 20:38 td-agent.log-20220830.gz
-rw-r----- 1 td-agent td-agent   53933 Aug 31 03:47 td-agent.log-20220831.gz
-rw-r--r-- 1 td-agent td-agent     306 Aug 31 15:05 t1.pos
-rw-r--r-- 1 td-agent td-agent      89 Aug 31 15:13 o1.pos
-rw-r--r-- 1 td-agent td-agent      73 Aug 31 15:14 s3.pos
-rw-r--r-- 1 td-agent td-agent      73 Aug 31 15:14 s2.pos
-rw-r--r-- 1 td-agent td-agent      96 Aug 31 15:14 b1.pos
-rw-r--r-- 1 td-agent td-agent     198 Aug 31 15:14 a1.pos
-rw-r--r-- 1 td-agent td-agent   19583 Aug 31 15:14 a2.pos
-rw-r--r-- 1 td-agent td-agent   13033 Aug 31 15:14 a3.pos
-rw-r--r-- 1 td-agent td-agent      68 Aug 31 15:14 s1.pos
-rw-r----- 1 td-agent td-agent 9027630 Aug 31 18:25 td-agent.log-20220901
-rw-r----- 1 td-agent td-agent       0 Sep  1 03:23 td-agent.log
To Reproduce
No error log messages were found in the logs.
Expected behavior
Logs should be pushed as usual after file rotation.
Your Environment
- Fluentd version: 1.11.5 / 1.15.2
- TD Agent version: 3.8.1 / 4.4.1
- Operating system: RHEL 7.9
- Kernel version: 3.10.0-1160.36.2.el7.x86_64
Your Configuration
<source>
@type tail
path /home/server/xx*.log
exclude_path ["/home/server/xxyy*.log"]
pos_file /var/log/td-agent/xxlog.pos
tag xx.log
enable_stat_watcher false
read_from_head true
follow_inodes true
refresh_interval 5
rotate_wait 5
skip_refresh_on_startup true
pos_file_compaction_interval 30m
<parse>
@type multiline
format_firstline /<20/
format1 /<(?<logDate>.*?)>-<(?<logThread>.*?)>-<(?<level>.*?)>-<(?<logClass>.*?)>-(?<logMessage>.*)/
time_key logDate
time_format %Y/%m/%d-%H:%M:%S.%N
</parse>
</source>
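For reference, a line that this parser would treat as the start of a record (a hypothetical example constructed from the format_firstline and format1 patterns above, not an actual log line from the affected server) looks like:

<2022/08/31-15:14:02.123456789>-<worker-thread-1>-<INFO>-<com.example.AppService>-Request processed successfully

Any line not beginning with <20 is appended to the current record as a continuation line, which matters for the multiline behavior discussed below.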
Your Error Log
No specific error log was identified apart from the "pattern not matched" errors.
2022-09-01 06:49:17 +0100 [info]: #0 adding forwarding server 'XXXXXXXX:24224' host="XXXXXXX" port=24224 weight=60 plugin_id="object:94c"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: #0 starting fluentd worker pid=10319 ppid=10316 worker=0
2022-09-01 06:49:17 +0100 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2022-09-01 06:49:17 +0100 [info]: #0 following tail of /usr/ops/e1.log
2022-09-01 06:49:17 +0100 [warn]: #0 pattern not matched: "2022-08-31 15:15:01 [/usr/ops] /usr/ops/e1.log {} -- "
2022-09-01 06:49:17 +0100 [warn]: #0 pattern not matched: "2022-08-31 15:15:01 [/usr/ops] :: total used free shared buff/cache available"
Additional context
We are seeing this issue on both versions, 1.11.5 and 1.15.2.
@ashie, please let me know if any workaround or remediation is available for this td-agent unresponsiveness.
Thanks in advance.
Could you try to get a stack trace of Fluentd's worker process by sending it a CONT signal?
- https://docs.fluentd.org/deployment/trouble-shooting#dump-fluentds-internal-information
- https://github.com/fluent/sigdump#usage
$ kill -CONT <pid of fluentd worker>
$ cat /tmp/sigdump-<pid of fluentd worker>.log
@ashie Thanks for the response. Here are the extracted sigdump files from two instances. Please verify and let me know if you find any issues: sigdmp1.txt sigdmp2.txt
Please let me know if any information needed for further investigation.
@ashie Did you get a chance to go through the above sigdump files? Please let me know if you need any further details to identify the issue.
It seems that your log is the supervisor's, not the worker's. Fluentd launches one parent process (supervisor) and one child process (worker) by default.
[Fluentd Supervisor]
|
+-- [Fluentd Worker]
We need the worker's log. If you don't know which process is the worker, please get logs for all fluentd processes on your system.
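If it helps, one generic way to tell them apart (a sketch assuming a standard td-agent setup; adjust the grep pattern to your service name) is to list the processes and compare PIDs, since the worker's PPID is the supervisor's PID, as in the startup log above (pid=10319 ppid=10316):

$ ps -ef | grep [t]d-agent
# the entry whose PPID equals the other entry's PID is the worker;
# send CONT to that PID to get the worker's sigdump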
@ashie Please find the attached sigdumps of the worker processes: sigdump-L2G.log sigdump-N3S.log. Please verify and let me know if any further information is needed.
@ashie Can you verify these files and let me know if you find any trace of issue in these files?
Thanks in advance.
It seems that the fluentd process isn't stuck; probably it's just waiting for the next line.
Now I've noticed that you are using the multiline parser.
In this case, in_tail won't emit an event until the next <20 line arrives.
How about using multiline_flush_interval?
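For example, a minimal sketch based on the source config above (the 5s value is illustrative; multiline_flush_interval is an in_tail parameter and goes at the source level, not inside <parse>):

<source>
@type tail
path /home/server/xx*.log
pos_file /var/log/td-agent/xxlog.pos
tag xx.log
multiline_flush_interval 5s
<parse>
@type multiline
format_firstline /<20/
format1 /<(?<logDate>.*?)>-<(?<logThread>.*?)>-<(?<level>.*?)>-<(?<logClass>.*?)>-(?<logMessage>.*)/
</parse>
</source>

With this set, in_tail flushes a buffered multiline record when no new line arrives within 5 seconds, instead of waiting indefinitely for the next <20 line.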
@ashie I tried setting multiline_flush_interval 5s, but the unresponsiveness of td-agent.log continues, as shown in these snippets:
The log file shows this one line:
Do I need to modify anything in the output config as well?
<match *.log>
@type forward
require_ack_response true
expire_dns_cache 300
dns_round_robin true
transport tls
tls_cert_path /etc/td-agent/ca-bundle.crt
keepalive true
keepalive_timeout 300s
#include_timestamp true
host XXXXXXXXXXX
port XXXXX
flush_interval 5s
</match>
@ashie Is there anything further I can try to resolve the issue?
@ashie Any update on this, or any way to mitigate the issue?
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 7 days.
@ashie Can we get any update on this? Is there any fix from the product end? We still face this issue, and it seems related to log file rotation.
Do you mean that in_tail does not collect the contents of the file at all?
Does restarting Fluentd not recover it?
If so, #3614 may not be related.
It may be that the config is wrong or that the file format is not supported.
Could you give us some sample logs?
It's been a while, so I'm closing this. If this issue still occurs, please reopen it.