Fluentd forwarder status page responds with a huge delay when the aggregator node is responding slowly
OS: CentOS 7
Fluentd version: td-agent-3.2.0-0.el7.x86_64
When the aggregator node is failing, or is responding very slowly under heavy load, it can take up to 1-2 minutes to fetch the status page /api/plugins.json on a forwarder node.
Steps to reproduce
Forwarder config
<source>
  @type monitor_agent
  bind 127.0.0.1
  port 24220
</source>

<source>
  @type forward
  bind 127.0.0.1
  port 24224
</source>

<match **>
  @type forward
  heartbeat_type tcp
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  # increased this while testing
  phi_threshold 160000
  hard_timeout 120s
  <server>
    name logs1
    host 172.31.3.5
    port 8889
    weight 60
  </server>
  flush_interval 10s
  buffer_type file
  buffer_path /var/log/fluentd/buffer/forward
  buffer_chunk_limit 4m
  buffer_queue_limit 4096
  num_threads 2
  expire_dns_cache 600
</match>
I have a service send logs to the forwarder.
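For completeness, a minimal stand-in for that service could look like the sketch below. It writes a single event to the forwarder's in_forward source on 127.0.0.1:24224. The tag, record, and `send_event` helper are made up for illustration; the assumption is that in_forward's JSON mode accepts a newline-delimited `[tag, timestamp, record]` array (msgpack is the usual wire format).

```python
import json
import socket
import time

# Hypothetical traffic generator for the repro, not the real service.
# Assumption: in_forward parses a JSON-encoded [tag, timestamp, record]
# array; for production traffic msgpack is the normal encoding.
def send_event(host, port, tag, record):
    payload = json.dumps([tag, int(time.time()), record]).encode()
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload + b"\n")

try:
    send_event("127.0.0.1", 24224, "repro.test", {"message": "hello forwarder"})
    print("sent")
except OSError as exc:
    print(f"could not reach forwarder: {exc}")
```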
Then, on the aggregator node, I execute:
# iptables -A INPUT -m statistic --mode random --probability 0.8 --source forwarder.node.ip.address -j DROP
On the forwarder node, I execute the following curl request in a loop:
# while true; do timeout 2 curl -s http://localhost:24220/api/plugins.json > /dev/null && echo ok || echo failure; sleep 1; done
After some time it starts printing "failure".
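The same probe can be expressed without curl, which also makes it easy to record how long each request actually took. This is only a sketch mirroring the shell loop above; the URL comes from the monitor_agent config, and `probe` is a name invented here.

```python
import time
import urllib.error
import urllib.request

# Time one request to the monitor_agent endpoint and report "ok" or
# "failure", like the shell loop does, but with the elapsed time attached.
URL = "http://127.0.0.1:24220/api/plugins.json"  # from the config above

def probe(url, timeout=2.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            pass
        return time.monotonic() - start, None
    except OSError as exc:  # URLError, timeouts, refused connections
        return time.monotonic() - start, exc

elapsed, err = probe(URL)
print(f"{'failure' if err else 'ok'} after {elapsed:.2f}s")
```

Run in a loop, the elapsed times show whether the endpoint is merely slow or not answering at all within the timeout.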
When I flush the iptables rules on the aggregator node with
iptables -F
it goes back to normal.
It does not happen every time, but it does happen in a large percentage of cases.
td-agent 2.5 is not affected.
I also noticed that Docker services that send logs to the forwarder sometimes stop responding as well, but I have not been able to reproduce that in my test environment yet.
Thanks.
Regards, Sergey
We also have the same problem with v1.2.5. The delay seems to depend on the value of send_timeout. The Prometheus plugin is delayed as well.
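The observation that the delay tracks send_timeout is consistent with the status handler and the forward flush sharing a worker. The toy model below is an assumption about the mechanism, not Fluentd's actual code: when a flush is stuck waiting on a slow aggregator, a status request queued behind it cannot be answered until the send times out.

```python
import queue
import threading
import time

# Toy model (an assumption, not Fluentd's implementation): one worker
# serves both forward flushes and status requests. A flush blocked on a
# slow aggregator delays the status response by roughly the send timeout.
SEND_TIMEOUT = 0.5  # stand-in for the forwarder's send_timeout

jobs = queue.Queue()

def worker():
    while True:
        fn = jobs.get()
        fn()
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

status_answered_after = []
start = time.monotonic()

def slow_flush():
    time.sleep(SEND_TIMEOUT)  # aggregator is dropping packets; send blocks

def status_page():
    status_answered_after.append(time.monotonic() - start)

jobs.put(slow_flush)
jobs.put(status_page)
jobs.join()
print(f"status answered after {status_answered_after[0]:.2f}s")
```

In this model the status latency is roughly SEND_TIMEOUT rather than near zero, matching the reported dependence on send_timeout.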
Hmm... we changed the buffer/output implementation, and that may be the cause of this problem. Will check soon.
After switching our forwarders to td-agent 2.5, we never again had timeout issues with our Docker-based TCP services that send logs to Fluentd. So I believe this issue affected Docker services as well.