
Fluentd forwarder status page is displayed with a huge delay when aggregator node is responding slowly

sergeyarl opened this issue on Sep 23, 2018 · 3 comments

OS: CentOS 7
Fluentd version: td-agent-3.2.0-0.el7.x86_64

When the aggregator node is failing or responding very slowly under heavy load, it can take up to 1-2 minutes to fetch the status page /api/plugins.json on a forwarder node.

Steps to reproduce

Forwarder config

<source>
  @type monitor_agent
  bind 127.0.0.1
  port 24220
</source>

<source>
  @type forward
  bind 127.0.0.1
  port 24224
</source>

<match **>
  @type forward

  heartbeat_type tcp
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  # increased this while testing
  phi_threshold 160000
  hard_timeout 120s

  <server>
    name logs1
    host 172.31.3.5
    port 8889
    weight 60
  </server>

  flush_interval 10s

  buffer_type file
  buffer_path /var/log/fluentd/buffer/forward
  buffer_chunk_limit 4m
  buffer_queue_limit 4096
  num_threads 2
  expire_dns_cache 600
</match>
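
For reference, td-agent 3.x bundles Fluentd v1, where the buffer_* parameters above are legacy v0.12 compatibility options. A rough v1-style equivalent of the same output section (a sketch only, with the values copied from the config above and the buffer path unchanged) would look like this:

<match **>
  @type forward

  heartbeat_type tcp
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  phi_threshold 160000
  hard_timeout 120s
  expire_dns_cache 600

  <server>
    name logs1
    host 172.31.3.5
    port 8889
    weight 60
  </server>

  <buffer>
    @type file
    path /var/log/fluentd/buffer/forward
    chunk_limit_size 4m
    queue_limit_length 4096
    flush_interval 10s
    flush_thread_count 2
  </buffer>
</match>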

I have a service continuously send logs to the forwarder.

Then, on the aggregator node, I execute

# iptables -A INPUT -m statistic --mode random --probability 0.8 --source forwarder.node.ip.address -j DROP
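
(The rule above randomly drops about 80% of incoming packets from the forwarder, which simulates a failing or slow aggregator. As an alternative sketch, if the goal is a slowly responding rather than lossy aggregator, latency can be injected with netem instead; the interface name eth0 here is only an assumption.)

# tc qdisc add dev eth0 root netem delay 2000ms
# tc qdisc del dev eth0 root   # remove the delay afterwards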

On the forwarder node I run the following curl request in a loop

# while true; do timeout 2 curl -s http://localhost:24220/api/plugins.json > /dev/null && echo ok || echo failure; sleep 1; done

After some time it starts showing "failure".
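
To see how large the delay actually gets, rather than just pass/fail, a variation of the same loop can print curl's total response time (just a sketch of the same probe against the monitor_agent endpoint):

# while true; do curl -s -o /dev/null -w '%{time_total}s\n' --max-time 120 http://localhost:24220/api/plugins.json; sleep 1; done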

When I flush the iptables rules on the aggregator node with

iptables -F

it gets back to normal.

It does not happen every time, but it does in a fairly large percentage of cases.

td-agent 2.5 is not affected.

I also noticed that Docker services that send logs to the forwarder sometimes stop responding as well, but I have not been able to reproduce that in my test environment yet.

Thanks.

Regards, Sergey

— sergeyarl, Sep 23 '18 12:09

We also have the same problem with v1.2.5. The delay seems to depend on the value of send_timeout. The Prometheus plugin is delayed as well.

— summerwind, Oct 05 '18 13:10

Hmm... we changed the buffer/output implementation, and that may be causing this problem. Will check soon.

— repeatedly, Oct 05 '18 13:10

After switching the forwarders to td-agent 2.5, we never had timeout issues with our Docker-based TCP services that send logs to fluentd, so I believe this issue affected Docker services as well.

— sergeyarl, Oct 29 '18 09:10