
Filebeat stays disconnected after logstash performance problems (even long after those are resolved)

Open Mekk opened this issue 5 years ago • 13 comments

I am not sure whether this is sufficient for detailed analysis, but I decided to report the problem so somebody could take a look at filebeat behaviour with this context in mind.

It is a somewhat strange situation, but I observed it on many machines (filebeat 7.5.2, logstash also 7.5.2):

  • logstash (to which the filebeats connect) faced big performance problems (it was due to inefficient XML parsing, https://github.com/elastic/logstash/issues/11599, but that is only the context of this issue; if I were to test this behaviour on purpose, I'd probably write some busy loop in a ruby script triggered from a filter)

  • sooner or later filebeat started to report timeouts (correctly so, since logstash didn't manage to handle the communication fast enough)

  • … but for some reason filebeat remained in this state forever. Even long after logstash was restarted and the problem it faced was resolved, the running filebeat instance never recovered (the instance I forgot to restart was still disconnected more than 24 hours after the problem was resolved).

Restarting filebeat helped, but there is something wrong in the fact that it didn't manage to recover by itself (after all, in the normal case of a logstash restart or temporary inaccessibility I never faced similar problems).

A picture from the logs. Here is where the problem started:

2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:34.984+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55824->10.92.23.57:5044: read: connection reset by peer
2020-02-12T19:44:35.025+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected
2020-02-12T19:44:36.909+0100    ERROR   pipeline/output.go:121  Failed to publish events: client is not connected
2020-02-12T19:44:36.909+0100    INFO    pipeline/output.go:95   Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-12T19:44:40.506+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:40.506+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 1 reconnect attempt(s)
2020-02-12T19:44:47.796+0100    ERROR   pipeline/output.go:100  Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused
2020-02-12T19:44:47.796+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 2 reconnect attempt(s)
…
2020-02-12T19:46:18.110+0100    INFO    pipeline/output.go:93   Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 5 reconnect attempt(s)
2020-02-12T19:46:18.111+0100    INFO    pipeline/output.go:105  Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-12T19:46:50.516+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 16.193.144.102:55834->10.92.23.57:5044: i/o timeout
2020-02-12T19:46:50.556+0100    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected

This is all OK: logstash was massacring the CPU and was likely unable to keep up with connections.

But then the problem was resolved early the next day and logstash was restarted; the filebeat instances that were restarted have worked happily since then. But the filebeat instance that remained unrestarted still doesn't push logs, and keeps logging things like this (a random snippet taken more than 24 hours after the problem was resolved):

2020-02-14T16:30:47.871+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:30:47.872+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:17.920+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33111->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:17.967+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:31:19.305+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:31:19.306+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:19.307+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:31:49.350+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33112->10.92.23.57:5044: i/o timeout
2020-02-14T16:31:49.391+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:31:50.807+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:31:50.807+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:31:50.808+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:20.854+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33113->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:20.896+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:32:22.755+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:32:22.755+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:22.756+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:32:52.804+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33114->10.92.23.57:5044: i/o timeout
2020-02-14T16:32:52.844+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:32:54.005+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:32:54.005+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:32:54.007+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established
2020-02-14T16:33:24.052+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: read tcp 16.193.144.102:33115->10.92.23.57:5044: i/o timeout
2020-02-14T16:33:24.095+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: client is not connected
2020-02-14T16:33:25.708+0100	ERROR	pipeline/output.go:121	Failed to publish events: client is not connected
2020-02-14T16:33:25.708+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://logstash.test.local:5044))
2020-02-14T16:33:25.710+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://logstash.test.local:5044)) established

and so on, and so on, and so on, until a restart (after which everything started to work OK).

This is the extreme case, but in general any filebeat instance which started to report errors like the above had to be restarted.

To my naive eye it looks as if something got desynchronized here, as if newly established connections were closed because of stale error notes or a backlog of old errors, or something like that.

The error context may be significant because of the specific behaviour: upstream connections were not closed by the remote side, but simply were not handled (maybe some buffers filled up, etc.).

PS It may or may not matter that I use a few input sections and that there are plenty of log files.

Mekk avatar Feb 14 '20 17:02 Mekk

Also reported with Auditbeat 7.6.0 and Logstash 7.4.2 in https://github.com/elastic/beats/issues/16864.

jsoriano avatar Mar 06 '20 10:03 jsoriano

Pinging @elastic/integrations-services (Team:Services)

elasticmachine avatar Mar 06 '20 10:03 elasticmachine

This discuss topic could be related too: https://discuss.elastic.co/t/filebeat-performance-stall-sometimes/222207

jsoriano avatar Mar 06 '20 10:03 jsoriano

While events are actively being processed by Logstash, Logstash sends a 'heartbeat' signal to Filebeat every 10s, I think. Filebeat times out the connection if no signal was received for the last 30s (see the setting: output.logstash.timeout).
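
For reference, these knobs live under output.logstash in filebeat.yml. A rough sketch, with the host taken from the logs above and the values being the documented defaults as far as I know:

output.logstash:
  hosts: ["logstash.test.local:5044"]
  # Seconds to wait for a response (including the keep-alive signal)
  # before the connection times out; the default of 30 matches the
  # ~30s gap between "established" and "i/o timeout" in the logs.
  timeout: 30
  # Reconnect backoff, doubling from init up to max.
  backoff.init: 1s
  backoff.max: 60s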

After the timeout we're even seeing: Failed to connect to backoff(async(tcp://logstash.test.local:5044)): dial tcp 10.92.23.57:5044: connect: connection refused. Filebeat can't reconnect. Either there are network problems, or a connection limit was reached in Logstash. Maybe the workers (not sure, but I think Logstash uses a separate worker pool for beats communication) didn't get CPU time from the OS.
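
On the Logstash side, the beats input also closes connections it considers idle; if I remember correctly the relevant option is client_inactivity_timeout (60s by default). A sketch of what that looks like in the pipeline config:

input {
  beats {
    port => 5044
    # Connections with no activity for this many seconds are closed.
    # A CPU-starved Logstash may fail to service a connection within
    # this window even though the peer is not actually idle.
    client_inactivity_timeout => 60
  }
}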

urso avatar Mar 06 '20 12:03 urso

Same problem with Filebeat and Logstash 7.7.1. This is serious trouble if there is periodic work on the network that makes Logstash unavailable for a while.

M0rdecay avatar Aug 03 '20 07:08 M0rdecay

Just hit this very same problem here... this is a silent failure that caused us to miss days of logs, because filebeat neither failed hard (which would have forced the init.d/systemd subsystem to restart it) nor logged or retried the connection.

We do have the output.logstash.timeout setting, but it seems to have no effect in this specific scenario.

filebeat v7.5.2

danlsgiga avatar Feb 25 '21 18:02 danlsgiga

bump, impacts filebeat 6.8 as well.

seadub avatar May 06 '21 15:05 seadub

Also seeing this in filebeat 7.15, deployed as a Task in AWS ECS with an image from the ECR public gallery.

https://gallery.ecr.aws/elastic/filebeat

ERROR [logstash] logstash/async.go:280 Failed to publish events caused by: client is not connected

whyayala avatar Nov 01 '21 21:11 whyayala

Hi, we have the same problem with filebeat and winlogbeat 7.16.3. We tried every combination of firewall rules (port 5044 to and from the logstash server, a rule with all ports open). Creating these FW rules seemed to resolve the problem, so we started to narrow them down. Now we have a situation where some clients (VLAN1) work perfectly without any additional FW rule (only 5044 allowed from the client to the server), and we have clients (VLAN2) which won't work after a logstash restart even though all ports are open. This is a very blocking issue for us. Can someone help?

barionoperators avatar Feb 02 '22 15:02 barionoperators

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Mar 31 '22 13:03 elasticmachine

Hi, same issue here: ERROR logstash/async.go:256 Failed to publish events caused by: write tcp 10.2.... I'm trying to run filebeat as a sidecar in a Helm chart. I get the error whenever there is a JSON parse error, but other than that the logs are not showing up in Elastic; I can't tell if they reach logstash.

Carmelly212 avatar Jun 07 '22 09:06 Carmelly212

Hi, we have the same issue; I describe it here: https://discuss.elastic.co/t/filebeat-connection-error-with-logstash/327208

sebglon avatar Mar 08 '23 08:03 sebglon

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough 👍. Thank you for your contribution!

botelastic[bot] avatar Mar 07 '24 08:03 botelastic[bot]

👍

henrikno avatar Jul 11 '24 22:07 henrikno

Reminder of the test scenario:

  • provoke a situation where logstash is CPU-starved (in my case it was doing something very CPU-intensive during log processing; in the real case it was an XML parser mishap, but anything would likely do: just write a ruby filter which does some calculations for 30 seconds for ⅓ of the entries; see the sketch after this list)

  • wait until filebeat reports connection problems (while logstash is working and doing that slow CPU-intensive processing)

  • "fix" the problem (remove the CPU-intensive filter) and restart logstash

  • observe whether filebeat manages to reconnect
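
For the first step, a minimal sketch of such a filter (hypothetical and untested, using the stock ruby filter in the Logstash pipeline config):

filter {
  ruby {
    code => '
      # Busy-loop for ~30 seconds on roughly one third of the events,
      # starving Logstash (and its beats input) of CPU time.
      if rand < 1.0 / 3
        deadline = Time.now + 30
        Math.sqrt(rand) while Time.now < deadline
      end
    '
  }
}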

Mekk avatar Jul 12 '24 06:07 Mekk