fluent-plugin-papertrail icon indicating copy to clipboard operation
fluent-plugin-papertrail copied to clipboard

Plugin does not handle idle disconnects in a timely fashion. Results in dropped logs

Open UserNotFound opened this issue 2 years ago • 3 comments
trafficstars

The Papertrail documentation says idle connection timeouts should be handled by the client, ie this plugin (emphasis mine):

Papertrail automatically closes idle TCP connections after 15 minutes of inactivity. Log senders should already handle this as part of a full implementation, since connections are routinely affected by Internet reachability changes. Senders can elect to immediately reconnect, or leave the connection closed until new log messages are available.

This plugin does retry, but the behavior is still rather impactful for delivery:

  • plugin makes a socket connection to papertrail
  • some time passes without logging, and papertrail closes the connection due to the idle timeout
  • logs are generated, and the plugin attempts to write to the socket
  • after 10+ minutes, (possibly only triggered by continued writes) the socket write times out
  • a new socket is created, and the chunk is "retried successfully", but log messages are missing on Papertrail

When I started debugging this, I thought it was just delayed messages, but I'm also finding that some messages are dropped entirely during this process. Is it possible to handle this more gracefully, like with a tcp keepalive?

This syslog output plugin seems to do so (the Socket changes are here)


Steps to reproduce:

  1. configure fluentd to capture logs and send them via this plugin to papertrail
  2. Log infrequently, eg:
root@89f331735970:/app# date; sleep 3600; date
Tue Feb 14 16:39:04 UTC 2023
Tue Feb 14 17:39:04 UTC 2023
root@89f331735970:/app# while true; do date; sleep 10; done
Tue Feb 14 17:55:03 UTC 2023
Tue Feb 14 17:55:13 UTC 2023
Tue Feb 14 17:55:23 UTC 2023
Tue Feb 14 17:55:33 UTC 2023
Tue Feb 14 17:55:43 UTC 2023
  1. See that the log lines for "17:39:04" "17:55:03" and "17:55:13" are dropped and not received in papertrail Screen Shot 2023-02-14 at 10 58 38

  2. Confirm in the fluentd logs that there was a socket write timeout error

2023-02-14 16:14:50 +0000 [info]: #0 initializing tcp socket for logs3.papertrailapp.com:redacted_port
2023-02-14 17:55:24 +0000 [warn]: #0 failed to flush the buffer. retry_times=0 next_retry_time=2023-02-14 17:55:25 +0000 chunk="5f4acaafd238fdba7d32b4b3b001c59f" error_class=Fluent::Papertrail::SocketFailureError error="Error writing to logs3.papertrailapp.com:redacted_port: Operation timed out"
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/2.7.0/openssl/buffering.rb:322:in `syswrite'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/2.7.0/openssl/buffering.rb:322:in `do_write'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/2.7.0/openssl/buffering.rb:409:in `puts'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-papertrail-0.2.8/lib/fluent/plugin/out_papertrail.rb:133:in `send_to_papertrail'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-papertrail-0.2.8/lib/fluent/plugin/out_papertrail.rb:45:in `block in write'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/event.rb:315:in `each'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/event.rb:315:in `block in each'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/buffer/memory_chunk.rb:81:in `open'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/buffer/memory_chunk.rb:81:in `open'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/event.rb:314:in `each'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-papertrail-0.2.8/lib/fluent/plugin/out_papertrail.rb:39:in `write'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/compat/output.rb:131:in `write'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1179:in `try_flush'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1500:in `flush_thread_run'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
  2023-02-14 17:55:24 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
 2023-02-14 17:55:25 +0000 [info]: #0 initializing tcp socket for logs3.papertrailapp.com:redacted
2023-02-14 17:55:25 +0000 [warn]: #0 retry succeeded. chunk_id="5f4acaafd238fdba7d32b4b3b001c59f"

UserNotFound avatar Feb 14 '23 18:02 UserNotFound

I believe we are seeing this behaviour as well. @UserNotFound did you manage to figure out a workaround for your systems in the meantime?

jwreford99 avatar Mar 22 '23 14:03 jwreford99

I believe we are seeing this behaviour as well. @UserNotFound did you manage to figure out a workaround for your systems in the meantime?

@jwreford99 Sorry for the delayed reply -- I had not found a fix other than to never stop logging 😄 I was planning on checking out other Syslog+TLS plugins which might actually be maintained (eg https://github.com/zarqman/fluent-plugin-syslog-tls or https://github.com/fluent-plugins-nursery/fluent-plugin-remote_syslog) before resorting to trying to fix it myself.

I was ecstatic to see your PR, though!!! When I have some time to dedicate to it, I'll give it a try.

Did you glean any information about whether we can expect movement on reviewing/merging your contribution?

UserNotFound avatar Apr 26 '23 21:04 UserNotFound

No afraid not, I did that PR just in case there were some maintainers still somewhat active on this project that might be able to review code but didn't want to write it themselves. We have been running a fork of this for a little while now and it hasn't completely solved all of the issues of dropped connections but it has solved a significant majority (and we think the remaining issues are due to connectivity issues).

I had also looked at other alternatives but found that https://github.com/fluent-plugins-nursery/fluent-plugin-remote_syslog didn't solve our problems appropriately because it batched up log messages of the same severity, leading to out of order logs that were of different severities. I had not been aware of https://github.com/zarqman/fluent-plugin-syslog-tls/tree/master at all, but it looks like it doesn't fulfil our purposes as it is not a buffered output plugin, this means there is no file queue and no concept of retries to PT which (given our experience with PT connectivity issues) is fairly essential. Given that we have ended up sticking with this plugin

jwreford99 avatar Apr 27 '23 09:04 jwreford99