connect_write timeout until restarting Fluentd
Describe the bug
Hello,
This is a follow-up of fluent/fluentd#1844
Environment:
- 50+ nodes sending logs to OpenSearch through Fluentd.
- All nodes send only basic systemd logs.
I observe that sometimes, at a random time, Fluentd is no longer able to contact the OpenSearch cluster and reports a write timeout.
All subsequent auto-retries fail in the same way with the same error, while the other nodes continue, at the SAME time, to send their logs successfully.
But the curious thing is, when I restart Fluentd, it begins the shutdown by flushing the buffer and ... it works, whereas all the previous auto-retries failed.
I have tried many timeout-related parameters, but I don't understand why Fluentd suddenly says "I got a timeout while writing to your server" many times (1, 2, up to 40 times!) yet, when I restart it, this very push succeeds.
Do you have any idea?
To Reproduce
I don't know precisely how to reproduce it. On my side, the problem occurs randomly, not at a specific time after start, which is disturbing.
Expected behavior
Logs should be flushed successfully, because it works on shutdown, and my 50+ other nodes never fail.
Your Environment
- Fluentd version: 1.16.9
- Package version: 5.0.7-1
- Operating system: Rocky Linux 9
- Kernel version: 4.18.0-553.51.1.el8_10.x86_64
Your Configuration
@include conf.d/*.conf

<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    log_type ${tag}
    server_name "#{Socket.gethostname}"
  </record>
</filter>

<match **>
  @type opensearch
  host xxx
  port 443
  scheme https
  user xxx
  password xxxx
  path /es
  logstash_format true
  ssl_verify true
  request_timeout 300s
  <buffer>
    @type file
    path /var/log/fluent/buffer
    flush_interval 5s
    chunk_limit_size 32m
    total_limit_size 1g
  </buffer>
</match>
In the included files:
<source>
  @type systemd
  @id input_systemd
  path /run/log/journal
  tag systemd
  <storage>
    @type local
    path /var/log/fluent/fluentd-systemd.json
  </storage>
</source>

<filter systemd>
  @type grep
  <exclude>
    key _SYSTEMD_UNIT
    pattern /^mega-exporter\.service$/
  </exclude>
</filter>

<filter systemd>
  @type record_transformer
  renew_record true
  keep_keys SYSLOG_IDENTIFIER, MESSAGE
</filter>

<source>
  @type tail
  tag httpd.access
  path /var/log/httpd/*access_log,/var/www/*/logs/*access_log
  pos_file /var/log/fluent/httpd-access.log.pos
  format apache2
  path_key log_path
</source>

<source>
  @type tail
  tag httpd.errors
  path /var/log/httpd/*error_log,/var/www/*/logs/*error_log
  pos_file /var/log/td-agent/httpd-error.log.pos
  format apache_error
  path_key log_path
</source>

<filter httpd.errors>
  @type record_transformer
  enable_ruby true
  remove_keys pid
  <record>
    client_ip ${record["client"] ? record["client"].split(":")[0] : nil}
  </record>
</filter>

<filter httpd.**>
  @type record_transformer
  enable_ruby true
  <record>
    domain ${record["log_path"] ? record["log_path"].split('/').last.gsub(/-(access|error)_log$/, '') : nil}
  </record>
</filter>
Your Error Log
2025-05-18 06:38:57 +0200 [warn]: #0 failed to flush the buffer. retry_times=15 next_retry_time=2025-05-18 15:20:41 +0200 chunk="63559b64af2d4b9db721c9907294a3cc" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"xxx\", :port=>443, :scheme=>\"https\", :user=>\"xxx\", :password=>\"obfuscated\", :path=>\"/"}): connect_write timeout reached"
Additional context
Forcing a flush, as in:
2025-05-18 03:53:21 +0200 [info]: #0 flushing all buffer forcedly
does not fix the issue.
@henri9813 Thanks for your report!
But the curious thing is, when I restart Fluentd, it begins the shutdown by flushing the buffer and ... it works, whereas all the previous auto-retries failed.
Could you please tell me which of the following is it?
- Flush succeeds when Fluentd is stopping.
- Flush succeeds when Fluentd is starting.
If it is the latter, reconnecting may restore the communication.
We should consider adding a feature to the out_opensearch plugin to reconnect without having to restart Fluentd.
Hello,
Good question! I think it's both: when I do systemctl restart, the flush is in progress, but I don't know precisely.
Often, I do it very quickly because fluentd.log grows very large (~1 GB/hour) due to the buffer running out of space.
I think it's both: when I do systemctl restart, the flush is in progress, but I don't know precisely.
I think it's probably the latter (flush succeeds when Fluentd is starting), because you don't use the flush_at_shutdown option.
So, reconnecting would be necessary to restore the communication.
You can use the reconnect_on_error option of the opensearch plugin!
Please try this option!
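For reference, a minimal sketch of how these two options could be added to the match block from the configuration above (values are illustrative, other settings unchanged):

<match **>
  @type opensearch
  # ... existing host/port/auth settings ...
  # reset the connection on any error, not only on "host unreachable"
  reconnect_on_error true
  <buffer>
    @type file
    path /var/log/fluent/buffer
    # flush remaining buffered chunks when Fluentd stops
    flush_at_shutdown true
    flush_interval 5s
    chunk_limit_size 32m
    total_limit_size 1g
  </buffer>
</match>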
@henri9813 this issue was automatically closed because it did not follow the issue template.
I have moved this issue to the fluent-plugin-opensearch repo.
Hello,
It seems OK.
Thanks for the documentation; I found a post mentioning
reconnect_on_error true
reload_on_failure true
reload_connections false
which I used without any further research, and it works.
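For context, a sketch of where those three options sit in the match block from the original configuration (connection and buffer settings kept as posted, only the three lines added):

<match **>
  @type opensearch
  # ... host/port/auth and buffer settings as in the original configuration ...
  reconnect_on_error true
  reload_on_failure true
  reload_connections false
</match>

Roughly, per the plugin documentation, reload_on_failure makes the client refresh its connection list after a request failure, while reload_connections false disables the periodic host-list reload (the sniffing behavior mentioned later in this thread).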
But when I look at your documentation, something is disturbing:
Indicates that the plugin should reset connection on any error (reconnect on next send). By default it will reconnect only on "host unreachable exceptions". We recommended to set this true in the presence of opensearch shield.
Then why is this option not true by default?
What are the advantages of keeping it off by default?
Best regards
This is because OpenSearch is typically spun up in the AWS managed environment. If it were enabled by default, many users would complain about the default settings because of the sniffing feature from opensearch-ruby, which was originally created in elasticsearch-ruby. So we chose to make that parameter false by default.
Hello,
OK, I think it's a bit sad to choose a parameter's default by assuming OpenSearch = AWS rather than "OpenSearch is open source, backed by the AWS team, but can be hosted anywhere."
I do not use OpenSearch under AWS ;-)
We have our own infrastructure and data centers.
Best regards.
So, you might have to turn on the reload_connections parameter.
Most users of the OpenSearch plugin use it under AWS OpenSearch, so we just provide this plugin for the most common use cases.