
connect_write timeout until restarting Fluentd

Open henri9813 opened this issue 7 months ago • 10 comments

Describe the bug

Hello,

This is a follow up of fluent/fluentd#1844

Environment:

  • 50+ nodes sending logs to OpenSearch through Fluentd.
  • All nodes send only basic systemd logs.

I observe that "sometimes" (at a random time), Fluentd becomes unable to contact the OpenSearch cluster and reports a connect_write timeout.

All subsequent automatic retries fail in the same way, with the same error, while the other nodes continue, at the SAME time, to send their logs successfully.

But the curious thing: when I restart Fluentd, it begins the shutdown by flushing the buffer and... it works, whereas all the previous automatic retries failed.

I have tried many timeout-related parameters, but I don't understand why Fluentd suddenly says "I got a timeout while writing to your server" many times (1, 2, up to 40 times!), yet when I restart it, the very same push succeeds.

Do you have an idea ?

To Reproduce

I don't know precisely how to reproduce it. On my side, the problem occurred randomly, not at a specific time after startup, which is disturbing.

Expected behavior

Logs should be flushed successfully, because the flush on shutdown works and my 50+ other nodes never fail.

Your Environment

- Fluentd version: 1.16.9
- Package version: 5.0.7-1
- Operating system: Rocky Linux 9
- Kernel version: 4.18.0-553.51.1.el8_10.x86_64

Your Configuration

@include conf.d/*.conf

<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    log_type ${tag}
    server_name "#{Socket.gethostname}"
  </record>
</filter>

<match **>
  @type opensearch
  host xxx
  port 443
  scheme https

  user xxx
  password xxxx

  path /es

  logstash_format true

  ssl_verify true

  request_timeout 300s
  <buffer>
    @type file
    path /var/log/fluent/buffer
    flush_interval 5s
    chunk_limit_size 32m
    total_limit_size 1g
  </buffer>
</match>


in the included files

<source>
  @type systemd
  @id input_systemd
  path /run/log/journal
  tag systemd

  <storage>
    @type local
    path /var/log/fluent/fluentd-systemd.json
  </storage>
</source>

<filter systemd>
  @type grep
  <exclude>
    key _SYSTEMD_UNIT
    pattern /^mega-exporter\.service$/
  </exclude>
</filter>

<filter systemd>
  @type record_transformer
  renew_record true
  keep_keys SYSLOG_IDENTIFIER, MESSAGE
</filter>

<source>
  @type tail
  tag httpd.access
  path /var/log/httpd/*access_log,/var/www/*/logs/*access_log
  pos_file /var/log/fluent/httpd-access.log.pos
  format apache2 
  path_key log_path
</source>

<source>
  @type tail
  tag httpd.errors
  path /var/log/httpd/*error_log,/var/www/*/logs/*error_log
  pos_file /var/log/td-agent/httpd-error.log.pos
  format apache_error
  path_key log_path
</source>

<filter httpd.errors>
  @type record_transformer
  enable_ruby true
  remove_keys pid
  <record>
    client_ip ${record["client"] ? record["client"].split(":")[0] : nil}
  </record>
</filter>

<filter httpd.**>
  @type record_transformer
  enable_ruby true
  <record>
    domain ${record["log_path"] ? record["log_path"].split('/').last.gsub(/-(access|error)_log$/, '') : nil}
  </record>
</filter>

Your Error Log

2025-05-18 06:38:57 +0200 [warn]: #0 failed to flush the buffer. retry_times=15 next_retry_time=2025-05-18 15:20:41 +0200 chunk="63559b64af2d4b9db721c9907294a3cc" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"xxx\", :port=>443, :scheme=>\"https\", :user=>\"xxx\", :password=>\"obfuscated\", :path=>\"/"}): connect_write timeout reached"

Additional context

2025-05-18 03:53:21 +0200 [info]: #0 flushing all buffer forcedly

does not fix the issue.

henri9813 avatar May 18 '25 08:05 henri9813

@henri9813 Thanks for your report!

But the curious things, when i restart fluentd, it begin the shutdown by flushing the buffer and ... it works. whereas all the previous auto-retry fails.

Could you please tell me which of the following is it?

  • Flush succeeds when Fluentd is stopping.
  • Flush succeeds when Fluentd is starting.

daipom avatar May 20 '25 01:05 daipom

If it is the latter, reconnecting may restore the communication. We should consider adding a feature to the out_opensearch plugin to reconnect without having to restart Fluentd.

daipom avatar May 20 '25 01:05 daipom

Hello,

Good question! I think it's both: when I run systemctl restart, the flush is in progress, but I don't know precisely.

Often I do it very quickly, because fluentd.log grows very large (~1 GB/hour) due to the lack of space in the buffer.

henri9813 avatar May 20 '25 12:05 henri9813

I think it's both, when i do systemctl restart the flush is in progress, but don't know precisely.

I think it's probably the latter (flush succeeds when Fluentd is starting), because you don't use the flush_at_shutdown option.

So, reconnecting would be necessary to restore the communication. You can use the reconnect_on_error option of the opensearch plugin! Please try this option!
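
For illustration, this is a minimal sketch of how the suggested option could be added to the existing match block from the reporter's configuration (host, credentials, and the buffer section are unchanged placeholders from above; the sketch is an assumption, not a configuration posted in this thread):

<match **>
  @type opensearch
  host xxx
  port 443
  scheme https

  # Reset the connection on any error and reconnect on the next send,
  # instead of reconnecting only on "host unreachable" exceptions.
  reconnect_on_error true

  # ... remaining options and <buffer> section as before ...
</match>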

daipom avatar May 21 '25 01:05 daipom

@henri9813 this issue was automatically closed because it did not follow the issue template.

github-actions[bot] avatar May 21 '25 01:05 github-actions[bot]

I have moved this issue to fluent-plugin-opensearch repo.

daipom avatar May 21 '25 01:05 daipom

Hello,

It seems OK.

Thanks for the documentation. I found a post mentioning

reconnect_on_error true
reload_on_failure true
reload_connections false

which I used without any further research, and it works.

But when I look at your documentation, something is puzzling:

Indicates that the plugin should reset connection on any error (reconnect on next send). By default it will reconnect only on "host unreachable exceptions". We recommended to set this true in the presence of opensearch shield.

Then why is this option not true by default?

What are the advantages of keeping it off by default?

Best regards

henri9813 avatar May 21 '25 08:05 henri9813

This is because OpenSearch is typically spun up in an AWS managed environment. If that parameter were enabled by default, many users would complain about the default settings because of the sniffing feature from opensearch-ruby, which was originally created for elasticsearch-ruby. So we chose false as that parameter's default value.

cosmo0920 avatar May 21 '25 08:05 cosmo0920

Hello,

OK, I think it's a bit sad to choose a parameter default by assuming OpenSearch = AWS, rather than "OpenSearch is open source, backed by the AWS team, but can be hosted anywhere."

I do not use OpenSearch under AWS ;-)

We have our own infrastructure & data centers.

Best regards.

henri9813 avatar May 21 '25 10:05 henri9813

So, you might have to turn on the reload_connections parameter. Most users of the OpenSearch plugin use it under AWS OpenSearch, so we just provide this plugin with defaults for the most common use cases.
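
As a sketch for a self-hosted cluster (not AWS managed OpenSearch), the connection-handling options discussed in this thread could be combined as below. Whether enabling reload_connections is safe depends on an assumption: the node addresses the cluster advertises during sniffing must be reachable from Fluentd.

<match **>
  @type opensearch
  # ... host, credentials, and <buffer> as in the configuration above ...

  # Reconnect on any error, not only on "host unreachable" exceptions.
  reconnect_on_error true
  # Reload the connection list when a request fails.
  reload_on_failure true
  # Periodically re-discover (sniff) cluster nodes; workable on
  # self-hosted clusters, often problematic behind managed endpoints.
  reload_connections true
</match>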

cosmo0920 avatar May 21 '25 10:05 cosmo0920