fluent-plugin-opensearch
If fluent-plugin-opensearch fails to refresh `@_aws_credentials`, it never refreshes `@_aws_credentials` again
(check apply)
- [x] read the contribution guideline
- [ ] (optional) already reported 3rd party upstream repository or mailing list if you use k8s addon or helm charts.
Steps to replicate
There are no reliable steps to replicate.
When the plugin fails to refresh `@_aws_credentials`, it logs an error like the following:
2024-02-23 22:16:07 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:out_opensearch_expire_credentials error_class=RuntimeError error="No valid AWS credentials found."
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:252:in `aws_credentials'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:353:in `block (2 levels) in configure'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `synchronize'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `block in configure'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run_once'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
It then stops refreshing and logs the following:
2024-02-23 22:16:07 +0000 [error]: #0 Timer detached. title=:out_opensearch_expire_credentials
As a result, once the current credentials expire, flushing the buffer fails with the error message "The security token included in the request is expired".
FYI: The following is my config, but I don't think this depends on config.
<match apiserver>
  @type copy
  <store>
    @type s3
    <!-- skip -->
  </store>
  <store>
    @type opensearch
    bulk_message_request_threshold 6m
    request_timeout 90s
    resurrect_after 5s
    reload_connections false
    logstash_format true
    logstash_prefix apiserver
    logstash_dateformat %Y.%m.%d
    suppress_type_name true
    time_key time
    include_tag_key true
    tag_key @tag
    id_key _hash
    remove_keys _hash
    <buffer>
      @type file
      path /var/log/fluent/buffer/os/apiserver
      chunk_limit_size 60m
      flush_mode interval
      flush_interval 10s
      flush_at_shutdown true
    </buffer>
    <endpoint>
      url <URL to AWS OpenSearch Service>
      region ap-northeast-1
    </endpoint>
  </store>
</match>
Expected Behavior or What you need to ask
I'm not sure whether this is a bug, but I want fluent-plugin-opensearch to try refreshing `@_aws_credentials` again at the next `refresh_credentials_interval`. I guess `AssumeRoleCredentials.new()` fails if the network is unstable. If this happens, fluent-plugin-opensearch stops sending logs. I'm not happy with this.
The reason fluent-plugin-opensearch stops refreshing `@_aws_credentials` is that `timer_execute()` removes the timer if its block raises an exception.
https://github.com/fluent/fluentd/blob/2b4ca5d2927b706c3bdc98ffd0a0b66232bc0b65/lib/fluent/plugin_helper/timer.rb#L84-L85
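One way to avoid the detach is to rescue the error inside the refresh itself, so the exception never reaches the timer and the next interval gets another chance. The following is only a standalone sketch of that idea, not the plugin's actual code: `DetachingTimer`, `CredentialCache`, and `fetcher` are hypothetical names that simulate the behavior described above, and nothing here calls Fluentd or the AWS SDK.

```ruby
# Hypothetical stand-in for a timer helper that detaches its callback
# once the block raises (the behavior described for timer_execute above).
class DetachingTimer
  attr_reader :detached

  def initialize(&block)
    @block = block
    @detached = false
  end

  # Simulates one timer tick.
  def tick
    return if @detached
    @block.call
  rescue StandardError
    @detached = true
  end
end

# Hypothetical credential holder. `fetcher` stands in for whatever
# obtains fresh credentials (for example, a call to STS).
class CredentialCache
  attr_reader :credentials, :failures

  def initialize(fetcher)
    @fetcher = fetcher
    @credentials = fetcher.call
    @failures = 0
  end

  # Rescue inside the refresh so the error never propagates to the timer.
  # The previous credentials are kept until a later refresh succeeds.
  def refresh
    @credentials = @fetcher.call
  rescue StandardError => e
    @failures += 1
    warn "credential refresh failed, retrying at next interval: #{e.message}"
  end
end

# Simulated fetcher: succeeds, fails once, then succeeds again.
responses = [:token_a, RuntimeError.new("No valid AWS credentials found."), :token_b]
fetcher = lambda do
  result = responses.shift
  raise result if result.is_a?(Exception)
  result
end

cache = CredentialCache.new(fetcher)
timer = DetachingTimer.new { cache.refresh }
2.times { timer.tick }
```

In this simulation the first tick fails and the second succeeds, and the timer stays attached throughout. The trade-off is that the old credentials stay in use until a refresh succeeds, so requests can still fail if the outage outlasts the token lifetime.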
Using Fluentd and OpenSearch plugin versions
- OS version: Amazon Linux 2
- Bare Metal or within Docker or Kubernetes or others?: Bare Metal
- Fluentd v1.0 or later: fluentd 1.16.3
- OpenSearch plugin version: fluent-plugin-opensearch (1.1.4)
- OpenSearch version (optional): 1.3
- OpenSearch template(s) (optional)
We have been running 6 instances with this plugin for about 1 month, and we hit this bug on 3 of the 6. Therefore, this isn't a rare problem.
The same thing is happening to me with the same plugin version.
@ashie san, Could you please confirm if there's any update for this issue?
This is similar to #110; we are experiencing the same issue. In our case, once in a while there is a network timeout in some regions while connecting to STS for the AWS token, which raises the error that stops the timer, with no option to recover other than manually restarting the pods.
FYI: My quick and dirty fix: https://github.com/aYukiSekiguchi/fluent-plugin-opensearch/commits/dont_stop_refresh_aws_credentials/
You can build and install it as follows:
$ fluent-gem build fluent-plugin-opensearch.gemspec
$ sudo fluent-gem install fluent-plugin-opensearch
Hi @aYukiSekiguchi, could you send your patch as a PR? It looks like a good workaround to mitigate this issue.
Sure. I created a PR: https://github.com/fluent/fluent-plugin-opensearch/pull/142
This should be fixed in #142.