If fluent-plugin-opensearch failed to refresh `@_aws_credentials`, it won't refresh `@_aws_credentials` anymore

Open aYukiSekiguchi opened this issue 11 months ago • 7 comments

(check apply)

  • [x] read the contribution guideline
  • [ ] (optional) already reported 3rd party upstream repository or mailing list if you use k8s addon or helm charts.

Steps to replicate

There are no reliable steps to replicate.

It failed to refresh @_aws_credentials with the following error log:

2024-02-23 22:16:07 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:out_opensearch_expire_credentials error_class=RuntimeError error="No valid AWS credentials found."
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:252:in `aws_credentials'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:353:in `block (2 levels) in configure'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `synchronize'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `block in configure'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run_once'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

It then stopped refreshing and logged the following:

2024-02-23 22:16:07 +0000 [error]: #0 Timer detached. title=:out_opensearch_expire_credentials

Therefore, flushing the buffer will later fail with the error message "The security token included in the request is expired".

FYI: The following is my config, but I don't think this depends on config.

<match apiserver>
  @type copy
  <store>
    @type s3
    <!-- skip -->
  </store>
  <store>
    @type opensearch
    bulk_message_request_threshold 6m
    request_timeout 90s
    resurrect_after 5s
    reload_connections false
    logstash_format true
    logstash_prefix apiserver
    logstash_dateformat %Y.%m.%d
    suppress_type_name true
    time_key time
    include_tag_key true
    tag_key @tag
    id_key _hash
    remove_keys _hash
    <buffer>
      @type file
      path /var/log/fluent/buffer/os/apiserver
      chunk_limit_size 60m
      flush_mode interval
      flush_interval 10s
      flush_at_shutdown true
    </buffer>
    <endpoint>
      url <URL to AWS OpenSearch Service>
      region ap-northeast-1
    </endpoint>
  </store>
</match>

Expected Behavior or What you need to ask

I'm not sure whether this is a bug, but I want fluent-plugin-opensearch to refresh @_aws_credentials at the next refresh_credentials_interval. I guess AssumeRoleCredentials.new() fails if the network is unstable. If this happens, fluent-plugin-opensearch stops sending logs. I'm not happy with this.

The reason fluent-plugin-opensearch stops refreshing @_aws_credentials is that timer_execute() removes the timer if its block raises an exception. https://github.com/fluent/fluentd/blob/2b4ca5d2927b706c3bdc98ffd0a0b66232bc0b65/lib/fluent/plugin_helper/timer.rb#L84-L85
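To illustrate the mechanism, here is a minimal, self-contained sketch. It does not use Fluentd itself: `DetachingTimer` and `CredentialCache` are hypothetical stand-ins I made up to mimic the behavior described above (a timer that detaches when its block raises), and the fetcher lambda simulates a transient failure. It only shows one possible mitigation, which is rescuing inside the block so the exception never reaches the timer and the previous credentials are kept until the next tick. It is not necessarily how the plugin should or will be fixed.

```ruby
# Hypothetical stand-in for a timer that detaches itself as soon as its
# callback raises, mimicking the behavior described in this issue.
class DetachingTimer
  attr_reader :detached

  def initialize(&callback)
    @callback = callback
    @detached = false
  end

  # Simulates one timer tick. A detached timer never runs again.
  def tick
    return if @detached
    @callback.call
  rescue StandardError
    @detached = true
  end
end

# Hypothetical credential holder. `fetcher` is any callable that returns
# credentials or raises (for example, on a network failure).
class CredentialCache
  attr_reader :credentials, :failures

  def initialize(fetcher)
    @fetcher = fetcher
    @credentials = nil
    @failures = 0
    @mutex = Mutex.new
  end

  # Refreshes the credentials without letting errors escape to the caller.
  # On failure the previous credentials are kept, so a later tick can retry.
  def refresh
    @mutex.synchronize { @credentials = @fetcher.call }
    true
  rescue StandardError => e
    @failures += 1
    warn "credential refresh failed, will retry on next tick: #{e.message}"
    false
  end
end

# Simulated fetcher (assumption): the second call fails, the others succeed.
attempts = 0
flaky_fetcher = lambda do
  attempts += 1
  raise "No valid AWS credentials found." if attempts == 2
  "token-#{attempts}"
end

cache = CredentialCache.new(flaky_fetcher)
timer = DetachingTimer.new { cache.refresh }
3.times { timer.tick }

# For contrast: a block that lets the error escape detaches the timer.
naive_timer = DetachingTimer.new { raise "No valid AWS credentials found." }
naive_timer.tick
```

In this sketch the rescued timer survives the failed second tick and picks up fresh credentials on the third, while the naive timer is detached after its first failure.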

Using Fluentd and OpenSearch plugin versions

  • OS version: Amazon Linux 2
  • Bare Metal or within Docker or Kubernetes or others?: Bare Metal
  • Fluentd v1.0 or later: fluentd 1.16.3
  • OpenSearch plugin version: fluent-plugin-opensearch (1.1.4)
  • OpenSearch version (optional): 1.3
  • OpenSearch template(s) (optional)

aYukiSekiguchi avatar Feb 27 '24 01:02 aYukiSekiguchi

We have been running 6 instances with this plugin for about 1 month. We faced this bug in 3 out of 6 instances. Therefore, this isn't a rare problem.

aYukiSekiguchi avatar Mar 01 '24 06:03 aYukiSekiguchi

The same thing is happening to me with the same plugin version.

davidpsv17 avatar May 07 '24 10:05 davidpsv17

@ashie san, Could you please confirm if there's any update for this issue?

akhil31415 avatar May 22 '24 11:05 akhil31415

This is similar to #110; we are experiencing the same issue. In our case, once in a while there is a network timeout in some regions while connecting to STS for the AWS token, which raises the error that stops the timer, with no option to recover other than manually restarting the pods.

ntopee avatar Aug 26 '24 10:08 ntopee

FYI: My quick and dirty fix https://github.com/aYukiSekiguchi/fluent-plugin-opensearch/commits/dont_stop_refresh_aws_credentials/

You can build and install it as follows:

$ fluent-gem build fluent-plugin-opensearch.gemspec
$ sudo fluent-gem install fluent-plugin-opensearch

aYukiSekiguchi avatar Aug 26 '24 10:08 aYukiSekiguchi

Hi @aYukiSekiguchi, could you send your patch as a PR? It seems to be a good workaround to mitigate this issue.

cosmo0920 avatar Aug 29 '24 08:08 cosmo0920

Sure. I created a PR: https://github.com/fluent/fluent-plugin-opensearch/pull/142

aYukiSekiguchi avatar Sep 02 '24 12:09 aYukiSekiguchi

This should be fixed in #142.

cosmo0920 avatar Oct 02 '24 08:10 cosmo0920