beats icon indicating copy to clipboard operation
beats copied to clipboard

- Bug/Enhancement: Update metricbeat test modules timeout to be configurable

Open DanH-Semplicity opened this issue 6 months ago • 9 comments

The timeout when running metricbeat test modules <my_module> was intermittently giving an error "ERROR timeout waiting for an event". It seemed to be intermittent since our module was taking between 5 to 6 seconds to respond.

Unfortunately we discovered the timeout was hardcoded to 5 seconds and not configurable. This fix that we tested, will adjust the timeout to be now configurable by setting http timeout, it also appears to default to period if timeout is unset.

Reference Links: https://www.elastic.co/docs/reference/beats/metricbeat/command-line-options#test-command https://www.elastic.co/docs/reference/beats/metricbeat/configuration-metricbeat#_timeout https://www.elastic.co/docs/reference/beats/metricbeat/command-line-options#test-command

Please label this PR with one of the following labels, depending on the scope of your change:

  • Bug
  • Enhancement

Proposed commit message

Update metricbeat test modules timeout to be configurable

Explain here the changes you made on the PR. Updated the command metricbeat test modules <my_module> timeout value instead of heard-coded to be set to the http timeout value.

Please explain:

  • WHAT: patterns used, algorithms implemented, design architecture, message processing, etc. N/A it is simple fix, one line of code.

  • WHY: the rationale/motivation for the changes We were getting this intermittent error when demoing the testing of modules to client and it made metribeat look odd with random event failures, that were due to hard coded timeout.

See Title

Checklist

  • [x ] My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

No, it should have no impact.

Author's Checklist

Review single line code change, it was tested and it works as described and is now configurable.

How to test this PR locally

Go into any metricbeat/module.d/my_module.yml that uses http/https and set the yml file timeout at 1s and you should see an error, that we were encountering.

Related issues

None:

Use cases

Command: metricbeat test modules my_module -v -e Test: Prior to code integration, test hardcoded timeout with response time of an http API request with greater than 5 seconds, then build metricbeat based on this one line code change and it should default my_module.yml time period, if timeout is unset, so you can set period to 1s and then it should fail, adjust period to 10s it should work, then set timeout to 1s it should fail, and adjust to 10s and it should work.

Screenshots

N/A

Logs

metricbeat test modules my_module -v -e
{"log.level":"info","@timestamp":"2025-05-18T12:27:29.940-0600","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).configure","file.name":"instance/beat.go","file.line":1062},"message":"Home path: [/Users/danh/github/danh/beats/metricbeat] Config path: [/Users/danh/github/danh/beats/metricbeat] Data path: [/Users/danh/github/danh/beats/metricbeat/data] Logs path: [/Users/danh/github/danh/beats/metricbeat/logs]","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-05-18T12:27:29.940-0600","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).configure","file.name":"instance/beat.go","file.line":1070},"message":"Beat ID: cc12c183-8e7d-40ce-8e7d-5cfd9395df71","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2025-05-18T12:27:30.102-0600","log.logger":"cfgwarn","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/my_module/health.New","file.name":"health/health.go","file.line":65},"message":"BETA: The my_module health metricset is beta.","service.name":"metricbeat","ecs.version":"1.6.0"}
my_module...
  health...

    error... ERROR timeout waiting for an event

DanH-Semplicity avatar May 18 '25 19:05 DanH-Semplicity

:robot: GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

github-actions[bot] avatar May 18 '25 19:05 github-actions[bot]

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @DanH-Semplicity? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

mergify[bot] avatar May 18 '25 19:05 mergify[bot]

It would be nice if this was back ported to 8.17, we first encountered the error on 8.17.

DanH-Semplicity avatar May 18 '25 19:05 DanH-Semplicity

@DanH-Semplicity, thanks for the PR!

Indeed it makes the test timeout configurable, however I'm not sure using the HTTP timeout as the test timeout is the best option here. Aside from the double meaning for the timeout configuration key, it also prevents the HTTP timeout and the test timeout to be set at the same time and as different values.

I believe the best option is to introduce a new configuration option, however before changing anything on your PR, let me get more feed back from the team.

belimawr avatar May 19 '25 19:05 belimawr

Hi @belimawr,

Interesting, I was thinking of testing the http case timeout, where existing time-outs and defaults were set and should be honored. However I think your indicating a more generic module testing scenerio where perhaps it should be passed as parameter to the test command where it can be configured on test modules invocation (and I would suggest defaults to 10 seconds). https://www.elastic.co/docs/reference/beats/metricbeat/command-line-options#test-command. Let me know what your team comes up with. Either sounds fine to me, but I agree your proposed resolution might be better for non http scenario, since I believe timeout only applies to http and not standard config options.

Thanks Dan

DanH-Semplicity avatar May 19 '25 20:05 DanH-Semplicity

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar May 20 '25 13:05 elasticmachine

Hi @belimawr,

Interesting, I was thinking of testing the http case timeout, where existing time-outs and defaults were set and should be honored. However I think your indicating a more generic module testing scenerio where perhaps it should be passed as parameter to the test command where it can be configured on test modules invocation (and I would suggest defaults to 10 seconds). https://www.elastic.co/docs/reference/beats/metricbeat/command-line-options#test-command. Let me know what your team comes up with. Either sounds fine to me, but I agree your proposed resolution might be better for non http scenario, since I believe timeout only applies to http and not standard config options.

Thanks Dan

Based on where this timeout is in the code, any module is bound by it, not only the ones that make HTTP requests, hence the more generic approach.

@rdner suggestion, even though is more complex, is very interesting and allows for easily setting the timeout via CLI.

belimawr avatar May 21 '25 12:05 belimawr

@belimawr and @rdner, I think @rdner purposed solution is a better overall implementation, due to solving a timer being passed as a parm for all modules (instead of http only). Do you want me to generate the change on this PR or open separate PR to fix? Thanks Dan

DanH-Semplicity avatar May 21 '25 21:05 DanH-Semplicity

@belimawr and @rdner, I think @rdner purposed solution is a better overall implementation, due to solving a timer being passed as a parm for all modules (instead of http only). Do you want me to generate the change on this PR or open separate PR to fix? Thanks Dan

On this PR is fine.

belimawr avatar May 22 '25 14:05 belimawr