beats icon indicating copy to clipboard operation
beats copied to clipboard

[metricbeat] - Allow metricsets to report their status via v2 protocol

Open VihasMakwana opened this issue 1 year ago • 3 comments

Proposed commit message

Add StatusReporter for letting individual metricsets to report their statuses. Add support for StatusReporter for system/process integration.

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

  • Closes https://github.com/elastic/beats/issues/39736
  • Closes https://github.com/elastic/beats/issues/39737

Use cases

Screenshots

Logs

VihasMakwana avatar Jun 26 '24 12:06 VihasMakwana

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @VihasMakwana? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

mergify[bot] avatar Jun 26 '24 12:06 mergify[bot]

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Jun 26 '24 16:06 elasticmachine

This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-metricbeat-status-reporter upstream/add-metricbeat-status-reporter
git merge upstream/main
git push upstream add-metricbeat-status-reporter

mergify[bot] avatar Jul 01 '24 22:07 mergify[bot]

@cmacknz with Fae & Alex out, could you please review here? If you think you won't have time I will find someone else.

pierrehilbert avatar Jul 08 '24 12:07 pierrehilbert

I think you want to set the status on the return value from Fetch in this block as well, this seems to be the common way to report errors when an event couldn't be generated at all:

https://github.com/elastic/beats/blob/032a4cfd5f3b8fa8354ac1e0062a0e1f196c60d0/metricbeat/mb/module/wrapper.go#L251-L264

If I search for the Error fetching PID info for in the original issue description of https://github.com/elastic/beats/issues/39737 it leads me to here which doesn't actually generate an event: https://github.com/elastic/beats/blob/032a4cfd5f3b8fa8354ac1e0062a0e1f196c60d0/metricbeat/module/system/process/process.go#L128-L135

To me it seems like we need to catch the return value from Fetch but maybe I am missing something here.

cmacknz avatar Jul 09 '24 20:07 cmacknz

@cmacknz If you have a look https://github.com/elastic/beats/blob/6e88ac5c0335822d0b2c1fc2eccc96213747b365/metricbeat/mb/module/wrapper.go#L263 this Error() call would result in an event with an Error: err set, it happens https://github.com/elastic/beats/blob/6e88ac5c0335822d0b2c1fc2eccc96213747b365/metricbeat/mb/module/wrapper.go#L357

If I search for the Error fetching PID info for in the original issue description of https://github.com/elastic/beats/issues/39737 it leads me to here which doesn't actually generate an event:

I have added a test case for that specific scenario. It works as expected and status is changed to Degraded.

Alternatively, we can do as you mentioned and keep the UpdateStatus after Fetch.

Also, I've made the error in statusReporter more descriptive.

VihasMakwana avatar Jul 09 '24 21:07 VihasMakwana

Thanks I wasn't following that. I think it would be a more obvious if we set state based on the return value of Fetch (certainly it would be for me). I think it also has more obvious behavior with respect to when the status will be cleared - it will be cleared on the first call to Fetch that doesn't return an error.

I think setting module status based on the presence Error in the event can catch more things, but conceptually it doesn't entirely feel right to me that an error in an event means the entire module needs to be marked as degraded. I would rather it clearly indicate "module cannot collect data" which is what the return value from Fetch tells us, where as the Error field on the event is a bit more ambiguous. Maybe we generated the event, and some non-fatal error occurred so the event is not as complete as it could be, but the module is mostly fine.

cmacknz avatar Jul 10 '24 20:07 cmacknz

@cmacknz If you can have a look now, I've added the status reporter on Fetch() return.

VihasMakwana avatar Jul 12 '24 16:07 VihasMakwana

This looks good, but I don't see any actual invocations of SetStatusReporter, which seems to be required for the other changes to have an effect. Am I missing something, or is that coming in a later PR?

@fae, This part was introduced by https://github.com/elastic/beats/pull/39209 and it's done at https://github.com/elastic/beats/blob/1a6318d215517feb4c5c880619db8589d926956b/libbeat/cfgfile/list.go#L157-L161 in a generic way for all the runners.

VihasMakwana avatar Jul 15 '24 19:07 VihasMakwana

Thanks I wasn't following that. I think it would be a more obvious if we set state based on the return value of Fetch (certainly it would be for me). I think it also has more obvious behavior with respect to when the status will be cleared - it will be cleared on the first call to Fetch that doesn't return an error.

@cmacknz Let me know how it looks to you now and if you have any more comments regarding this.


@faec Thanks for reviewing!

VihasMakwana avatar Jul 16 '24 17:07 VihasMakwana

The changes themselves LGTM, but I am not getting the result I expected when I test this manually. What I did:

  1. Create an ubuntu VM with multipass and an 8.15.0-SNAPSHOT ESS deployment on staging.elastic.co.
  2. Build agentbeat from this branch with SNAPSHOT=true PLATFORMS=linux/arm64 PACKAGES=targz mage package
  3. Transfer agentbeat into the VM: multipass transfer ./build/distributions/agentbeat-8.16.0-SNAPSHOT-linux-arm64.tar.gz $VM_NAME:/home/ubuntu
  4. Shell into the VM with multipass shell
  5. Download latest 8.15.0 BC agent: curl -LO https://staging.elastic.co/8.15.0-a47a2107/downloads/beats/elastic-agent/elastic-agent-8.15.0-linux-arm64.tar.gz
  6. Extract the agent: tar xvfz elastic-agent-8.15.0-linux-arm64.tar.gz
  7. Overwrite the packaged agentbeat with the one from this branch: cp ./agentbeat-8.16.0-SNAPSHOT-linux-arm64/agentbeat ./elastic-agent-8.15.0-linux-arm64/data/elastic-agent-475d24/components/
  8. Install the agent as unprivileged using the default Fleet policy to collect system metrics: ./elastic-agent install --url=X --enrollment-token=Y --unprivileged -f
  9. Create a simple script and run it as root to guarantee there is some PID running under root that we should fail to get metrics from.
~$ cat script.sh
while true;
do
        echo "Hello!";
        sleep 1;
done
~$ sudo ./script.sh
  1. Observe agent remains healthy but the logs show privilege related errors, many of which are only visible at debug level.
{"log.level":"debug","@timestamp":"2024-07-16T21:04:30.148Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:04:40.143Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Error fetching PID info for 30642, skipping: GetInfoForPid: no such process","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":198,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidIter"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.055Z","message":"Non-fatal error fetching PID metrics for 30356, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30356/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30649, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30649/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30650, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30650/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30651, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30651/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30652, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30652/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30670, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30670/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","ecs.version":"1.6.0"}

I've attached diagnostics from the VM. The logs indicating the error is non-fatal might be what is catching us here, I think this has changed since the issue to implement this was created. Perhaps you were initially right hooking into the error in the event, which might catch these since they are non-fatal, but I still think looking at events makes it harder to see when we will go back to healthy. Regardless, this isn't actually catching the degraded state we want.

unpriv-metricbeat-diagnostics.zip

I also wonder if we could get a test to catch this, assuming I verified this correctly.

cmacknz avatar Jul 16 '24 21:07 cmacknz

@cmacknz I see.

I think this is due to the fact that elastic-agent-system-metrics only logs the error and doesn't return them to the caller when monitoring set of processes. Example https://github.com/elastic/elastic-agent-system-metrics/blob/a9ecc37f3de53d18a85ad182718010650aad774b/metric/system/process/process.go#L263-L269

We can make use of multierr to capture the errors encountered for set of processes and pass it to the caller.

VihasMakwana avatar Jul 17 '24 10:07 VihasMakwana

This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-metricbeat-status-reporter upstream/add-metricbeat-status-reporter
git merge upstream/main
git push upstream add-metricbeat-status-reporter

mergify[bot] avatar Jul 17 '24 15:07 mergify[bot]

This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-metricbeat-status-reporter upstream/add-metricbeat-status-reporter
git merge upstream/main
git push upstream add-metricbeat-status-reporter

mergify[bot] avatar Jul 23 '24 19:07 mergify[bot]

@cmacknz Screenshot of error event on Kibana:

Screenshot 2024-07-24 at 3 00 33 PM

VihasMakwana avatar Jul 24 '24 09:07 VihasMakwana

Merging this as of now. https://github.com/elastic/beats/pull/40298 will take care of the non-fatal errors.

VihasMakwana avatar Jul 24 '24 09:07 VihasMakwana

⚠️ The sha of the head commit of this PR conflicts with #40400. Mergify cannot evaluate rules on this PR. ⚠️

mergify[bot] avatar Jul 31 '24 12:07 mergify[bot]