Changed collection method of the second counter value in the Windows Perfmon module of Metricbeat
What does this PR do?
There are two ways of collecting performance counters in Windows, as explained here.
The current Metricbeat implementation uses the one where PdhCollectQueryDataEx is called to retrieve the second raw value that is required to calculate the displayable value.
This PR changes this collection method to a simpler one, the one that calls PdhCollectQueryData twice, with a sleep of one second in between.
The function randSeq was implemented in an earlier version to support the current implementation, but is no longer used. Its deletion is therefore just cleanup.
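For illustration, here is a minimal sketch of the sleep-based flow. This is a sketch under assumptions, not the PR's actual diff: it assumes the pdh.Query wrapper exposes a CollectData() method wrapping PdhCollectQueryData, alongside the GetFormattedCounterValues() call that appears in the reader.

```go
package perfmon

import (
	"time"

	"github.com/elastic/beats/v7/metricbeat/helper/windows/pdh"
)

// collectTwice sketches the sleep-based method: two PdhCollectQueryData
// calls one second apart, after which PDH holds the two raw samples it
// needs to compute a displayable value for rate counters.
// Assumption: pdh.Query exposes CollectData() wrapping PdhCollectQueryData.
func collectTwice(q *pdh.Query) (map[string][]pdh.CounterValue, error) {
	// First call: collect the first raw value for every counter in the query.
	if err := q.CollectData(); err != nil {
		return nil, err
	}
	// Rate counters need two raw samples taken some interval apart.
	time.Sleep(time.Second)
	// Second call: collect the second raw value.
	if err := q.CollectData(); err != nil {
		return nil, err
	}
	// Compute the displayable values from the two raw samples.
	return q.GetFormattedCounterValues()
}
```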
Why is it important?
Collection of the counter "Processor(_Total) % Processor Time" produced false 100 % spikes. Testing on different computers and servers showed that whenever the sampling frequency, or period, was larger than one second, these false 100 % values appeared. They were verified to be false by comparing against Windows' own Performance Monitor. When running several instances of Metricbeat alongside each other, only the instances using the current collection method with a period larger than one second measured these spikes.
This is part of one configuration file used when testing, where the sampling frequency/period is set to 2 seconds:
```yaml
metricbeat.modules:
- module: windows
  enabled: true
  metricsets:
    - perfmon
  period: 2s
  perfmon.ignore_non_existent_counters: true
  perfmon.group_measurements_by_instance: true
  perfmon.queries:
    - object: Processor
      instance:
        - '_Total'
      counters:
        - name: '% Processor Time'
```
In the two figures below (Figure (1) and Figure (2)), the current implementation (named "thread") is compared with the implementation of this PR (named "sleep") for different sampling frequencies/periods. The names "sleep1s" and "thread1s" indicate that the collection interval, i.e. the time between two PdhCollectQueryData calls, is hardcoded to 1s in that version.
In Figure (1), each point represents the maximum value over a 60-second window, which explains why versions with different sampling frequencies show different values, except for sleep1s-freq1s and thread1s-freq1s, which produce almost identical values (as expected). What this figure shows is that thread1s-freq2s and thread1s-freq10s produce false 100 % spikes.
Figure (2) zooms in on one of the 100 % spikes: thread1s-freq10s provides one 100 % value that is almost certainly false. In addition, thread1s-freq2s produces 0 % values that do not seem correct either.
Figure (1)
Figure (2)
The figures above show tests performed using Metricbeat 7.17.2, but the results have also been reproduced using Metricbeat 8.3.1, as well as on different computers and servers, a database server for example. The code in question is the same in both versions, so this is expected.
As the figures above show, and as observed generally in all the experiments done, the sleep method always produces values that seem correct and that conform with what Windows Performance Monitor observes. For the current collection method, the problem only arises when the sampling frequency/period is larger than the collection interval (tests have also been made with larger collection intervals, in versions of Metricbeat called "sleep2s", "thread2s", etc.). It therefore seems like there is some bug in how Metricbeat's sampling frequency/period interacts with the creation of the collection thread. This PR does not explain precisely what that issue is, but simply provides an alternative where the problem does not arise.
To get a closer look at what was going on, tests were run in which the raw values were fetched after each call to the PdhCollectQueryData function. When using either the sleep method, or the thread method with a sampling frequency/period that is not larger than the collection interval (1 second), these raw values were collected at the correct times, and they could be used to verify the fetched displayable value. Below are some examples of logs of two consecutive samples for these cases.
sleep1s-freq1s
First sample: {"@timestamp":"2022-07-09T00:53:10.219+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5870585175781 133017943902193928 {
0}}]])"} {"@timestamp":"2022-07-09T00:53:11.223+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 1.5947397042254785 { 0}}]])"} {"@timestamp":"2022-07-09T00:53:11.223+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5870595048828 133017943912226976 { 0}}]])"} Consecutive sample: {"@timestamp":"2022-07-09T00:53:11.224+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5870595058593 133017943912239452 { 0}}]])"} {"@timestamp":"2022-07-09T00:53:12.237+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 4.005576157731239 { 0}}]])"} {"@timestamp":"2022-07-09T00:53:12.237+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5870604785156 133017943922371877 { 0}}]])"}
sleep1s-freq10s
First sample: {"@timestamp":"2022-07-08T23:46:47.100+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5831721962890 133017904071006917 {
0}}]])"} {"@timestamp":"2022-07-08T23:46:48.116+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 2.633179634240257 { 0}}]])"} {"@timestamp":"2022-07-08T23:46:48.116+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5831731855468 133017904081167029 { 0}}]])"} Consecutive sample: {"@timestamp":"2022-07-08T23:46:57.092+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5831819648437 133017904170920927 { 0}}]])"} {"@timestamp":"2022-07-08T23:46:58.101+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 2.156219690300587 { 0}}]])"} {"@timestamp":"2022-07-08T23:46:58.101+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5831829521484 133017904181011550 { 0}}]])"}
thread1s-freq1s
First sample: {"@timestamp":"2022-07-09T00:18:04.609+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5850016337890 133017922846099090 {
0}}]])"} {"@timestamp":"2022-07-09T00:18:05.619+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 3.05911239227572 { 0}}]])"} {"@timestamp":"2022-07-09T00:18:05.619+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5850026123046 133017922856193031 { 0}}]])"} Consecutive sample: {"@timestamp":"2022-07-09T00:18:05.619+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5850026132812 133017922856196130 { 0}}]])"} {"@timestamp":"2022-07-09T00:18:06.646+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 3.3303272308510157 { 0}}]])"} {"@timestamp":"2022-07-09T00:18:06.646+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5850036044921 133017922866449717 { 0}}]])"}
This is, however, not the case when using the current implementation (thread method) with a sampling frequency/period larger than the collection interval. Both raw value samples are identical, and hence one does not obtain the fetched displayable value when plugging them into the formula. Here is one example of such logs:
thread1s-freq10s
First sample: {"@timestamp":"2022-07-08T23:19:38.104+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5815925039062 133017887781042404 {
0}}]])"} {"@timestamp":"2022-07-08T23:19:38.105+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 0.5117438823267872 { 0}}]])"} {"@timestamp":"2022-07-08T23:19:38.105+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5815925039062 133017887781042404 { 0}}]])"} Consecutive sample: {"@timestamp":"2022-07-08T23:19:48.095+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5816021699218 133017887880951263 { 0}}]])"} {"@timestamp":"2022-07-08T23:19:48.097+0200","message":"Formatted values: \Processor(_Total)\% Processor Time:[{_Total 5.3058896133528215 { 0}}]])"} {"@timestamp":"2022-07-08T23:19:48.097+0200","message":"Raw values: \Processor(_Total)\% Processor Time:[{_Total 5816021699218 133017887880951263 { 0}}]])"}
Looking into the implementation history, there does not seem to be any documented explanation of why the current collection method was chosen, especially given this PR's simpler alternative.
Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] ~~I have made corresponding changes to the documentation~~
- [ ] ~~I have made corresponding changes to the default configuration files~~
- [ ] ~~I have added tests that prove my fix is effective or that my feature works~~
- [x] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
💚 CLA has been signed
:green_heart: Build Succeeded
Build stats
- Start Time: 2022-08-30T07:50:59.923+0000
- Duration: 49 min 5 sec
Test stats :test_tube:
| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 3719 |
| Skipped | 873 |
| Total | 4592 |
:green_heart: Flaky test report
Tests succeeded.
:robot: GitHub comments
To re-run your PR in the CI, just comment with:
- `/test`: Re-trigger the build.
- `/package`: Generate the packages and run the E2E tests.
- `/beats-tester`: Run the installation tests with beats-tester.
- `run elasticsearch-ci/docs`: Re-trigger the docs validation. (use unformatted text in the comment!)
cla/check
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```
git fetch upstream
git checkout -b main upstream/main
git merge upstream/main
git push upstream main
```
/test
The tests won't run until the conflict is resolved it seems.
/test
@jorgeta Thank you so much for this contribution and the super detailed explanation!
I started to look at your code, but the PR is bringing in many commits that were not made by you, making it pretty hard to review and understand what has changed.
Could you look into it and fix your branch/PR to contain only your commits?
You don't need to worry about keeping your PR up-to-date with `main`; as long as there are no conflicts and the branch was based on a recent version of `main`, it will be fine to merge.
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```
git fetch upstream
git checkout -b main upstream/main
git merge upstream/main
git push upstream main
```
@belimawr Thanks for the reply. The PR should now be updated in the way you requested.
/test
This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @jorgeta? 🙏. For such, you'll need to label your PR with:
- The upcoming major version of the Elastic Stack
- The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:
- `backport-v8./d.0` is the label to automatically backport to the `8./d` branch. `/d` is the digit
/test
Some of the lint/check steps are failing for metricbeat: https://beats-ci.elastic.co/blue/organizations/jenkins/Beats%2Fbeats/detail/PR-32305/23/pipeline
```
[2022-08-16T17:12:20.283Z] + make -C metricbeat check
[2022-08-16T17:12:20.283Z] make: Entering directory '/var/lib/jenkins/workspace/PR-32305-23-2c910f50-e5a2-4893-9423-401bf7fece6f/src/github.com/elastic/beats/metricbeat'
[2022-08-16T17:12:20.283Z] Installing mage v1.13.0.
[2022-08-16T17:12:20.546Z] /var/lib/jenkins/workspace/PR-32305-23-2c910f50-e5a2-4893-9423-401bf7fece6f/.magefile cleaned
[2022-08-16T17:12:20.546Z] mage check
[2022-08-16T17:12:59.336Z] >> fmt - go-licenser: Adding missing headers
[2022-08-16T17:12:59.337Z] >> fmt - goimports: Formatting Go code
[2022-08-16T17:12:59.337Z] >> fmt - autopep8: Formatting Python code
[2022-08-16T17:13:00.766Z] module/windows/perfmon/reader.go
[2022-08-16T17:13:32.895Z] Generated fields.yml for metricbeat to /var/lib/jenkins/workspace/PR-32305-23-2c910f50-e5a2-4893-9423-401bf7fece6f/src/github.com/elastic/beats/metricbeat/fields.yml
[2022-08-16T17:13:32.895Z] Generated fields.yml for metricbeat to /var/lib/jenkins/workspace/PR-32305-23-2c910f50-e5a2-4893-9423-401bf7fece6f/src/github.com/elastic/beats/metricbeat/fields.yml
[2022-08-16T17:13:32.895Z] >> Building metricbeat.yml for linux/amd64
[2022-08-16T17:13:32.895Z] >> Building metricbeat.reference.yml for linux/amd64
[2022-08-16T17:13:32.895Z] >> Building metricbeat.docker.yml for linux/amd64
[2022-08-16T17:13:32.895Z] Generated fields.yml for metricbeat to /var/lib/jenkins/workspace/PR-32305-23-2c910f50-e5a2-4893-9423-401bf7fece6f/src/github.com/elastic/beats/metricbeat/build/fields/fields.all.yml
[2022-08-16T17:15:14.311Z] >> check: Checking source code for common problems
[2022-08-16T17:15:24.335Z] Error: some files are not up-to-date. Run 'make update' then review and commit the changes. Modified: [metricbeat/module/windows/perfmon/reader.go]
[2022-08-16T17:15:24.335Z] ../dev-tools/make/mage.mk:21: recipe for target 'check' failed
[2022-08-16T17:15:24.335Z] make: *** [check] Error 1
[2022-08-16T17:15:24.335Z] make: Leaving directory '/var/lib/jenkins/workspace/PR-32305-23-2c910f50-e5a2-4893-9423-401bf7fece6f/src/github.com/elastic/beats/metricbeat'
script returned exit code 2
```
Is this related to the importing of the fmt package? I tried to make some changes unrelated to the change of the collection method, so that the GitHub Actions linting would pass; but maybe I should leave that out? Although then the checks would fail in several places.
It looks like it is just a file that needs to be formatted. If you run `make -C metricbeat check` and then commit the changes, it will fix it. This is what it wants changed when I run that command myself:
```diff
diff --git a/metricbeat/module/windows/perfmon/reader.go b/metricbeat/module/windows/perfmon/reader.go
index 04063eb084..645938f882 100644
--- a/metricbeat/module/windows/perfmon/reader.go
+++ b/metricbeat/module/windows/perfmon/reader.go
@@ -24,8 +24,8 @@ import (
 	"fmt"
 	"regexp"
 	"strings"
-	"unicode"
 	"time"
+	"unicode"
 
 	"github.com/elastic/beats/v7/metricbeat/helper/windows/pdh"
@@ -132,7 +132,7 @@ func (re *Reader) getValues() (map[string][]pdh.CounterValue, error) {
 	if err != nil {
 		return nil, err
 	}
-
+
 	// Collect the displayable value.
 	val, err = re.query.GetFormattedCounterValues()
 	if err != nil {
```
/test
> Looking into the implementation history, there does not seem to be any documented explanation of why the current collection method was chosen, especially given this PR's simpler alternative.
@jorgeta, we had this implementation for some time and we switched last year to the PdhCollectQueryDataEx alternative (https://github.com/elastic/beats/pull/26886), as it helped with some of the issues we had at that time (missing counter values when configuring a large number of counter queries).
Have you tested this implementation while configuring a higher number of counters? Are you getting consistent results?
Hi @narph, thanks for your feedback. We have tested the implementation with different numbers of counters: from having just one, up to 90 "windows.perfmon.*" counters, along with an additional 54 "system.*" counters (that is, tests have collected between 1 and 144 counters in total). I could try to test with a larger number, if this is not considered large.
So, as far as I can see, the implementation in this PR is still different from the one used before PdhCollectQueryDataEx was added. It seems like in the earlier implementation, PdhCollectQueryData was only called once per collection. This means that if one is collecting counters every 15 seconds, then PdhGetFormattedCounterValueDouble would use the raw values collected 15 seconds ago along with the ones that were just collected in order to compute the displayable value. When PdhCollectQueryDataEx was added, that problem was solved. This PR instead calls PdhCollectQueryData twice with one second of sleep in between every time a (rate) counter is collected, contrary to what must have been happening in the pre-PdhCollectQueryDataEx implementation.
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```
git fetch upstream
git checkout -b main upstream/main
git merge upstream/main
git push upstream main
```
@fearful-symmetry can you review this as well?
@jorgeta sorry this is taking so long, this is a tricky piece of code with a history of issues as I'm sure you know.
Thanks everyone, merging this with 8.5 as the target version. We can have our QA team check this on a few different versions of Windows as a final sanity check.
Thanks everyone!
@cmacknz Would it be possible to backport the changes to version 7? Upgrading to a new version is a rather huge project in our company and is somewhat further down the line, and we would hope to have this fix available as soon as possible.
Sure, this is effectively a bug fix so we can backport to 7.17