beats CPU metrics are incorrect on Windows machines with more than 64 cores

On Windows, Metricbeat measures CPU use via the Windows API call GetSystemTimes. Each metrics interval, it fetches the CPU numbers, and compares them to the previous measurement to determine CPU load during that interval. On most systems this includes CPU time "including all threads in all processes, on all processors". However, on systems with more than 64 cores, it returns only the data for the current processor group of up to 64 cores.

This has two consequences on high-core machines:

Reported CPU metrics include only CPU cores in the same processor group as the Metricbeat process.
When the Windows scheduler moves Metricbeat from one processor group to another (which is unpredictable but happens regularly), GetSystemTimes returns data from a different set of cores. If the new processor group has a lower CPU total than the previous one, Metricbeat will report negative numbers for some CPU metrics.

The most visible symptom is occasional negative numbers in CPU-related metrics, especially coming in pairs of two adjacent data points.

This seems to apply to ~all Metricbeat versions, and all versions of Windows that support more than 64 CPU cores.

Sep 20 '24 17:09 faec

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Sep 20 '24 17:09 elasticmachine

fyi @flexitrev

Oct 21 '24 10:10 nimarezainia

It is not immediately clear how to solve this, and the solution does not strictly seem like it will be simple, especially without being able to use the Task Manager implementation as a reference.

I suspect we will need something like a thread or process per processor group with their processor group affinities set appropriately so that they get scheduled into different processor groups.

Oct 21 '24 16:10 cmacknz

However, on systems with more than 64 cores, it returns only the data for the current processor group of up to 64 cores.

Is this documented anywhere? I don't see it in the GetSystemTimes docs.

My reading of this MS doc suggests our threads can switch between processor groups. Here's python pseudocode to do this. I don't have a >64T system to test with.


global last_system_times = {}


def get_cpu_time_deltas():
    idle_time_delta, kernel_time_delta, user_time_delta = 0, 0, 0
    
    # Backup original affinity
    GetThreadGroupAffinity(GetCurrentThread(), &saved_group_affinity)

    # Enumerate each NUMA node
    for node in range(GetNumaHighestNodeNumber()):
        # Enumerate all the processor groups for this NUMA node
        for group_affinity in GetNumaNodeProcessorMask2(node):
        
            # Switch to this processor group
            SetThreadGroupAffinity(GetCurrentThread(), &group_affinity)
            
            # Retrieve metrics for this processor group
            GetSystemTimes(&idle, &kernel, &user)
            
            # Retrieve old values
            if group_affinity.group in last_system_times:
                last_idle, last_kernel, last_user = last_system_times[group_affinity.group]
                
                # Compute deltas
                idle_time_delta += (idle - last_idle)
                kernel_time_delta += (kernel - last_kernel)
                user_time_delta += (user - last_user)
            
            # Store latest values
            last_system_times[group_affinity.group] = (idle, kernel, user)
            
    # Restore original affinity
    SetThreadGroupAffinity(GetCurrentThread(), &saved_group_affinity)
    
    return idle_time_delta, kernel_time_delta, user_time_delta

Oct 22 '24 18:10 gabriellandau

@cmacknz @flexitrev @pierrehilbert

I ran two instances of the metricbeat binary on my personal Windows machine (with 8 cores): one with performance counters enabled and the other with performance counters disabled (using the latest build from the main branch). Here are the results:

The left side of the dashboard is one with performance counters and right side is latest build from main

Average CPU usage of my system:
Idle CPU % (per core)
System CPU % (per core)
User CPU % (per core)

There is a clear correlation between the two sets of data, with the values being identical across both versions.

Nov 26 '24 09:11 VihasMakwana

I don't think values are identical but instead similars. Did you check directly in ES values to compare them?

Nov 26 '24 09:11 pierrehilbert

@pierrehilbert I think I know why. When I started two metricbeat binaries, there was a time difference, and the CPU data collection interval was set to 5 seconds. As a result, the two Metricbeat instances likely didn’t capture the CPU times at the exact same moment.

Let me retest it for 30 mins and reduce the interval to 1s and compare the results.

Nov 26 '24 09:11 VihasMakwana

@pierrehilbert

Did you check directly in ES values to compare them?

Yes.

Nov 26 '24 09:11 VihasMakwana

@pierrehilbert updated results. I ran two binaries at the exact same time (almost)

The results are better than before. What do you think?

Nov 26 '24 11:11 VihasMakwana

I don't see any performance impact that would worry me. This means if performance counters give us better, more reliable data there is no reason we shouldn't use them.

Nov 26 '24 15:11 flexitrev

Agreed, this looks similar at a high level.

Nov 26 '24 16:11 cmacknz

Since this change looks like it is happening in elastic-agent-system-metrics, which has no mechanism to tie things to a stack release, I would like some way for us to switch between both methods to exist. One that ideally we can expose in the configuration of the system/metrics input.

This will let us update to the latest version of elastic-agent-system-metrics everywhere (because of a CVE related dependency update for example) without worry of also changing underlying behavior, and it will give us a way to easily switch back to the old way immediately if there is some unexpected problem to work around.

Let's avoid the problems we had with turning off the introduction of the degraded state in system metrics, and build a proper feature flag for this from the beginning.

Nov 26 '24 16:11 cmacknz

Since this change looks like it is happening in elastic-agent-system-metrics, which has no mechanism to tie things to a stack release, I would like some way for us to switch between both methods to exist. One that ideally we can expose in the configuration of the system/metrics input.

This will let us update to the latest version of elastic-agent-system-metrics everywhere (because of a CVE related dependency update for example) without worry of also changing underlying behavior, and it will give us a way to easily switch back to the old way immediately if there is some unexpected problem to work around.

Let's avoid the problems we had with turning off the introduction of the degraded state in system metrics, and build a proper feature flag for this from the beginning.

I implemented a similar thing initially. Something like following:

Use Performance counter by default.
If any errors are encountered, fallback to current implementation.

But your point makes more sense. A flag would be much more preferable.

Nov 26 '24 16:11 VihasMakwana

I realize I jumped the gun on saying anything about performance here, since the binaries were being run at the same time, and measuring the same thing (thanks @strawgate). What I'd like to make sure is that we are not introducing a performance impact. Can we benchmark the two different approaches?

Nov 26 '24 16:11 flexitrev

What I'd like to make sure is that we are not introducing a performance impact. Can we benchmark the two different approaches?

Yes. I'll keep you posted on it.

Nov 26 '24 16:11 VihasMakwana

Here's the result of running the metricbeat binaries on 180 core machine.

The left side of the dashboard is one with performance counters and right side is latest build from main

The number on the right side only reports CPUs from current processor group while the left side of the dashboard utilises performance counters and reports metrics for each core installed on the system.

Total CPU usage reported by both metricbeats:

The one on the left side is accurate with Task Manager and it uses performance counters.

Nov 27 '24 18:11 VihasMakwana

We're not introducing any performance impact. The memory usage by metricbeat running under elastic-agent stays constant. Nothing really stands out between the two versions.

The CPU usage by metricbeat stays the same, before and after my implmentation. Here's the graph.

The "dip" in the graph is when I turned off the agent, so no data was reported during that time, so you can ignore that. The CPU usage to the left of dip is with 8.17.0-SNAPSHOT version elastic-agent and right side is with my implementation of performance counters.

Nov 28 '24 21:11 VihasMakwana

Reopening until verified once more.

Dec 10 '24 05:12 VihasMakwana

Hi team, could you please help check and confirm from which release the fix of this bug will be available? We have an user have upgraded to 8.16.2 and 8.17.0 but issue persist.

I see these 3 PRs merged on Dec 18 2024. [8.x](backport #41965) [system/cpu][system/core] - New config for using performance counters #42024 [8.16](backport #41965) [system/cpu][system/core] - New config for using performance counters #42067 [8.17](backport #41965) [system/cpu][system/core] - New config for using performance counters #42068

Could you please check and confirm the fix is not available in 8.16.2 and 8.17.0, but will be available in the next release which will be 8.16.3 and 8.17.1? Thank you.

Jan 15 '25 00:01 bojian-ben-li

Hello @VihasMakwana ,

Could you please check and get back.

Thank you!

Jan 16 '25 03:01 vishal-jogi

@vishal-jogi @bojian-ben-li yes, you're correct.

It will be released in 8.16.3 and 8.17.1.

Jan 16 '25 09:01 VihasMakwana

@VihasMakwana Thanks for the confirmation.

I assume the public release dates for these 2 versions are 21 Jan 2025, just checking if it is still valid.

Regards, Vishal

Jan 16 '25 11:01 vishal-jogi

Yes. That's valid

Jan 16 '25 12:01 VihasMakwana

Hello @VihasMakwana ,

Thanks for the confirmation!

Jan 17 '25 06:01 vishal-jogi

Hi folks,

Just reporting that version 8.17.1 still have the issue! I have tested and the issue is still there!

Jan 22 '25 18:01 apmartins85

@apmartins85 can you share your config?

Jan 22 '25 18:01 VihasMakwana

Hi @VihasMakwana I removed a couple of things that it doesn't matter. I have a case in fight inside elasticsearch.

metricbeat.modules:

#-------------------------------- System Module --------------------------------
- module: system
  period: 30s
  metricsets:
    - cpu
    - memory
    - network
    - process_summary
  enabled: true
  cpu.metrics:
    - normalized_percentages
    - percentages
- module: system
  period: 60s
  metricsets:
    - filesystem
    - fsstat
  filesystem.ignore_types: [sysfs, rootfs, ramfs, bdev, proc, cgroup, cpuset, tmpfs, devtmpfs, debugfs, securityfs, sockfs, dax, bpf, pipefs, configfs, devpts, hugetlbfs, autofs, pstore, mqueue, binfmt_misc]


- module: beat
  metricsets:
    - stats
    - state
  xpack.enabled: true
  period: 10s
  hosts: ["http://localhost:5066"]


metricbeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

#================================ Processors =====================================

# Configure processors to enhance or manipulate events generated by the beat.

processors:
  - add_host_metadata: ~
  - add_cloud_metadata:
      timeout: 10s

VM is on azure: Standard M96ds 2 v3 host.os.name: Windows Server 2022 Datacenter

Jan 22 '25 18:01 apmartins85

@apmartins85 you need to enable a config to solve this issue. Let me share the config.

Jan 22 '25 20:01 VihasMakwana

@apmartins85 here it is:

https://github.com/elastic/beats/blob/424070e87d831d2d66a7514e1c1120ad540a86db/metricbeat/metricbeat.reference.yml#L145-L148

Your updated config would look like:

metricbeat.modules:

#-------------------------------- System Module --------------------------------
- module: system
  period: 30s
  use_performance_counters: true
  metricsets:
    - cpu
    - memory
    - network
    - process_summary
  enabled: true
  cpu.metrics:
    - normalized_percentages
    - percentages
- module: system
  period: 60s
  metricsets:
    - filesystem
    - fsstat
  filesystem.ignore_types: [sysfs, rootfs, ramfs, bdev, proc, cgroup, cpuset, tmpfs, devtmpfs, debugfs, securityfs, sockfs, dax, bpf, pipefs, configfs, devpts, hugetlbfs, autofs, pstore, mqueue, binfmt_misc]

Jan 22 '25 21:01 VihasMakwana

Cool @VihasMakwana , let me test! I have 1 question: What happens if I set this variable to true for VM with less than 64 cores?

Thanks

Jan 22 '25 21:01 apmartins85

beats beats copied to clipboard

CPU metrics are incorrect on Windows machines with more than 64 cores

beats
beats copied to clipboard