High memory usage on 0.16.0

Open jesus-velez opened this issue 3 years ago • 53 comments

Hello!

I have a couple of servers reporting high memory usage by the windows_exporter. I am using the default configuration/collectors. Is there any way to limit the exporter's resource usage?

[screenshot: windows_exporter memory usage]

Exporter version: windows_exporter, version 0.16.0 (branch: master, revision: f316d81d50738eb0410b0748c5dcdc6874afe95a) build user: appveyor-vm\appveyor@appveyor-vm build date: 20210225-10:52:19 go version: go1.15.6 platform: windows/amd64

OS Name: Microsoft Windows Server 2016 Standard OS Version: 10.0.14393 N/A Build 14393

    Handles  NPM(K)     PM(K)     WS(K)      CPU(s)     Id  SI  ProcessName
    -------  ------     -----     -----      ------     --  --  -----------
       1260     142  12748860  12496424  110,102.69  11388   0  windows_exporter

jesus-velez avatar Jul 01 '21 20:07 jesus-velez

How long has the exporter been running on the affected hosts? When initially started, how much memory does the exporter use?

Graphing go_memstats_alloc_bytes, go_memstats_alloc_bytes_total and go_memstats_frees_total for this host should provide more info.

I'd also recommend navigating to http://hostname:9182/debug/pprof/heap and uploading the output here.
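
A quick way to get a readable summary from that endpoint, assuming the exporter listens on the default port 9182, is Go's own pprof tool, which can fetch the URL directly:

    go tool pprof -top http://hostname:9182/debug/pprof/heap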

breed808 avatar Jul 02 '21 02:07 breed808

Thanks for following up with me. The box has been up for 49 days, so at least that long. Is there any metric that records that information?

go_memstats_alloc_bytes: [graph screenshot]

go_memstats_alloc_bytes_total: [graph screenshot]

go_memstats_frees_total: [graph screenshot]

heap.zip

jesus-velez avatar Jul 02 '21 15:07 jesus-velez

I have a similar thing, but we are on 0.15. The go_memstats_* metrics look pretty normal to me:

go_memstats_*_bytes are very constant during the week: [graph screenshot]

go_memstats_alloc_bytes and go_memstats_frees_total: [graph screenshot]

What goes through the roof is windows_process_private_bytes: [graph screenshot]

And for what it's worth, pprof/heap and pprof/allocs: pprof.zip

datamuc avatar Jul 07 '21 13:07 datamuc

I have both versions deployed, and 0.15 also displays the same behavior. Whatever I can do to help, please let me know, @breed808.

jesus-velez avatar Jul 07 '21 13:07 jesus-velez

Thanks all, this is really helpful info. I'll get through the heap/alloc dumps, and will try to reproduce the issue.

breed808 avatar Jul 08 '21 06:07 breed808

It doesn't happen on every machine, but when it does, it seems to be the service collector that is leaking memory.

In general the windows_exporter leaks memory anyway; it's just that with the service collector it can leak a huge amount.

datamuc avatar Jul 09 '21 10:07 datamuc

@breed808, I don't know what @datamuc's experience is, but it seems like it's mostly servers with SQL Server installed that get this issue. I have the exporter on application servers and database servers, and I have yet to notice anything on the application servers. Another point to add is that I have seen it on both VMs and physical servers.

jesus-velez avatar Jul 12 '21 14:07 jesus-velez

Unfortunately the heap dumps aren't too helpful: they're showing similar results to the go_memstats_alloc_bytes_total metric. @datamuc I wasn't able to view the dumps you provided, could you re-generate and upload them? Yours are ASCII text files, while the dump @jesus-velez submitted was gzip compressed data.

Are we able to confirm if the excessive memory consumption occurs when disabling all collectors that use WMI as a metric source? These are currently cpu_info, fsrmquota, msmq, mssql, remote_fx, service, system, terminal_services, and thermalzone

breed808 avatar Jul 13 '21 10:07 breed808

> Unfortunately the heap dumps aren't too helpful: they're showing similar results to the go_memstats_alloc_bytes_total metric. @datamuc I wasn't able to view the dumps you provided, could you re-generate and upload them? Yours are ASCII text files, while the dump @jesus-velez submitted was gzip compressed data.

OK, I had accessed the URLs with a browser, which gave me the text files. Now I used Invoke-WebRequest and got the binary data:

pprof.zip

I did some more tests; in each case there was an infinite PowerShell loop running that hit the /metrics endpoint.

In the zip there are two directories; host2 contains the debug/pprof info of a host that doesn't leak so much. I took two snapshots in every test.

host1 is a host that has a pretty big leak; pprof1 and pprof2 were done with the following config:

collectors:
    enabled: "cpu,cs,logical_disk,net,os,system,textfile,memory,logon,tcp,terminal_services,thermalzone,iis,process"

and pprof3 and pprof4 were done with:

collectors:
    enabled: "cs,logical_disk,net,os,textfile,memory,logon,tcp,iis,process"

> Are we able to confirm if the excessive memory consumption occurs when disabling all collectors that use WMI as a metric source? These are currently cpu_info, fsrmquota, msmq, mssql, remote_fx, service, system, terminal_services, and thermalzone

I've disabled all the mentioned collectors (pprof3/4) but it was still leaking.

windows_process_private_bytes while the endless loop was running: [graph screenshot]

And here without the loop, just normal Prometheus scraping: [graph screenshot]

datamuc avatar Jul 14 '21 12:07 datamuc

Hmm, those heap dumps are similar to those submitted earlier, with only a few MB allocated :disappointed:

@datamuc from previous comments, it appears the service collector may be responsible, but host1 in your previous comment doesn't have it enabled? It would be worth identifying the collectors responsible for the leaking. Are you free to do that?

I'll see if I can identify the commit or exporter version where the leaking was introduced.

breed808 avatar Jul 15 '21 11:07 breed808

> Hmm, those heap dumps are similar to those submitted earlier, with only a few MB allocated 😞
>
> @datamuc from previous comments, it appears the service collector may be responsible, but host1 in your previous comment doesn't have it enabled? It would be worth identifying the collectors responsible for the leaking. Are you free to do that?
>
> I'll see if I can identify the commit or exporter version where the leaking was introduced.

Yes, we have deactivated the service collector globally because it contributed a lot to the leaks. It is better now, but still not good, so my guess is that it has something to do with the number of metrics returned? The service collector has a lot of metrics.

I can do some more testing tomorrow I guess.

datamuc avatar Jul 15 '21 11:07 datamuc

The config:

log:
  level: fatal
collectors:
  enabled: "process"
collector:
  process:
    whitelist: "windows_exporter|exporter_exporter"
    blacklist: ""

And I ran this PowerShell script for an hour or so:

    # Hit the /metrics endpoint in a tight loop and print the exporter's own
    # private-bytes value from every scrape.
    while ($true) {
        $request = Invoke-WebRequest -Uri http://127.77.1.1:5758/metrics
        # Grab the windows_process_private_bytes sample for the windows_exporter process...
        $line = $request.RawContent -split "[`r`n]" | Select-String -Pattern 'windows_process_private_bytes.*windows_exporter'
        # ...and print just the value (the last whitespace-separated field).
        ($line -split " ")[-1]
    }

That resulted in the following graph of windows_process_private_bytes: [graph screenshot]

So, if it is a collector, then the process collector is one of them.

datamuc avatar Jul 15 '21 15:07 datamuc

I've been able to reproduce this with the script. The memory leak is present in the process collector prior to 4f891338935c9c327d32c9cc1fce72cba1f426f3, so I'm inclined to believe it may be an exporter-wide leak that isn't specific to one collector.

Alternatively, it may be a leak on the Windows side when certain WMI and/or Perflib queries are submitted. The Go memory stats aren't showing any leaks, which isn't too helpful.
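
One reason the Go stats can look clean while private bytes climb: the go_memstats_* metrics are built from runtime.ReadMemStats, which only covers memory managed by the Go runtime, so anything allocated on the Windows/COM side never shows up there. A minimal illustration (not exporter code):

    // Prints roughly what go_memstats_alloc_bytes / go_memstats_sys_bytes report.
    // Memory leaked by C or COM allocations grows the process' private bytes
    // without ever appearing in these numbers.
    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("HeapAlloc=%d bytes, Sys=%d bytes\n", m.HeapAlloc, m.Sys)
        // Compare these against windows_process_private_bytes for the exporter:
        // a large, growing gap points at allocations outside the Go heap.
    }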

I'll continue testing to see if I can identify the commit that introduced the leak.

breed808 avatar Jul 17 '21 11:07 breed808

I did some more testing. If I only enable the textfile collector, then there is no leak.

I tried to investigate a bit. I've seen that the windows_exporter uses a library from the perflib_exporter to access most of the performance values. Maybe both implementations, the perflib one and the old WMI one, are leaking?

What follows may be wrong; I've only tried to reason about the code:

I've looked into what telegraf is doing, because it does a similar thing. They open pdh.dll and call functions from this library. Before they open a new query they close the old one?

The Close call leads to this code, where they explicitly mention freeing some memory.

Then I looked into the perflib_exporter; it comes with a perflib package that provides access to the performance counters. Unlike telegraf, they are not using pdh.dll because it is too high level?

Anyway, it uses syscall.HKEY_PERFORMANCE_DATA to get the counters. So I googled a bit and found this page:

> To obtain performance data from a remote system, call the RegConnectRegistry function. Use the computer name of the remote system and use HKEY_PERFORMANCE_DATA as the key. This call retrieves a key representing the performance data for the remote system. Use this key rather than HKEY_PERFORMANCE_DATA key to retrieve the data.
>
> Be sure to use the RegCloseKey function to close the handle to the key when you are finished obtaining the performance data. This is important for both the local and remote cases:
>
> • RegCloseKey(HKEY_PERFORMANCE_DATA) does not actually close a registry handle, but it clears all cached data and frees the loaded performance DLLs.
> • RegCloseKey(hkeyRemotePerformanceData) closes the handle to the remote machine's registry.

But I cannot find the word "close" or "Close" anywhere in the perflib_exporter's codebase. So I thought, hooray, I've found the culprit, and started a perflib_exporter, which doesn't seem to leak at all. Long story short, I still have no clue what is going on. I think it is still possible that the perflib_exporter uses the perflib library a bit differently? Like reusing some data structures while the windows_exporter always asks for a new one, or so?

datamuc avatar Jul 20 '21 08:07 datamuc

Does anybody have an idea? We are considering moving to telegraf and scraping that for the Windows metrics.

datamuc avatar Sep 13 '21 13:09 datamuc

My priorities have shifted for the time being, but I will have to come back to this issue pretty soon. I have resorted to scheduling a daily exporter bounce in the meantime. I can say that this issue has been around since the wmi_exporter days. I recently logged into a server that we somehow missed in our exporter upgrades, and the memory usage was really high.

jesus-velez avatar Sep 13 '21 13:09 jesus-velez

@datamuc That looks like a very good find, and even if the perflib exporter appears not to leak, I'd argue it is still incorrect not to close the key. It might require a bit of locking to do this right though, otherwise I think overlapping scrapes might lead to issues. I'm not sure when I could commit to trying this out, but if anyone else wants to give it a go without the locking bits, I think we'd basically just need to add `defer syscall.RegCloseKey(syscall.HKEY_PERFORMANCE_DATA)` here.
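
A rough sketch of that idea, heavily simplified (the package and function names are made up for illustration, and the real perflib buffer handling is more involved):

    //go:build windows

    package perfdata

    import "syscall"

    // queryPerfData reads raw performance data for the given object index
    // ("Global", or a numeric index string) from the HKEY_PERFORMANCE_DATA
    // pseudo-key.
    func queryPerfData(object string) ([]byte, error) {
        name, err := syscall.UTF16PtrFromString(object)
        if err != nil {
            return nil, err
        }
        // Per the MSDN docs quoted above: this does not close a real handle,
        // it clears cached data and unloads the performance DLLs.
        defer syscall.RegCloseKey(syscall.HKEY_PERFORMANCE_DATA)

        buf := make([]byte, 1<<20)
        for {
            size := uint32(len(buf))
            err := syscall.RegQueryValueEx(syscall.HKEY_PERFORMANCE_DATA, name, nil, nil, &buf[0], &size)
            if err == nil {
                return buf[:size], nil
            }
            if err == syscall.ERROR_MORE_DATA {
                // The data didn't fit; grow the buffer and try again.
                buf = make([]byte, len(buf)*2)
                continue
            }
            return nil, err
        }
    }

As mentioned, overlapping scrapes would need some locking around this, since the pseudo-key is shared process-wide.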

carlpett avatar Sep 18 '21 19:09 carlpett

@carlpett I've added the mentioned line and compiled a binary. The change didn't help with the leak at all. :disappointed:

datamuc avatar Oct 08 '21 09:10 datamuc

What do you think? Should we try to bring this issue up in the Prometheus & The Ecosystem Community Meeting? Maybe somebody there can help or knows somebody who can help?

datamuc avatar Oct 14 '21 19:10 datamuc

> Are we able to confirm if the excessive memory consumption occurs when disabling all collectors that use WMI as a metric source? These are currently cpu_info, fsrmquota, msmq, mssql, remote_fx, service, system, terminal_services, and thermalzone

This is not true; other collectors also use WMI. I've started the exporter with

windows_exporter.exe --collectors.enabled=memory,os,net,time,cpu,cache

And this doesn't leak at all. So I'm pretty sure now that it leaks somewhere in github.com/StackExchange/wmi or github.com/go-ole/go-ole. The first one is definitely unmaintained; I'm not so sure about the second one.
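
For context, the WMI-backed collectors go through github.com/StackExchange/wmi, which drives COM via github.com/go-ole/go-ole. A minimal stand-alone query looks roughly like this (Win32_Processor is just an example class, not necessarily what the exporter queries):

    package main

    import (
        "fmt"

        "github.com/StackExchange/wmi"
    )

    // The struct name doubles as the WMI class and its fields become the SELECT
    // list, so CreateQuery below generates:
    //   SELECT Name, NumberOfCores FROM Win32_Processor
    type Win32_Processor struct {
        Name          string
        NumberOfCores uint32
    }

    func main() {
        var dst []Win32_Processor
        q := wmi.CreateQuery(&dst, "")
        // Each Query call initializes COM, builds an SWbemServices object via
        // go-ole, runs the WQL query, and tears it all down again; that OLE/COM
        // layer is where the leak is suspected.
        if err := wmi.Query(q, &dst); err != nil {
            fmt.Println("wmi query failed:", err)
            return
        }
        for _, p := range dst {
            fmt.Printf("%s: %d cores\n", p.Name, p.NumberOfCores)
        }
    }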

datamuc avatar Oct 20 '21 20:10 datamuc

That's a good find! I'll see if I can identify which of those libraries is the cause of the leak.

breed808 avatar Oct 27 '21 09:10 breed808

Hi, same problem here. So I tried to build a new package with @datamuc's commit on perflib and performed some tests (100 req/s on /metrics for 30 minutes) to compare both versions; the configured collectors are cpu,cs,logical_disk,net,os,system,textfile,process.

The patched version (the curve on the right) seems to do better than v0.16.0 (the one on the left) in terms of memory usage: [graph screenshot]

But a leak is still there. Do you have any other ideas to fix this?

Edit: the test was performed on the master branch, so only the process collector was using StackExchange/wmi, which explains the lower memory usage. After removing the process collector, memory usage is constant at around 20-30 MB.

geraudster avatar Nov 18 '21 16:11 geraudster

> Hi, same problem here. So I tried to build a new package with @datamuc's commit on perflib and performed some tests (100 req/s on /metrics for 30 minutes) to compare both versions; the configured collectors are cpu,cs,logical_disk,net,os,system,textfile,process.
>
> But a leak is still there. Do you have any other ideas to fix this?

We removed the process and service collectors from our configuration (and added the tcp collector, so it is: [defaults] - service + tcp), and now the memory usage is stable. It seems that every collector that makes use of github.com/StackExchange/wmi leaks.

datamuc avatar Nov 19 '21 09:11 datamuc

The patch above was merged into the perflib_exporter library with https://github.com/leoluk/perflib_exporter/pull/34.

Can somebody take care of updating the dependency in windows_exporter, please?
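
For anyone who wants to try it before a release, the bump should just be the usual Go module update (assuming the library's module path matches its repository path):

    go get github.com/leoluk/perflib_exporter@master
    go mod tidy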

datamuc avatar Dec 05 '21 09:12 datamuc

@datamuc Thanks for all the work that you are putting in on this.

jesus-velez avatar Dec 10 '21 16:12 jesus-velez

Are the leaks in the WMI collectors tracked elsewhere? Otherwise I vote for reopening this.

datamuc avatar Dec 20 '21 07:12 datamuc

My mistake, I'll reopen this to track the WMI leaks.

breed808 avatar Dec 20 '21 21:12 breed808

@breed808, do we know when the next release is coming up? I want to help test this, but when I try to build the .exe I am getting some weird prometheus/common/log errors:

cannot find module providing package github.com/prometheus/common/log

jesus-velez avatar Dec 27 '21 21:12 jesus-velez

@datamuc I checked all of our boxes, and it seems like the problem is mainly with Windows Server 2016. We also have 2012 and 2019; neither shows significant leaks. Do you have the same experience?

jesus-velez avatar Dec 28 '21 19:12 jesus-velez

I also found this, which seems to be related: https://github.com/StackExchange/wmi/issues/27#issue-237024341

jesus-velez avatar Dec 28 '21 20:12 jesus-velez