
Version 0.25.1 occasional startup issues

Open JDA88 opened this issue 1 year ago • 7 comments

Sometimes when windows_exporter starts, we get this error (event log):

ts=2024-01-28T03:43:29.147Z caller=stdlib.go:105 level=error caller=http.go:144 msg="error gathering metrics: error collecting metric Desc{fqName: \"windows_exporter_collector_success\", help: \"windows_exporter: Whether the collector was successful.\", constLabels: {}, variableLabels: {collector}}: failed to prepare scrape: EOF"

The issue is that once the error has happened at startup, the service stays up but is unable to recover. The service should either:

  1. Crash and exit (and then can be restarted)
  2. Be able to retry and recover
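
For illustration only, a rough Go sketch of what option 2 could look like; prepareScrape and prepareWithRetry are hypothetical names and not part of windows_exporter:

package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"time"
)

// prepareScrape stands in for whatever startup step currently fails with EOF.
// It is a placeholder, not a real windows_exporter function.
func prepareScrape() error {
	return io.EOF // simulate the failure reported above
}

// prepareWithRetry retries the prepare step on EOF a few times before giving up.
func prepareWithRetry(attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = prepareScrape(); err == nil {
			return nil
		}
		if !errors.Is(err, io.EOF) {
			return err // unexpected error: fail immediately
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	if err := prepareWithRetry(3, time.Second); err != nil {
		// Fallback to option 1: exit so the service manager can restart the service.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}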

Additional information

  • Restarting the service fixes the issue
  • It can happen with various collector combinations, even with simple ones (CS, CPU)
  • Very hard to reproduce as it only happens randomly (on our 700+ server deployment we see 5-10 occurrences a week).
  • Hard to tell when it started occurring, but it was not present in v0.22

JDA88 avatar Jan 29 '24 10:01 JDA88

Just wanted to note that I'm also seeing something similar.

We also have a large number of servers, and for the most part it's fine, but a handful of servers show this error.

Restarting fixes it.

safster123 avatar Apr 04 '24 13:04 safster123

When I follow the error back to see where the EOF could have happened, I end up in the perflib query func: https://github.com/prometheus-community/windows_exporter/blob/96c1412a5bd985ee7dc506ac13854641d3164ee4/pkg/perflib/perflib.go#L267

This function calls the following binary reader more than once: https://github.com/prometheus-community/windows_exporter/blob/96c1412a5bd985ee7dc506ac13854641d3164ee4/pkg/perflib/raw_types.go#L90-L92

This reader can return EOF if the given buffer is empty. So one thing that could happen is, for example, that the queryraw func returns an empty buffer here: https://github.com/prometheus-community/windows_exporter/blob/96c1412a5bd985ee7dc506ac13854641d3164ee4/pkg/perflib/perflib.go#L268-L283

That would then return the EOF on line 283 and end in the mentioned error: "failed to prepare scrape: EOF"
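
As a standalone illustration (not windows_exporter code; the struct below is only loosely modeled on a perf data block header), encoding/binary does return io.EOF when asked to read from an empty buffer, which would surface exactly as the error above:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func main() {
	// Toy header struct; the real perflib types live in pkg/perflib/raw_types.go.
	var block struct {
		Signature    [4]uint16
		LittleEndian uint32
	}
	empty := bytes.NewReader(nil) // stands in for an empty buffer returned by the query
	err := binary.Read(empty, binary.LittleEndian, &block)
	fmt.Println(err) // prints: EOF
}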

Since it is not really reproducible and it is very hard to guess which perflib call this could happen on, my suggestion would be to add the query string to the error message in order to get more visibility and maybe a starting point for further debugging.

Maybe something like this:

if err != nil {
  return nil, fmt.Errorf("failed to read performance data block for %s with: %v", query, err)
}
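
A small, optional variation on that (purely my assumption about what callers might want): wrapping with %w instead of %v keeps the added query context while still letting callers detect the underlying io.EOF via errors.Is:

if err != nil {
	// %q quotes the query string; %w wraps err so errors.Is(err, io.EOF) still works upstream.
	return nil, fmt.Errorf("failed to read performance data block for %q: %w", query, err)
}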

@breed808 or @jkroepke do you think that would be worth adding? I am open to creating a PR :)

DiniFarb avatar Apr 04 '24 18:04 DiniFarb

Just to mention that I also stumbled upon this issue. Occurrence ratio ~3% (1 out of 33 hosts). Although restarting the service did not fix the issue for me.

billtzim avatar Apr 25 '24 12:04 billtzim

Are you able to test your fix?

jkroepke avatar Apr 25 '24 13:04 jkroepke

@jkroepke are you addressing me? Because I don't have a solution/fix for this; my approach would only add more visibility, which could help find the cause of this problem. But I just saw PR #1459 and I think the effort should go more in that direction, because if I understand it correctly, most of the perflib calls would go away anyway once that PR succeeds.

DiniFarb avatar May 02 '24 18:05 DiniFarb

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions[bot] avatar Aug 01 '24 02:08 github-actions[bot]

Hello, regarding the problem: it went away when I updated my hosts to Windows 11.

billtzim avatar Aug 01 '24 06:08 billtzim