windows_exporter
Version 0.25.1 occasional startup issues
Sometimes when the windows_exporter starts, we get this error (event log):
ts=2024-01-28T03:43:29.147Z caller=stdlib.go:105 level=error caller=http.go:144 msg="error gathering metrics: error collecting metric Desc{fqName: \"windows_exporter_collector_success\", help: \"windows_exporter: Whether the collector was successful.\", constLabels: {}, variableLabels: {collector}}: failed to prepare scrape: EOF"
The issue is that once the error has happened at startup, the service stays up but is unable to recover. The service should either:
- Crash and exit (and then can be restarted)
- Be able to retry and recover
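To illustrate what I mean by the second option, here is a rough sketch (not the exporter's actual code; prepareScrape is a hypothetical stand-in for whatever call currently fails with EOF) of retrying the first scrape a few times and exiting non-zero so the service manager can restart the process:

package main

import (
    "log"
    "os"
    "time"
)

// prepareScrape stands in for whatever call currently fails with EOF at startup.
func prepareScrape() error { return nil }

func main() {
    const attempts = 5
    var err error
    for i := 1; i <= attempts; i++ {
        if err = prepareScrape(); err == nil {
            break
        }
        log.Printf("startup scrape failed (attempt %d/%d): %v", i, attempts, err)
        time.Sleep(2 * time.Second)
    }
    if err != nil {
        // fail fast so the Windows service manager can restart the process
        os.Exit(1)
    }
    // ... continue normal startup ...
}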
Additional information:
- Restarting the service fixes the issue
- It can happen with various collector combinations, even simple ones (CS, CPU)
- Very hard to reproduce as it only happens randomly (on our 700+ server deployment we see 5-10 occurrences a week).
- Hard to tell when it started occurring, but it was not present in v0.22
Just wanted to note that I'm also seeing something similar.
We also have a large number of servers and for the most part it's fine but a handful of servers will show this error.
Restarting fixes it.
When I follow the error line to see where the EOF could have happened, I end up in the perflib query func:
https://github.com/prometheus-community/windows_exporter/blob/96c1412a5bd985ee7dc506ac13854641d3164ee4/pkg/perflib/perflib.go#L267
This function calls the following binary reader more than once:
https://github.com/prometheus-community/windows_exporter/blob/96c1412a5bd985ee7dc506ac13854641d3164ee4/pkg/perflib/raw_types.go#L90-L92
This reader can return EOF if the given buffer is empty. So one possibility is that the queryraw func returns an empty buffer here:
https://github.com/prometheus-community/windows_exporter/blob/96c1412a5bd985ee7dc506ac13854641d3164ee4/pkg/perflib/perflib.go#L268-L283
which would then surface the EOF on line 283 and end in the mentioned error: failed to prepare scrape: EOF"
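To illustrate the suspected failure mode outside the exporter (a standalone sketch; the struct here is a simplified stand-in for the real perflib types, not the actual code): binary.Read against an empty buffer returns io.EOF, which would bubble up exactly as the reported error.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// simplified stand-in for the real perflib header struct
type perfDataBlock struct {
    Signature [4]uint16
    Version   uint32
}

func main() {
    empty := bytes.NewReader(nil) // what an empty perflib buffer would look like
    var block perfDataBlock
    // binary.Read returns io.EOF when the reader has no bytes at all
    err := binary.Read(empty, binary.LittleEndian, &block)
    fmt.Println(err) // prints: EOF
}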
Since it is not really reproducible and it is very hard to guess on which perflib call this happens, my suggestion would be to add the query string to the error message, to get more visibility and maybe a starting point for further debugging.
Maybe something like this:
if err != nil {
    return nil, fmt.Errorf("failed to read performance data block for %s with: %v", query, err)
}
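A small variation on the same idea (just a sketch, assuming query is in scope at that point): wrapping with %w instead of %v keeps the underlying io.EOF inspectable for callers via errors.Is:

if err != nil {
    return nil, fmt.Errorf("failed to read performance data block for %q: %w", query, err)
}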
@breed808 or @jkroepke, do you think that would be worth adding? I am open to creating a PR :)
Just to mention that I also stumbled upon this issue. Occurrence ratio ~3% (1 out of 33 hosts). However, restarting the service did not fix the issue for me.
Are you able to test your fix?
@jkroepke did you mean to address me? I don't have a solution/fix for this; my approach would only add more visibility, which could help find the cause of this problem. But I just saw PR #1459 and I think the effort should go more in that direction, since if I understand it correctly, once that PR lands most of the perflib calls would go away anyway.
This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.
Hello, regarding the problem: it went away when I updated my hosts to Windows 11.