
The windows_exporter service terminated unexpectedly.

Open PugachevDK opened this issue 9 months ago • 12 comments

Current Behavior

The windows_exporter service terminated unexpectedly. It has done this 1 time(s).

Expected Behavior

I use version 0.30.5. Config:

---
collectors:
  enabled: cpu,logical_disk,memory,net,os,service,system,license,scheduled_task,textfile,time
collector:
  service:
    include: ^(QORT.*|windows\_exporter|winrm)$
    exclude: 
  scheduled_task:
    include: ^/BackQORT/.+
    exclude: 
  textfile:
    directories: c:\windows_exporter\textfile
log:
  level: info
  file: eventlog
  format: json
scrape:
  timeout-margin: 0.5
telemetry:
  path: /metrics
  max-requests: 0
web:
  listen-address: ":9182"

The service is crashing on some nodes. I'm attaching a log file from the server where there is something to look at in the log besides startup events.

Event in the System log:

Log Name:      System
Source:        Service Control Manager
Date:          31.03.2025 8:09:29
Event ID:      7034
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      kzp-bqort-02.corp.abylaigs.kz
Description:
The windows_exporter service terminated unexpectedly.  It has done this 1 time(s).
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}" EventSourceName="Service Control Manager" />
    <EventID Qualifiers="49152">7034</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8080000000000000</Keywords>
    <TimeCreated SystemTime="2025-03-31T05:09:29.2848407Z" />
    <EventRecordID>13858</EventRecordID>
    <Correlation />
    <Execution ProcessID="732" ThreadID="3008" />
    <Channel>System</Channel>
    <Computer>kzp-bqort-02.corp.abylaigs.kz</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="param1">windows_exporter</Data>
    <Data Name="param2">1</Data>
    <Binary>770069006E0064006F00770073005F006500780070006F0072007400650072000000</Binary>
  </EventData>
</Event>

Steps To Reproduce


Environment

  • windows_exporter Version: 0.30.5
  • Windows Server Version: Windows Server 2022 Datacenter

windows_exporter logs

{"time":"2025-03-30T22:49:29.2790127Z","level":"WARN","source":"collect.go:212","msg":"collector memory failed after 4.519ms, resulting in 3 metrics","err":"panic in collector memory: runtime error: index out of range [0] with length 0. stack: goroutine 19412 [running]:\nruntime/debug.Stack()\n\tC:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1.1()\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:128 +0x65\npanic({0x12758a0?, 0xc000198000?})\n\tC:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:785 +0x132\ngithub.com/prometheus-community/windows_exporter/internal/collector/memory.(*Collector).collectPDH(0xc0002b2000, 0xc0005b3650)\n\tD:/a/windows_exporter/windows_exporter/internal/collector/memory/memory.go:416 +0x1109\ngithub.com/prometheus-community/windows_exporter/internal/collector/memory.(*Collector).Collect(0xc0002b2000, 0xc0005b3650)\n\tD:/a/windows_exporter/windows_exporter/internal/collector/memory/memory.go:351 +0x45\ngithub.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1()\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:135 +0x82\ncreated by github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector in goroutine 19400\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:124 +0x1aa\n"}

Anything else?

windows_exporter_events.zip

PugachevDK avatar Mar 31 '25 05:03 PugachevDK

Please check whether this snapshot build helps: https://github.com/prometheus-community/windows_exporter/actions/runs/14169959302/artifacts/2850404347

jkroepke avatar Mar 31 '25 12:03 jkroepke

It has been running for 3 hours and hasn't crashed. There are still errors; I'm attaching the log.

windows_exporter_events_2.csv

PugachevDK avatar Mar 31 '25 18:03 PugachevDK

The service went down:

{"time":"2025-03-31T17:46:29.2816149Z","level":"WARN","source":"collect.go:212","msg":"collector memory failed after 8.4227ms, resulting in 16 metrics","err":"panic in collector memory: runtime error: index out of range [0] with length 0. stack: goroutine 20270 [running]:\nruntime/debug.Stack()\n\tC:/hostedtoolcache/windows/go/1.24.1/x64/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1.1()\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:128 +0x65\npanic({0x181c200?, 0xc000168288?})\n\tC:/hostedtoolcache/windows/go/1.24.1/x64/src/runtime/panic.go:792 +0x132\ngithub.com/prometheus-community/windows_exporter/internal/collector/memory.(*Collector).collectPDH(0xc00016b400, 0xc0003cc310)\n\tD:/a/windows_exporter/windows_exporter/internal/collector/memory/memory.go:496 +0xef3\ngithub.com/prometheus-community/windows_exporter/internal/collector/memory.(*Collector).Collect(0xc00016b400, 0xc0003cc310)\n\tD:/a/windows_exporter/windows_exporter/internal/collector/memory/memory.go:351 +0x45\ngithub.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1()\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:135 +0x82\ncreated by github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector in goroutine 20282\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:124 +0x1aa\n"}

PugachevDK avatar Mar 31 '25 19:03 PugachevDK
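Editorial note: both stack traces above fail the same way. The memory collector reads a performance-counter result and indexes its first element while the slice is empty, so Go panics with "index out of range [0] with length 0". Below is a minimal sketch of that failure mode and the kind of length guard that avoids it; collectPerfData, perfInstance, and collectMemory are hypothetical names, not the actual windows_exporter internals.

package main

import (
	"errors"
	"fmt"
)

type perfInstance struct {
	Name  string
	Value float64
}

// collectPerfData stands in for a PDH query that can legitimately
// return an empty result set on some scrapes.
func collectPerfData() []perfInstance {
	return nil
}

var errNoData = errors.New("perf query returned no instances")

// collectMemory indexes the first instance only after checking the length,
// so an empty result becomes a scrape error instead of a panic.
func collectMemory() (float64, error) {
	data := collectPerfData()
	if len(data) == 0 {
		return 0, errNoData
	}
	return data[0].Value, nil
}

func main() {
	if v, err := collectMemory(); err != nil {
		fmt.Println("collect failed:", err)
	} else {
		fmt.Println("value:", v)
	}
}

Whatever the actual fix in the linked builds looks like, the symptom in the logs above is exactly this kind of unguarded index into an empty counter result.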

What is your scrape interval?

jkroepke avatar Mar 31 '25 20:03 jkroepke

scrape:
  timeout-margin: 0.5

PugachevDK avatar Mar 31 '25 20:03 PugachevDK

That's something different; the scrape interval is configured on the Prometheus side.

jkroepke avatar Mar 31 '25 20:03 jkroepke

The scrape interval on the Prometheus side is once per minute.

PugachevDK avatar Mar 31 '25 20:03 PugachevDK
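For reference, the exporter's scrape.timeout-margin only subtracts a safety margin from the timeout Prometheus sends with each scrape request; the scrape interval itself is set on the Prometheus side, as noted above. A once-per-minute job matching the setup described here might look like the sketch below. The job name is an assumption; the target combines the hostname from the event log with the listen port from the config above.

scrape_configs:
  - job_name: windows
    scrape_interval: 1m
    scrape_timeout: 30s
    static_configs:
      - targets: ["kzp-bqort-02.corp.abylaigs.kz:9182"]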

At the moment, no idea. It seems like variables are changed mid-scrape for unknown reasons. Do you have a special scrape config?

jkroepke avatar Apr 06 '25 19:04 jkroepke

Hi! To be honest, we're experiencing kinda similar behavior.

The exporter fails multiple times per day (usually it runs for a few hours and then exits). There's no message in the event log describing exactly what the issue is.

Some of the event log messages are:

# the latest log before failure
source=main.go:175 msg="couldn't initialize collector" err="error build collector net: failed to create Network Interface collector: failed to initialize collector: GetCounterInfo: buffer length is zero"

# the latest log before another failure
# seems like there are only informational messages, no error/warning messages found
source=tls_config.go:350 msg="TLS is disabled." http2=false address=[::]:9182

# the latest log before another failure
# seems like the message was generated 2 hours before the failure, so it isn't that related
source=collect.go:212 msg="collector system failed after 7.2576ms, resulting in 0 metrics" err="panic in collector system: runtime error: index out of range [0] with length 0. stack: goroutine 28222 [running]:\nruntime/debug.Stack()\n\tC:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/debug/stack.go:26 +0x5e\ngithub.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1.1()\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:128 +0x65\npanic({0xca8d20?, 0xc000036cf0?})\n\tC:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:785 +0x132\ngithub.com/prometheus-community/windows_exporter/internal/collector/system.(*Collector).Collect(0xc0000b9420, 0xc0001fe690)\n\tD:/a/windows_exporter/windows_exporter/internal/collector/system/system.go:165 +0x4d0\ngithub.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1()\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:135 +0x82\ncreated by github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector in goroutine 28221\n\tD:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:124 +0x1aa\n"

Our exporter version is the latest, v0.30.7. I can attach the event log if required, but I checked it many times and there is no correlation between the log messages and the failures.

First, I'd like to enable debug logging and look for any meaningful messages. @jkroepke, could I ask you to explain how to get a crash log or something similar? I think the crash isn't handled by the Windows event log, so that's why we're missing the true reason.


The scrape interval is configured to 30 seconds. The configuration is the default one (an empty config file). Windows information: Microsoft Windows Server 2019 Datacenter 10.0.17763 N/A Build 17763

mdraevich avatar Jun 16 '25 21:06 mdraevich

could I ask you to explain how to get a crash log or something similar? I think the crash isn't handled by the Windows event log, so that's why we're missing the true reason.

Normally, crashes and other output are sent to the program’s standard output. This is the default behavior for the underlying language and is usually visible in a console session.

However, if the program is running as a service, all messages sent to standard output are discarded by Windows. To make logs visible, the program must explicitly send each log line to the Windows Event Log.

But if the program crashes, it can’t execute any additional logic to send logs. That’s a design flaw in the Windows ecosystem.
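As a general Go technique, and not something windows_exporter is known to do here, a binary built with Go 1.23 or later can duplicate fatal crash output to a file with runtime/debug.SetCrashOutput, which survives the stdout/stderr discarding described above. A minimal sketch, with an example path:

package main

import (
	"os"
	"runtime/debug"
)

func main() {
	// Append crash output to a file; the path is only an example.
	f, err := os.OpenFile(`C:\windows_exporter\crash.log`,
		os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err == nil {
		// From Go 1.23 on, unrecovered panics and fatal runtime errors
		// are also written to this file, even when stderr is discarded.
		debug.SetCrashOutput(f, debug.CrashOptions{})
	}

	// ... the real service main loop would run here.
	panic("demonstration crash")
}

Running the binary once from an interactive console also makes the crash output visible, since it then goes to the console instead of being discarded.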

For reference: The last log line contains a panic:

panic in collector system: runtime error: index out of range [0] with length 0. stack: goroutine 28222 [running]:
runtime/debug.Stack()
	C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/debug/stack.go:26 +0x5e
github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1.1()
	D:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:128 +0x65
panic({0xca8d20?, 0xc000036cf0?})
	C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:785 +0x132
github.com/prometheus-community/windows_exporter/internal/collector/system.(*Collector).Collect(0xc0000b9420, 0xc0001fe690)
	D:/a/windows_exporter/windows_exporter/internal/collector/system/system.go:165 +0x4d0
github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector.func1()
	D:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:135 +0x82
created by github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector in goroutine 28221
	D:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:124 +0x1aa

A panic is an abnormal program termination, which was caught by the windows_exporter. It could be possible that locks are not released, resulting in a deadlock. I will look into that.

jkroepke avatar Jun 17 '25 06:06 jkroepke
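To illustrate the deadlock hypothesis above with hypothetical code (this is not the actual windows_exporter collector code): when a function panics between Lock and a non-deferred Unlock, recovering the panic in the caller does not release the mutex, so the next scrape blocks on it forever.

package main

import (
	"fmt"
	"sync"
	"time"
)

var mu sync.Mutex

// collect locks the mutex and panics before reaching the unlock,
// simulating a collector that fails mid-scrape without "defer mu.Unlock()".
func collect() {
	mu.Lock()
	panic("index out of range")
	// mu.Unlock() is never reached
}

// scrape recovers the panic, mirroring the pattern visible in the stack
// traces above, but it cannot release the lock that collect still holds.
func scrape() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from collector panic:", r)
		}
	}()
	collect()
}

func main() {
	scrape() // first scrape: the panic is recovered, but mu stays locked

	done := make(chan struct{})
	go func() {
		mu.Lock() // second scrape: blocks on the leaked lock
		defer mu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("second scrape acquired the lock")
	case <-time.After(2 * time.Second):
		fmt.Println("second scrape is deadlocked on the leaked lock")
	}
}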

Could you test whether the build from #2083 solves your issue?

https://github.com/prometheus-community/windows_exporter/actions/runs/15701278525/artifacts/3342932208

jkroepke avatar Jun 17 '25 07:06 jkroepke

Thank you for the quick reply.

I'll test several times with different collectors enabled and the debug log level, and share the results with you in a week or two.

mdraevich avatar Jun 17 '25 20:06 mdraevich

Could someone check whether the build from #2098

https://github.com/prometheus-community/windows_exporter/actions/runs/16035426302/artifacts/3453077335

resolves this issue as well?

jkroepke avatar Jul 03 '25 06:07 jkroepke

Sorry for the late reply — I’ve been testing different cases. Yes, the latest build completely solves the issue with the non-working config. Thank you very much!

mdraevich avatar Jul 10 '25 11:07 mdraevich