
Windows exporter stopped working after a few minutes

Open ajarillo1981 opened this issue 7 months ago • 40 comments

Current Behavior

Windows exporter stops working after a few minutes, never more than 10. It happens whether windows_exporter runs as a service or is started manually. I have enabled debug logging and there is no error; it simply stops producing logs and the URL http://localhost:9182/metrics becomes inaccessible. The service stays in the started state but cannot be stopped (only by finding the PID and killing the process with "taskkill /f").

I have also tried running with only a single collector, with the same result. The same thing happens on both servers that we need to monitor.

Is there any other test I can do to see why windows_exporter freezes?

Expected Behavior

The windows_exporter application should keep providing metrics and http://localhost:9182/metrics should remain accessible.

Steps To Reproduce


Environment

  • windows_exporter Version: 0.30.6
  • Windows Server Version: 2016

windows_exporter logs

time=2025-05-07T12:33:46.611Z level=DEBUG source=collect.go:220 msg="collector memory succeeded after 5.9986ms, resulting in 35 metrics"
time=2025-05-07T12:33:46.612Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 6.9981ms, resulting in 609 metrics"
time=2025-05-07T12:33:46.613Z level=DEBUG source=collect.go:220 msg="collector net succeeded after 7.9976ms, resulting in 27 metrics"
time=2025-05-07T12:33:46.628Z level=DEBUG source=collect.go:220 msg="collector service succeeded after 23.0823ms, resulting in 2872 metrics"
time=2025-05-07T12:34:01.605Z level=DEBUG source=collect.go:220 msg="collector cs succeeded after 0s, resulting in 3 metrics"
time=2025-05-07T12:34:01.606Z level=DEBUG source=collect.go:220 msg="collector os succeeded after 1.0137ms, resulting in 11 metrics"
time=2025-05-07T12:34:01.608Z level=DEBUG source=service.go:282 msg="failed collecting service config" collector=service err="El sistema no puede encontrar el archivo especificado." service=CloudAvUpdaterEx
time=2025-05-07T12:34:01.609Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9898ms, resulting in 609 metrics"
time=2025-05-07T12:34:01.610Z level=DEBUG source=collect.go:220 msg="collector logical_disk succeeded after 4.9873ms, resulting in 68 metrics"
time=2025-05-07T12:34:01.611Z level=DEBUG source=collect.go:220 msg="collector system succeeded after 5.9913ms, resulting in 9 metrics"
time=2025-05-07T12:34:01.612Z level=DEBUG source=collect.go:220 msg="collector memory succeeded after 7.0304ms, resulting in 35 metrics"
time=2025-05-07T12:34:01.612Z level=DEBUG source=collect.go:220 msg="collector physical_disk succeeded after 7.0304ms, resulting in 24 metrics"
time=2025-05-07T12:34:01.614Z level=DEBUG source=collect.go:220 msg="collector net succeeded after 8.9903ms, resulting in 27 metrics"
time=2025-05-07T12:34:01.631Z level=DEBUG source=collect.go:220 msg="collector service succeeded after 25.9959ms, resulting in 2872 metrics"
time=2025-05-07T12:34:16.605Z level=DEBUG source=collect.go:220 msg="collector cs succeeded after 0s, resulting in 3 metrics"

Anything else?

No response

ajarillo1981 avatar May 12 '25 07:05 ajarillo1981

Start the exporter from an admin CMD/PowerShell and watch how it behaves.

Some log messages cannot be delivered to the Windows Event Log because of how Event Logs work.

jkroepke avatar May 12 '25 07:05 jkroepke

I have started the exporter from an admin PowerShell and the behaviour is the same. Now I see the "collector succeeded" entries in the Application log of Event Viewer, then suddenly the events stop appearing and the URL http://localhost:9182/metrics becomes unavailable. No other trace. The app is still running and visible in Task Manager, but it does nothing. Do you have any other clue?

ajarillo1981 avatar May 13 '25 06:05 ajarillo1981

Could you try with --log.level=debug?

jkroepke avatar May 15 '25 09:05 jkroepke

Debug was already configured in config.yaml. However, I have also tried --log.level=debug directly on the command line and in PowerShell, and the result is exactly the same: it collects metrics and they are available at http://localhost:9182/metrics for Prometheus, but after a few minutes there are no more logs and the HTTP endpoint is no longer available, with no log explaining why it stopped working. This is happening on both servers that I manage.

ajarillo1981 avatar May 16 '25 07:05 ajarillo1981

If it stops working, does the program terminate?

Could you reduce --collectors.enabled to a minimum, e.g. --collectors.enabled=cpu? Maybe one specific collector is the faulty one.

Additionally, please add --debug.enabled, open http://localhost:9182/debug/pprof/goroutine and post the content here.

jkroepke avatar May 16 '25 08:05 jkroepke

Tested with the proposed parameters, just one collector and debug enabled, and it still stops working after a couple of minutes:

C:\Users\admin2\Desktop>windows_exporter-0.30.6-amd64.exe --collectors.enabled=cpu --log.level=debug --debug.enabled
time=2025-05-16T09:30:09.278Z level=DEBUG source=main.go:153 msg="logging has Started"
time=2025-05-16T09:30:12.271Z level=INFO source=main.go:265 msg="Running as HYPER1\admin2"
time=2025-05-16T09:30:12.271Z level=INFO source=main.go:185 msg="Enabled collectors: cpu"
time=2025-05-16T09:30:12.272Z level=INFO source=main.go:203 msg="starting windows_exporter in 2.9950023s" version=0.30.6 branch=HEAD revision=db60c78f32185083354f16fb9d534a021f0d85f9 goversion=go1.23.4 builddate=20250406-11:43:27 maxprocs=32
time=2025-05-16T09:30:12.274Z level=INFO source=tls_config.go:347 msg="Listening on" address=[::]:9182
time=2025-05-16T09:30:12.274Z level=INFO source=tls_config.go:350 msg="TLS is disabled." http2=false address=[::]:9182
time=2025-05-16T09:30:15.433Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.9971ms, resulting in 609 metrics"
time=2025-05-16T09:30:30.429Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.0006ms, resulting in 609 metrics"
time=2025-05-16T09:30:45.430Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9971ms, resulting in 609 metrics"
time=2025-05-16T09:31:00.431Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 5.0098ms, resulting in 609 metrics"
time=2025-05-16T09:31:15.430Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9869ms, resulting in 609 metrics"
time=2025-05-16T09:31:20.229Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9981ms, resulting in 609 metrics"
time=2025-05-16T09:31:30.428Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.9951ms, resulting in 609 metrics"
time=2025-05-16T09:31:40.322Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 4.0001ms, resulting in 609 metrics"
time=2025-05-16T09:31:41.504Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.001ms, resulting in 609 metrics"
time=2025-05-16T09:31:42.386Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.0132ms, resulting in 609 metrics"
time=2025-05-16T09:31:43.168Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.0001ms, resulting in 609 metrics"
time=2025-05-16T09:31:43.856Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 4.0025ms, resulting in 609 metrics"
time=2025-05-16T09:31:45.009Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9947ms, resulting in 609 metrics"
time=2025-05-16T09:31:45.429Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9971ms, resulting in 609 metrics"
time=2025-05-16T09:31:45.696Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.0024ms, resulting in 609 metrics"
time=2025-05-16T09:31:46.497Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9986ms, resulting in 609 metrics"
time=2025-05-16T09:31:47.408Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 4.0367ms, resulting in 609 metrics"
time=2025-05-16T09:31:48.161Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.0025ms, resulting in 609 metrics"
time=2025-05-16T09:31:48.944Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.9971ms, resulting in 609 metrics"
time=2025-05-16T09:31:50.448Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.0494ms, resulting in 609 metrics"
time=2025-05-16T09:31:51.345Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9624ms, resulting in 609 metrics"
time=2025-05-16T09:31:52.130Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 4.0015ms, resulting in 609 metrics"
time=2025-05-16T09:31:52.928Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3ms, resulting in 609 metrics"
time=2025-05-16T09:31:53.696Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9971ms, resulting in 609 metrics"
time=2025-05-16T09:31:54.817Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.0015ms, resulting in 609 metrics"
time=2025-05-16T09:31:57.410Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9497ms, resulting in 609 metrics"
time=2025-05-16T09:31:58.112Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.9927ms, resulting in 609 metrics"
time=2025-05-16T09:31:58.817Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 4.0001ms, resulting in 609 metrics"
time=2025-05-16T09:31:59.552Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.9981ms, resulting in 609 metrics"
time=2025-05-16T09:32:00.337Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 3.9624ms, resulting in 609 metrics"
time=2025-05-16T09:32:00.427Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.0503ms, resulting in 609 metrics"
time=2025-05-16T09:32:01.054Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 1.9985ms, resulting in 609 metrics"
time=2025-05-16T09:32:15.427Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.9971ms, resulting in 609 metrics"
time=2025-05-16T09:32:30.427Z level=DEBUG source=collect.go:220 msg="collector cpu succeeded after 2.9986ms, resulting in 609 metrics"

I cannot upload the downloaded file here, so I am sharing it via WeTransfer: https://we.tl/t-oCcjq878cV

Any help would be appreciated.

ajarillo1981 avatar May 16 '25 09:05 ajarillo1981

Sorry, I need the text output of /debug/pprof/goroutine?debug=1

jkroepke avatar May 17 '25 11:05 jkroepke

Output pasted:

goroutine.txt

Thanks for the effort to solve the issue.

ajarillo1981 avatar May 19 '25 10:05 ajarillo1981

Is that output from that path from after it has stopped working?

jkroepke avatar May 19 '25 13:05 jkroepke

That was the output at startup. Find below the output after it has failed:

goroutine2.txt

ajarillo1981 avatar May 20 '25 06:05 ajarillo1981

Thanks, that helps a bit.

Are you using the collect[] parameter when calling the metrics endpoint?

And could you gather the output twice, about 10 seconds apart?

jkroepke avatar May 20 '25 09:05 jkroepke

This is the config that I have in Prometheus:

  - job_name: hyper-1
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    enable_http2: true
    static_configs:
      - targets:
          - hyper1:9182

I have been trying to collect the two outputs as requested, but the HTTP server also crashes and I cannot get them. Do you have any idea where the issue could be, or anything else to try in order to find the cause of the crash?

Thank you for your help.

ajarillo1981 avatar May 21 '25 09:05 ajarillo1981

What do you mean by crash? Is there a log?

jkroepke avatar May 21 '25 10:05 jkroepke

I mean there are no logs at all: the "collector cpu succeeded" entries stop appearing, with no error in the cmd debug output. The URLs http://localhost:9182/debug/pprof/goroutine?debug=1 and http://localhost:9182/metrics stop replying when I access them from a web browser. It is as if the app is still running but has crashed. Everything works for a few minutes, then it crashes.

ajarillo1981 avatar May 21 '25 10:05 ajarillo1981

From the output you provided, I found two suspicious entries:

1 @ 0xe8fade 0xe7d3ce 0xe9a398 0xef89c5 0xf1e8c5 0xf1e055 0xf271f1 0xf271e9 0xfcd844 0xfd553f 0xfd31e4 0x1507ae5 0x1507a03 0x1506f7a 0xe9eb21
#	0xe9a397	syscall.Syscall6+0x37											C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/syscall_windows.go:463
#	0xef89c4	syscall.WriteConsole+0x84										C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/syscall/zsyscall_windows.go:1149
#	0xf1e8c4	internal/poll.(*FD).writeConsole+0x524									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_windows.go:741
#	0xf1e054	internal/poll.(*FD).Write+0x234										C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/internal/poll/fd_windows.go:676
#	0xf271f0	os.(*File).write+0x50											C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/os/file_posix.go:46
#	0xf271e8	os.(*File).Write+0x48											C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/os/file.go:195
#	0xfcd843	log/slog.(*commonHandler).handle+0x6a3									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/handler.go:313
#	0xfd553e	log/slog.(*TextHandler).Handle+0x3e									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/text_handler.go:95
#	0xfd31e3	log/slog.(*Logger).logAttrs+0x1e3									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/logger.go:277
#	0x1507ae4	log/slog.(*Logger).LogAttrs+0xa84									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/logger.go:195
#	0x1507a02	github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector+0x9a2	D:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:220
#	0x1506f79	github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectAll.func1+0x99	D:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:64
1 @ 0xe9626e 0xe74552 0xe74526 0xe97405 0xeb9d3d 0xfcd7e5 0xfcd631 0xfd553f 0xfd31e4 0x1507ae5 0x1507a03 0x1506f7a 0xe9eb21
#	0xe97404	sync.runtime_SemacquireMutex+0x24									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/runtime/sema.go:95
#	0xeb9d3c	sync.(*Mutex).lockSlow+0x15c										C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/sync/mutex.go:173
#	0xfcd7e4	sync.(*Mutex).Lock+0x644										C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/sync/mutex.go:92
#	0xfcd630	log/slog.(*commonHandler).handle+0x490									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/handler.go:311
#	0xfd553e	log/slog.(*TextHandler).Handle+0x3e									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/text_handler.go:95
#	0xfd31e3	log/slog.(*Logger).logAttrs+0x1e3									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/logger.go:277
#	0x1507ae4	log/slog.(*Logger).LogAttrs+0xa84									C:/Users/runneradmin/go/pkg/mod/golang.org/[email protected]/src/log/slog/logger.go:195
#	0x1507a02	github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectCollector+0x9a2	D:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:220
#	0x1506f79	github.com/prometheus-community/windows_exporter/pkg/collector.(*Collection).collectAll.func1+0x99	D:/a/windows_exporter/windows_exporter/pkg/collector/collect.go:64

In this case, the program is blocked on a lock because it cannot write output to the console. This can happen if you start selecting output in the Windows console, e.g. in quick-edit or mark/select mode.

Any difference if you set --log.file=log.txt?
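
If quick-edit/mark mode really is what blocks the console writes, another possible workaround would be to clear that mode on the console input handle before logging starts. A rough, untested sketch of what such a helper could look like in Go — the flag values are the documented Windows console-mode constants, and the helper name is made up:

package main

import (
	"log"

	"golang.org/x/sys/windows"
)

// Documented Windows console-mode flags (see SetConsoleMode in the Win32 docs).
const (
	enableQuickEditMode = 0x0040
	enableExtendedFlags = 0x0080
)

// disableQuickEdit clears quick-edit/mark mode on the console input handle so
// that selecting text in the window can no longer block writes to stdout.
func disableQuickEdit() error {
	h, err := windows.GetStdHandle(windows.STD_INPUT_HANDLE)
	if err != nil {
		return err
	}

	var mode uint32
	if err := windows.GetConsoleMode(h, &mode); err != nil {
		return err
	}

	mode &^= enableQuickEditMode // drop quick-edit
	mode |= enableExtendedFlags  // required for the change to be honored

	return windows.SetConsoleMode(h, mode)
}

func main() {
	if err := disableQuickEdit(); err != nil {
		log.Println("could not change console mode:", err)
	}
}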

jkroepke avatar May 21 '25 15:05 jkroepke

Find the log file attached; there is no difference: log.txt

I have tried several older versions of windows_exporter and the behaviour is the same in all of them. Is there anything else I can do to find the cause and fix the issue?

ajarillo1981 avatar May 22 '25 15:05 ajarillo1981

Any 3rd party virus scanner on that system?

What about --log.level=ERROR?

jkroepke avatar May 22 '25 15:05 jkroepke

I have completely disabled the antivirus and windows_exporter fails in the same way after a few minutes. I have tried with --log.level=ERROR and there are no logs at all.

Please, let me know what else I can do.

ajarillo1981 avatar May 23 '25 06:05 ajarillo1981

I have tried with --log.level=ERROR and there are no logs at all.

I still need the output of http://localhost:9182/debug/pprof/goroutine?debug=1 in that scenario, while windows_exporter is unresponsive.

jkroepke avatar May 23 '25 06:05 jkroepke

I have been trying for a couple of hours to get the output when it crashes, but after the app fails, the URL http://localhost:9182/debug/pprof/goroutine?debug=1 no longer loads. Find below the last output I could get before the crash.

goroutine3.txt

ajarillo1981 avatar May 23 '25 10:05 ajarillo1981

I have installed another Windows Server 2016 on a spare computer inside the same domain, with the same GPOs and apps, and it works as expected without any crash of the windows_exporter app. I really need to monitor my two servers with windows_exporter. Any other ideas on how to troubleshoot the issue?

ajarillo1981 avatar May 28 '25 11:05 ajarillo1981

I finally managed to get goroutine dumps after the failure. Find attached two of them, taken approximately 10 seconds apart:

goroutine5.txt goroutine4.txt

ajarillo1981 avatar May 28 '25 12:05 ajarillo1981

I got the info that the output from http://localhost:9182/debug/pprof/goroutine?debug=2 (note: debug=2) reports more extensive debug information that might be useful in this context.

Could you provide that as well?

jkroepke avatar May 29 '25 12:05 jkroepke

I have created a script that curls the URL every 5 seconds, with debug=2, and saves the response to a file. Find below the last two outputs from two failures. After the failure, in these cases, the URL stops being accessible.
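
Roughly, the script does the equivalent of the following (simplified sketch in Go rather than the actual curl loop; the URL, interval and file naming are the ones described above):

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	const url = "http://localhost:9182/debug/pprof/goroutine?debug=2"

	for {
		resp, err := http.Get(url)
		if err != nil {
			// This is what happens once the exporter hangs: the request fails
			// and no more dumps can be collected.
			fmt.Println("request failed:", err)
		} else {
			name := "output" + time.Now().Format("2006-01-02T15_04_05") + ".txt"
			if f, ferr := os.Create(name); ferr == nil {
				io.Copy(f, resp.Body)
				f.Close()
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second)
	}
}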

output2025-05-30T07_22_15.txt output2025-05-30T07_22_20.txt output2025-05-30T08_29_41.txt output2025-05-30T08_29_36.txt

I hope you can find a thread to pull on. Thanks a lot for your help.

ajarillo1981 avatar May 30 '25 11:05 ajarillo1981

Hm, so the outputs from just before the crash are not very useful.

Here is a customized binary: https://github.com/prometheus-community/windows_exporter/actions/runs/15346243893/artifacts/3228189181

It produces a dump when you press CTRL+C on the console.
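
Conceptually the change is just an interrupt handler that writes the full goroutine profile before exiting; something along these lines (simplified sketch, not the actual patch):

package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
)

func main() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt)

	go func() {
		<-c
		// Same information as /debug/pprof/goroutine?debug=2, but written to
		// stderr so it is available even when the HTTP endpoint is hung.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		os.Exit(1)
	}()

	// ... the rest of the exporter would run here ...
	select {}
}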

jkroepke avatar May 30 '25 12:05 jkroepke

I have tested the binary and it generates the output in the console if I press Ctrl+C while the app is working, but it doesn't do anything after the crash. Any other idea?

ajarillo1981 avatar Jun 02 '25 07:06 ajarillo1981

No, no more ideas.

I guess something must be faulty on your system for unknown reasons, since this kind of report is unique.

jkroepke avatar Jun 02 '25 08:06 jkroepke

I have exactly the same problem. It works on two identical servers, but on the third one it crashes for no reason. There is no third-party software; the servers are identical to each other.

iBelow avatar Jun 19 '25 11:06 iBelow

Could you explain how it looks in detail?

jkroepke avatar Jun 19 '25 15:06 jkroepke

Also, is the CPU usage high when the crash appears? What is the memory usage?

I guess the only way to really debug this kind of issue is using the Go debugger delve. I don't have much experience using delve as a standalone binary, since my editor has delve integrated.

If the problem occurs on a non-production system, maybe give me a hint.

To run the debugger, the Go toolchain must be installed first:

Download Go: https://go.dev/doc/install

Delve installation: https://github.com/go-delve/delve/blob/master/Documentation/installation/README.md

Then, run go install github.com/go-delve/delve/cmd/dlv@latest to install the dlv executable. It should appear under %USERPROFILE%/go/bin.
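
Once dlv is installed, attaching to the hung process with dlv attach <PID> and then running the goroutines and stack commands inside the debugger should show where every goroutine is blocked (rough outline; I have not verified the exact workflow on Windows Server 2016).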

jkroepke avatar Jun 20 '25 12:06 jkroepke