windows_exporter icon indicating copy to clipboard operation
windows_exporter copied to clipboard

windows_exporter v0.14 don't auto start

Open hieu15 opened this issue 4 years ago • 26 comments

Windows_exporter v0.14 don't auto start service when restart server. I must start service manual.

hieu15 avatar Oct 22 '20 02:10 hieu15

I'm experiencing the same problem. I didn't use any version before 0.14 so I don't know whether this version introduced the problem.

A timeout was reached (30000 milliseconds) while waiting for the "windows_exporter" service to connect. 

followed by

The "windows_exporter" service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.

Unfortunately, the Windows exporter does not log anything at that time, so there is not much to work with. Starting it manually afterwards has always succeeded. This happens consistently on Windows 2012R2 (scheduled restarts over night) while 2016 and 2019 seem to work fine. I will try this on a clean 2012R2.

Might this be a missing dependency?

fischerman avatar Oct 22 '20 08:10 fischerman

@hieu15 Can you verify if you see the same error messages in the event viewer? Otherwise this might be a different problem.

fischerman avatar Oct 22 '20 08:10 fischerman

I can confirm that this happens on a fresh 2012R2

  • Download ISO from https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2012-r2
  • Fresh install on your favorite hypervisor
  • Install Windows exporter
  • The exporter runs fine at this point
  • Restart Windows
  • The exporter is in the "starting" phase for about 30 seconds before the error messages appear in the event viewer

I've tried changing the start up type to "automatic delayed". The exporter starts fine in that case.

fischerman avatar Oct 22 '20 09:10 fischerman

I've tested on a fresh 2016 and it doesn't show any problems, so I've tested on 2012R2 a bit more. This is where things get a bit crazy. Version 0.13 shows the same problems. However, v0.12 works perfectly fine. I've rebooted multiple times, it's always consistent. I did a clean uninstall of the exporter before installing a new one.

According to the release notes the only change in 0.13 is the project name. This seems crazy, but could the name have an effect on the start up order?

fischerman avatar Oct 22 '20 10:10 fischerman

@fischerman any relation to https://github.com/prometheus-community/windows_exporter/issues/544 or https://github.com/prometheus-community/windows_exporter/issues/599 maybe? They all seem to be about not starting exporters (after reboot).

frittentheke avatar Oct 22 '20 11:10 frittentheke

@fischerman That's very odd, but thanks for the exhaustive testing! I just did a check on the full diff between those versions, and there really isn't anything apart from the name changes. There could potentially be some differences coming from the CI build machines updating something (Go compiler could have incremented some patch version, possibly). Would you be able to run a test with the latest version on master? #630 bumped the Go version in order to fix a compiler bug which shouldn't be related to this, but...

carlpett avatar Oct 22 '20 14:10 carlpett

@carlpett I didn't find an installer of master so I installed from 0.14 and swapped the .exe with https://ci.appveyor.com/project/prometheus-community/windows-exporter/builds/35870173/artifacts

As suspected this is not the cause.

@frittentheke I don't think they are related. But it gave me the idea to test 32 bit. Same problem.

fischerman avatar Oct 22 '20 14:10 fischerman

Thanks for checking! Your comment

This seems crazy, but could the name have an effect on the start up order?

... has been bouncing around in the back of my head during the day. And after managing a repro I now have a working theory. Could you test if running sc config windows_exporter depend= wmiApSrv fixes it for you? I can't say this cleanly explains every report, but my thinking goes like this:

  • The service control manager uses lexicographic sorting as a tie-breaker when choosing start order. With the old name, we'd start slightly later.
  • During startup, we make two calls that I think could potentially hang if some dependent service isn't started: https://github.com/prometheus-community/windows_exporter/blob/7eab1fc4118ad31365d06cde6020f5a80a88767e/exporter.go#L316 https://github.com/prometheus/common/blob/cfeb6f9992ffa54aaa4f2170ade4067ee478b250/log/eventlog_formatter.go#L51

Some reasons it could work on other OSes would be that they have something earlier in the startup phase which starts whatever it is we need to be started.

I'm not 100% confident the wmiApSrv is the right dependency, or if it just happens to initialize something as a side effect. It'd had been a much neater explanation if it was also in the Auto start group (it's manual), and if it'd stayed running (on my repro-machine at least it stops quite quickly). Reason I point to the logging initialization is that I can't seem to get anything to log even if I throw it before initWbem, so we must get stuck earlier. It seems a bit odd that logging wouldn't work, though.

But would be super interesting to see if this fixes it for you as well!

(As a side note, renaming the service to eg xwindows_exporter also seems to fix it on my VM)

carlpett avatar Oct 22 '20 21:10 carlpett

I've added the dependency and did 3 restarts. The service came up fine every time. After adding the dependency, the WMI Performance adapter stays running in my case, possibly indefinitely (also in manual). However, in the field I see the latter service constantly flapping.

If the initialization of the event logger is the problem can we temporarily write the error message into a file?

fischerman avatar Oct 26 '20 07:10 fischerman

Thanks for the test! It's a bit unsatisfying to not really understand why, but since it does seem to fix the problem, I think we should probably add it to the next release. The current logger implementation doesn't have a file logger, but we could probably add one fairly easy.

carlpett avatar Nov 01 '20 12:11 carlpett

I did a little bit more testing. I manually wrote stuff in log files. In all the test cases, no log file was written during Windows start up, but were showing up fine when starting the service afterwards. The only exception was a hello world program (built with promu) with which I replaced the windows exporter. That one always produced logs, so filesystem is generally available.

I believe the problem occurs before the main function, hence in one of init functions. I had my eye on the MapCounterToIndex which calls the perflib library internally (the perflib library may panic) but it appears that the problem is even before that (no call to registerCollectors occurs). The only init function in the project that is special is cpu.go, but even after removing the version check, no log is produced. So my best guess is that there is a panic in the init function of one of the dependencies.

I also wrapped the exporter in nssm to get the stderr, but that one starts fine...

fischerman avatar Nov 02 '20 13:11 fischerman

@fischerman I had this same issue with windows_exporter v0.13 on Win2012R2. I found two culprits which contributed to the issue.

  1. Set the windows_exporter service startup to Automatic (Delayed Start).
  2. Apply all Windows Update patches on Win2012R2 In most cases #1 is all it took, but I also discovered stability issues if #2 had not been done. I was unable to determine which patch was affecting windows_exporter, but since fully patching things have worked fine.

pandatracs avatar Nov 05 '20 16:11 pandatracs

Delayed start is also my current workaround. I'm still hoping we can find the root cause. However, if my suspicion is correct and an init function of a dependency causes the crash it will be hard to discover. One idea I just thought of is to generate a core dump using GOTRACEBACK=crash.

fischerman avatar Nov 05 '20 18:11 fischerman

I also have this issue. Thank you @fischerman for all the testing! I was going to suggest moving the windows service handler stuff to the beginning of main() as I believe it is the control messages from the service that makes Windows understand that the service is there. But, if it really crashes before main(), then it is no use.

One idea I just thought of is to generate a core dump using GOTRACEBACK=crash.

Good idea! I think Go doesn't yet produce minidumps without patching: https://github.com/golang/go/issues/20498

It could perhaps be possible to use GOTRACEBACK=crash (or =system) and in some way get stderr redirected to a file to see the tracebacks. I know there are wrappers like winsw that can do that, but it sounds like quite a hassle.

antong avatar Nov 25 '20 13:11 antong

I experience the issue too but what's interesting is that the service crashed within a second but the log message show that it tried for 30 seconds.

A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

The windows_exporter service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

Update: I could only get it fixed by restarting the vm.

basroovers avatar Dec 17 '20 07:12 basroovers

@basroovers Which version do you use? How do you know that the service crashed after one second?

fischerman avatar Dec 17 '20 07:12 fischerman

@fischerman I have hundreds of instances running on v0.15.0.

Most of these didn't autostart after the last Windows update cycle (reboots). One of them couldn't even get the service started and needed the reboot

Here is an export of the Windows event log of that VM:

Level,Date and Time,Source,Event ID,Task Category
Error,17-12-2020 08:19:45,Service Control Manager,7000,None,"The windows_exporter service failed to start due to the following error: 
The service did not respond to the start or control request in a timely fashion."
Error,17-12-2020 08:19:45,Service Control Manager,7009,None,A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

basroovers avatar Dec 17 '20 08:12 basroovers

The problem occured yesterday on a 2016 server with 0.14. System has now been configured to use the delayed start.

EmmaTinten avatar Dec 18 '20 12:12 EmmaTinten

I also have same issue from time to time using or not "delayed start" – it doesn't matter. And I have an idea about this "time to time". I found that exporter's start failure is more likely to occur in case of reboot after installed updates. Taking into account that process of installing updates continues after restart I think that exporter can not start because some exporter's dependencies are still updating or even not yet started.

Exporter didn't start tonight after automatic reboot in reason of installing updates for Windows Server 2019 Std, but later, after manual reboot it started well.

Updates were installed: February 9, 2021-KB4601887 Cumulative Update for .NET Framework 3.5, 4.7.2 and 4.8 for Windows 10, version 1809 and Windows Server, version 2019 http://support.microsoft.com/kb/4601887 February 9, 2021—KB4601345 (OS Build 17763.1757) https://support.microsoft.com/help/4601345 KB4601393: Servicing stack update for Windows 10, version 1809: February 9, 2021 https://support.microsoft.com/help/4601393

Hope it helps.

olvsa avatar Feb 12 '21 05:02 olvsa

We've updated to v0.16.0, but we experienced the same thing again last weekend. Several Windows updates were installed after which the windows_exporter service was not running.

Same messages:

The windows_exporter service failed to start due to the following error: 
The service did not respond to the start or control request in a timely fashion.

A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

Perhaps good to check if this is a VMware specific issue? We are running VMware.

basroovers avatar Apr 19 '21 13:04 basroovers

I updated to 0.16.0 right before the April update cycle in our test environment. I also modified the service to Automatic (Delayed Start). Results were mixed. 2008R2 and 2019 did not start windows_exporter after the automatic patch and restart 2012R2 and 2016 servers started windows_exporter successfully after the automatic patch and restart

Tim

On Mon, Apr 19, 2021 at 8:31 AM Bas Roovers @.***> wrote:

We've updated to v0.16.0, but we experienced the same thing again last weekend. Several Windows updates were installed after which the windows_exporter service was not running.

Same messages:

The windows_exporter service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

Perhaps good to check if this is a VMware specific issue? We are running VMware.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/prometheus-community/windows_exporter/issues/637#issuecomment-822468923, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADUP2G4ZOWPW424LLKO3P2TTJQWEVANCNFSM4S2R7NTA .

-- Tim Biles Sys/Database Design/Admin 3 | ITSS | itss.d.umn.edu Storage Champion | z.umn.edu/scn University of Minnesota Duluth | www.d.umn.edu @.*** | 218-726-6959

tbiles avatar Apr 19 '21 13:04 tbiles

Just to confirm, I ran into this issue with v0.16.0

Exporter failed to start on 2 out of 10 servers after cumulative updates.

  • Server2012R2
  • Vmware virtual machine
  • Service set to Automatic (Delayed Start)

David

davideverall avatar May 12 '21 08:05 davideverall

My test results were slightly different than last month. 2008R2 consistently has failed, 2012R2 and 2016 consistently succeed, 2019 is intermittent. All mine are VSphere VMs, all set to Automatic (Delayed Start).

3 of 10 machines failed to start windows_exporter (1 2008R2, 2 2019). 7 machines started windows exporter (4 2012R2, 1 2016, 2 2019)

Tim

On Wed, May 12, 2021 at 3:54 AM davideverall @.***> wrote:

Just to confirm, I ran into this issue with v0.16.0

Exporter failed to start on 2 out of 10 servers after cumulative updates.

  • Server2012R2
  • Vmware virtual machine
  • Service set to Automatic (Delayed Start)

David

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/prometheus-community/windows_exporter/issues/637#issuecomment-839594662, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADUP2GZQ4VGB73J2QGTF4NLTNI63ZANCNFSM4S2R7NTA .

-- Tim Biles Sys/Database Design/Admin 3 | ITSS | itss.d.umn.edu Storage Champion | z.umn.edu/scn University of Minnesota Duluth | www.d.umn.edu @.*** | 218-726-6959

tbiles avatar May 12 '21 17:05 tbiles

I see this issue frequently, occurring typically when there are alot of Windows updates released and available for install.

After the updates are installed and the server is restarted, if the server is under heavy CPU load whilst the updates are installing, then the Windows_exporter service does not start up.

Usually, it is the Windows Modules Installer Worker(TiWorker.exe) service which is using up high CPU on the server to install the updates. I use mainly Windows Server 2016.

Would it be possible to put some logic in whereby the "Windows_exporter" service is checked every x minutes to check its up and running, and if not, to start it up?

Andy-Techical avatar Dec 21 '21 17:12 Andy-Techical

Just to report it, it seems to be happen with v0.16. still, on server 2019...

Get-Package *exporter*

Name                           Version          Source                           ProviderName                                                                                                                                                                     
                                                                                                                                                                                                                 
----                           -------          ------                           ------------                                                                                                                                                                     
                                                                                                                                                                                                                 
windows_exporter               0.16.0   
Get-ComputerInfo | select WindowsProductName, WindowsVersion, OsHardwareAbstractionLayer

WindowsProductName           WindowsVersion OsHardwareAbstractionLayer
------------------           -------------- --------------------------
Windows Server 2019 Standard 1809           10.0.17763.2686   
TimeCreated                     Id LevelDisplayName Message                                                                                                                                                                                                       
                                                                                                                                                                                                                 
-----------                     -- ---------------- -------                                                                                                                                                                                                       
                                                                                                                                                                                                                 
13/04/2022 09:02:53           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                       
                                                                                                                                                                                                                 
13/04/2022 03:35:04           7000 Error            The windows_exporter service failed to start due to the following error: ...                                                                                                                                  
                                                                                                                                                                                                                 
13/04/2022 03:35:04           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.                                                                                                         
                                                                                                                                                                                                                 
12/04/2022 13:23:43           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                       
                                                                                                                                                                                                                 
12/04/2022 06:11:32           7036 Information      The windows_exporter service entered the stopped state.                                                                                                                                                       
                                                                                                                                                                                                                 
15/03/2022 11:37:37           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                       
                                                                                                                                                                                                                 
09/03/2022 03:32:47           7000 Error            The windows_exporter service failed to start due to the following error: ...                                                                                                                                  
                                                                                                                                                                                                                 
09/03/2022 03:32:47           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.                                                                                                         
                                                                                                                                                                                                                 
09/03/2022 03:30:42           7036 Information      The windows_exporter service entered the stopped state.                                                                                                                                                       
                                                                                                            

Event at 12/04/2022 13:23:43 was a system start, event at 13/04/2022 09:02:53 is me starting the service. Apparently there were at least one earlier occurance too.

Also it is not isolated, at least one others server shows this behaviour (that one is the same OS version):

TimeCreated                     Id LevelDisplayName Message                                                                                                                                                                                                      
-----------                     -- ---------------- -------                                                                                                                                                                                                      
17/03/2022 10:28:10           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                      
17/03/2022 00:32:36           7000 Error            The windows_exporter service failed to start due to the following error: ...                                                                                                                                 
17/03/2022 00:32:36           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.                                                                                                        
17/03/2022 00:31:40           7036 Information      The windows_exporter service entered the stopped state. 

I realise we do not run the latest release, so if this issue supposed to be fixed by that, can you confirm - maybe point me to the right direction - so we can consider upgrade?

fsemti avatar Apr 13 '22 09:04 fsemti

Hello,

Same issue for me on Windows server 2019:

-> A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.
-> The windows_exporter service failed to start due to the following error: 
The service did not respond to the start or control request in a timely fashion.

Exporter: 0.18.1

time="2022-08-18T08:08:59Z" level=warning msg="No where-clause specified for service collector. This will generate a very large number of metrics!" source="service.go:48"
time="2022-08-18T08:08:59Z" level=info msg="Enabled collectors: net, os, service, system, textfile, cpu, cs, logical_disk" source="exporter.go:348"
time="2022-08-18T08:08:59Z" level=info msg="Starting windows_exporter ( version=0.18.1, branch=heads/tags/v0.18.1, revision=e07b2053af3ca708db3489f41e2fcde9941860f4)" source="exporter.go:400"
time="2022-08-18T08:08:59Z" level=info msg="Build context (go=go1.17.6, user=runneradmin@xxxxx date=20220205-08:18:35)" source="exporter.go:401"
time="2022-08-18T08:08:59Z" level=info msg="Starting server on :9182" source="exporter.go:404"
time="2022-08-18T08:08:59Z" level=info msg="TLS is disabled." source="gokit_adapter.go:38"

darknesssheep avatar Aug 18 '22 08:08 darknesssheep

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions[bot] avatar Nov 25 '23 05:11 github-actions[bot]

This should be resolved as of #1047. Let me know if it's still an issue.

breed808 avatar Nov 25 '23 19:11 breed808