
Service start timeout inside container servercore:ltsc2019 since v0.17.0

Open gillg opened this issue 3 years ago • 9 comments

Hello,

While testing an upgrade, I discovered that no version >= 0.17.0 is able to start as a service inside a container. If I launch it from the CLI it works, but as a service it fails.

I strongly suspect https://github.com/prometheus-community/windows_exporter/pull/863: even though IsWindowsService is the recommended practice, it seems to have been rewritten fairly recently in the Go codebase because of bugs.

So a Go update might solve the issue; otherwise, a workaround to "force" service mode should be considered. There is a similar trick in the OpenTelemetry Collector project: https://github.com/open-telemetry/opentelemetry-collector/blob/7ed3f75ef84d9e9d11b175a0859060f765faca0b/docs/troubleshooting.md#startup-failing-in-windows-docker-containers and it is used here: https://github.com/open-telemetry/opentelemetry-collector/blob/4439e9b49c4de55bdc050ee4928b5b0c79c317cb/cmd/builder/internal/builder/templates/main_windows.go.tmpl#L32

gillg avatar Feb 21 '22 21:02 gillg

I strongly suspect #863: even though IsWindowsService is the recommended practice, it seems to have been rewritten fairly recently in the Go codebase because of bugs.

Do you know which Go release(s) contain these rewrites? I think it would be worth testing newer Go versions to see if this is fixed. If not, checking an environment variable, similar to the opentelemetry-collector link you provided, would be an option.

breed808 avatar Mar 12 '22 22:03 breed808

I would say, why not test the latest Go release? I found the commits some time ago, but they are probably spread across several releases. Any concerns about using the latest Go version?

gillg avatar Mar 12 '22 22:03 gillg

I don't have a problem with that, but it should be tested against this issue.

I'm not able to run the container with my current setup, would you be OK to build the image using the latest Golang version and test?

breed808 avatar Mar 12 '22 22:03 breed808

@breed808 I can test it, but I haven't managed to build it so far. Are there any prerequisites to install before running the makefile?

gillg avatar Mar 23 '22 08:03 gillg

You'll need promu installed to build the executable via the makefile. Building the image will also require Docker or a substitute like Podman.

breed808 avatar Apr 06 '22 10:04 breed808

I've tried to build it with the latest Go version, v1.18.1, but I'm still not able to run it inside Docker. Here is the repo: https://github.com/LoSunny/windows_exporter I've modified the build script in the GitHub Actions workflow, as those actions won't complete with the default settings. The build artifacts can be downloaded here: https://github.com/LoSunny/windows_exporter/releases/tag/vtest The error is the same as in issue #962 mentioned above.

LoSunny avatar May 08 '22 03:05 LoSunny

I've tried to build it with the latest Go version, v1.18.1, but I'm still not able to run it inside Docker.

To clarify, is the exporter unable to run as a service or via a CLI command?

breed808 avatar May 18 '22 08:05 breed808

via cli:

C:>windows_exporter.exe
time="2022-05-25T15:26:21Z" level=fatal msg="CreateObject SWbemLocator error: Invalid class string" source="exporter.go:254"

I tried running it in the powershell container (lts-nanoserver-1809) with versions down to 0.14, and unfortunately the error is the same in every case. I run it from the CLI with .\windows_exporter.exe

hpoznanski avatar May 25 '22 15:05 hpoznanski

Hello,

Sorry for the long delay... Using version v0.19.0, I can launch it from the CLI without any problem.

PS C:\ChocoTests> & '.\windows_exporter_amd64.exe'
time="2022-09-21T09:59:59+02:00" level=warning msg="No where-clause specified for service collector. This will generate a very large number of metrics!" source="service.go:48"
time="2022-09-21T09:59:59+02:00" level=info msg="Running as User Manager\\ContainerAdministrator" source="exporter.go:355"
time="2022-09-21T09:59:59+02:00" level=warning msg="Running as a preconfigured Windows Container user. This may mean you do not have Windows HostProcess containers configured correctly and some functionality will not work as expected." source="exporter.go:357"
time="2022-09-21T09:59:59+02:00" level=info msg="Enabled collectors: logical_disk, net, os, service, system, textfile, cpu, cs" source="exporter.go:360" 
time="2022-09-21T09:59:59+02:00" level=info msg="Starting windows_exporter (version=0.19.0, branch=heads/tags/v0.19.0, revision=752d467b123798309c5a57c8b7d47267f2f46565)" source="exporter.go:412"
time="2022-09-21T09:59:59+02:00" level=info msg="Build context (go=go1.18.3, user=runneradmin@fv-az282-285, date=20220723-09:43:37)" source="exporter.go:413"
time="2022-09-21T09:59:59+02:00" level=info msg="Starting server on :9182" source="exporter.go:416"
time="2022-09-21T09:59:59+02:00" level=info msg="TLS is disabled." source="gokit_adapter.go:38"

But if I start the installed service, the process seems to crash on startup:

The Windows_Exporter service failed to start due to the following error: %%1053
A timeout was reached (30000 milliseconds) while waiting for the Windows_Exporter service to connect.

@hpoznanski Using a "nanoserver" image is probably not a good idea, and in my experience Windows version 1809 is far from perfect. The good containers from Microsoft start at ltsc2019.

gillg avatar Sep 21 '22 08:09 gillg

@gillg is the container and/or node under load when starting as a service? While unlikely, it could be related to the timeout issue in #551.

breed808 avatar Sep 23 '22 21:09 breed808

@breed808 not at all, I just manually start a vanilla container, then download and run win exporter in it. From the CLI there is no issue; as a service, it times out (because the exporter never really starts).

gillg avatar Sep 24 '22 09:09 gillg

Thanks for the info. We may need to document the issue with running the service in a container; I don't use Windows containers so I wouldn't be able to debug this issue.

breed808 avatar Oct 02 '22 22:10 breed808

Ah! I thought it was related to the Go framework itself, but while working on something else I just discovered that it's part of the "x/sys" package: https://pkg.go.dev/golang.org/x/sys/windows/svc So, because we are targeting a 0.0.0-snapshot version.... we should bump the dependency to v0.1.0!

I bet it will solve the issue! :)

gillg avatar Oct 18 '22 20:10 gillg

@breed808 I can finally take some time to look deeper. For now my tests are not great, but at least I'm able to build and launch win exporter in a container! I'll keep you posted.

gillg avatar Oct 18 '22 21:10 gillg

OK.... I thought IsWindowsService() was used, but I just discovered I was completely off track! It has been completely removed by https://github.com/prometheus-community/windows_exporter/pull/1046 @jammiemil any thoughts on whether the current issue is related to your change?

To summarize: if you launch win_exporter inside a container from the CLI it works, but if you launch it as a service it crashes, times out, or never starts (hard to say precisely).

EDIT: my bad, your PR was never merged, but you have a commit on master... https://github.com/prometheus-community/windows_exporter/commit/a5f22ebb04cfea8c0b8132b4f87d194c2e6a5aab

gillg avatar Oct 18 '22 21:10 gillg

Have you tried v0.20? It should include my change, which attempts to work around the issue you seem to be describing by 'starting' the Windows service as early as possible rather than waiting for all the dependencies to load.

On a 'typical' Windows server this was happening because there were not enough resources (CPU) to load the dependencies within the 30s timeout for a Windows service, so you may have been hitting a similar issue in containers?
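
As a rough illustration of the 'start as early as possible' idea, here is a minimal Go sketch. It is a simulation only and makes several assumptions: the `state` type, the `serve` function and the channel standing in for the service manager are all hypothetical, and no Windows API is called. It only shows the ordering, which is to answer the service manager first and run the slow initialisation afterwards.

```go
package main

import (
	"fmt"
	"time"
)

// state is a hypothetical stand-in for the status a service reports
// to the Windows service manager.
type state string

const (
	startPending state = "start-pending"
	running      state = "running"
)

// serve reports to the (simulated) service manager before doing any slow
// work, then reports "running" once initialisation has finished.
func serve(reports chan<- state, slowInit func()) {
	reports <- startPending // answer first, well inside the 30s window
	slowInit()              // e.g. loading collectors and other dependencies
	reports <- running
	close(reports)
}

func main() {
	reports := make(chan state, 2)
	start := time.Now()
	// The sleep simulates dependencies that are slow to load.
	go serve(reports, func() { time.Sleep(200 * time.Millisecond) })
	for s := range reports {
		fmt.Printf("%-13s after %v\n", s, time.Since(start).Round(10*time.Millisecond))
	}
}
```

The first report arrives almost immediately, regardless of how long `slowInit` takes. This does not help if the process itself needs more than 30 seconds before any of its code runs.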

jammiemil avatar Oct 18 '22 21:10 jammiemil

#551 contains a lot of the background on this.

jammiemil avatar Oct 18 '22 21:10 jammiemil

Awesome @jammiemil, I also face that issue on regular hosts! Also, when I stop the service, it sometimes seems to still be running, probably because it was detected as crashed and relaunched instead of simply stopped. I'm currently playing on the master branch, and I tried to bump /x/sys to 0.1.0 instead of 0.0.0-snapshot. But nothing is better at this stage... So I'm on a v0.20+.

Moreover, I have the feeling we never enter the init() function of the initiate package, because I don't see any log like "Checking if We are a service".

gillg avatar Oct 18 '22 21:10 gillg

Yeah, I saw that happen occasionally even with the rejigged init that tries to start ASAP. Ultimately, the workaround I put in place stops a good chunk of the startup failures, but it can still happen, because under certain conditions the underlying Go subroutines can take more than 30 seconds to start. As far as I can tell there's pretty much nothing you can do about that in any particular codebase, but I will admit my Go abilities are limited, so I'm very much open to a more robust solution. I know the guys working on Grafana Agent are trying to come up with something a little more robust than my 'fudge' to resolve the same issue in that repo.

jammiemil avatar Oct 18 '22 22:10 jammiemil

Slight correction to my previous comment: the delay is either in the underlying Go subroutines OR in the remaining dependencies like sys.

jammiemil avatar Oct 18 '22 22:10 jammiemil

So! Thanks a lot @jammiemil for cross-referencing the information here! That was useful for me, but not for the bug itself ^^ BUT! I found the solution. The Go libraries have some issues with the Windows world...! It's not the first one I've encountered. The function svc.IsAnInteractiveSession() from x/sys/windows/svc is officially deprecated because it does not work well (an interactive CLI inside a container is detected as non-interactive, for example). But the "correct" function IsWindowsService() does not seem to report that a service is a service when it runs inside a container... (so win exporter fails as a service).

I finally reverted the initiator to use IsAnInteractiveSession() and reused the logic implemented in the OpenTelemetry Collector to provide a way to force interactive mode, if needed, via the env var NO_WINDOWS_SERVICE. I would like to dig deeper to find the root cause in the Go function, but it does not seem easy to test....
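
The decision logic described above could look roughly like the following sketch. It is not the code from the actual change: `runAsService` and its parameters are hypothetical names, the detection call is injected as a plain function so the example runs on any platform, and the rule that any value other than "0" forces interactive mode is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// runAsService reports whether the process should hand control to the
// Windows service manager. detect stands in for the real detection call
// (for example one from golang.org/x/sys/windows/svc); lookupEnv has the
// same shape as os.LookupEnv.
func runAsService(lookupEnv func(string) (string, bool), detect func() (bool, error)) (bool, error) {
	// Assumed semantics: NO_WINDOWS_SERVICE set to anything but "0"
	// forces interactive (CLI) mode and skips detection entirely.
	if v, ok := lookupEnv("NO_WINDOWS_SERVICE"); ok && v != "0" {
		return false, nil
	}
	return detect()
}

func main() {
	// Placeholder detector that always claims to be a service.
	detect := func() (bool, error) { return true, nil }

	asService, err := runAsService(os.LookupEnv, detect)
	if err != nil {
		fmt.Fprintln(os.Stderr, "service detection failed:", err)
		os.Exit(1)
	}
	fmt.Println("run as service:", asService)
}
```

With this shape, a container entrypoint that starts the exporter directly could set NO_WINDOWS_SERVICE=1 to bypass a detection call that gives the wrong answer.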

gillg avatar Oct 19 '22 21:10 gillg

I created an issue for the x/sys lib to track that error: https://github.com/golang/go/issues/56335 The fix seems simple, but needs an external eye.

gillg avatar Oct 19 '22 22:10 gillg

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions[bot] avatar Nov 25 '23 02:11 github-actions[bot]

Hi all.

Any clue how I can mitigate this issue? :)

Many thanks in advance.

dansimov04012022 avatar Jan 14 '24 03:01 dansimov04012022