consul icon indicating copy to clipboard operation
consul copied to clipboard

windows service health check

Open rismoney opened this issue 3 years ago • 13 comments

Agents should have a native health check to determine if a service is running. Many services dont have network listeners and handle either outbound, or local service capability. Avoiding process creation/kill of calling a cmd sc/pwsh cmdlet to check the existence of a service would be great to just handle in native go.

I cant imagine this to be a big ask, as other health checks are seemingly simplistic functions.

rismoney avatar Jul 19 '21 02:07 rismoney

Hi @rismoney,

It sounds like there's a service with no network listeners that you want to have health checks on. Rather than using a Script + Interval check, you'd prefer a native check to see if the service/process is running.

If I've understood correctly (let me know), I have a few follow-up questions:

  • Is a Script + Interval check what you're using instead today? If so, what are you hoping would improve/change for your use case as a result of switching to a native check?
  • What configuration do you imagine passing in? (e.g., service name, interval, anything else?)
  • What information do you imagine the health status (passing, warning, critical) being based on? (e.g., based on ServiceController.Status, or something else)

jkirschner-hashicorp avatar Jul 30 '21 01:07 jkirschner-hashicorp

It sounds like there's a service with no network listeners that you want to have health checks on. Rather than using a Script Interval check, you'd prefer a native check to see if the service/process is running.

Yes.

Is a Script + Interval check what you're using instead today? If so, what are you hoping would improve/change for your use case as a result of switching to a native check?

This is what we would use. I am running very latency sensitive financial transaction applications. Spawning to cmd/powershell would get expensive as the service count were to rise invoking a get-service/sc or similar operation. We are trying to do service inventory on 500 nodes with 25-50 custom services per node, and would not want the health check to be shell/script based with another process invocation. Also my understanding is any approach that uses wmi calls isn't going to scale well (the prometheus exporter kind of taught that lesson)

What configuration do you imagine passing in? (e.g., service name, interval, anything else?)

Yes, unfortunately the only thing we will know is it's name and that it's running so those two attributes are nearly all we have. That itself would be the health check. I know it's kind of primitive, The applications with listeners are also fairly sensitive, and we don't want to interrogate them by port. They have their own built in health capabilities, that essentially will down themselves if unhealthy.

What information do you imagine the health status

I think it's fairly boolean. running or not. I suppose there are some edge cases around "starting, stopping, etc) but I'd punt on those, and assume anything not running is unhealthy. Those are typically backoff timer based, and will timeout/succeed eventually.

While my initial use case is windows, I would imagine similar capabilities for systemd services would be a feature on parity with this.

I am relatively new to exploring consul, and thought this capability might open up a huge monitoring integration for us, where we can dynamically determine how to monitor things based on what services are active where in the environment. If we had to powershell/sc to get it for each service it would potentially be super expensive as the service count rises.

rismoney avatar Jul 30 '21 17:07 rismoney

@rismoney :

We are trying to do service inventory on 500 nodes with 25-50 custom services per node

Are all of the 25-50 services per node unique services? Or do you have multiple instances of some services on a single node?

I ask because if service name is the identifying information used for the health check, I imagine that would be problematic with multiple instances of a service on a single node.

jkirschner-hashicorp avatar Aug 04 '21 20:08 jkirschner-hashicorp

The services are uniquely named and require a directory with the associated app/config in each dir. They could be the same app, but their config/name is unique. I am not sure windows even allows duplicate service names on the same node.

To clarify 2 services named: myco-svctype-svcname-001 would reside in C:\myservice\myco-svctype-svcname-001 myco-svctype-svcname-002 would reside in C:\myservice\myco-svctype-svcname-002

The underlying appcode might be the same, but are redundantly stored on the filesystem and have different associated conf files. At no point would the service names not be unique. Also I don't think you can have 2 windows services that point to the same path\to\filename.exe ... So there is that too.

So duplication of services wouldn't occur on the same node, but could occur across nodes (parallel stuff). So myco-svctype-svcname-001 could be running on 3 nodes.

I hope that clarifies.

rismoney avatar Aug 04 '21 21:08 rismoney

Hello Consul community members,

We would welcome a PR contributed by the community for this enhancement!

If you're interested, please comment here so anyone interested can stay informed. We also recommend sharing a written design approach here before proceeding with implementation to ensure we're aligned from the start.

The approach should ensure the following:

  • The configuration entries for defining the health check must be OS agnostic (e.g., could work for a Windows Service or systemd on Linux), so that this enhancement could be extended to other OSs in the future.
  • The implementation need only support Windows for now.
    • Why? The overhead to create a new process on Windows seems to be much, much higher than it is on Linux (via fork), so the need is greatest on Windows.
    • The documentation should explain the workaround to get the same functionality on Linux (i.e., write a script that does the check and use local script checks)

Implementing automated tests for this may not be straightforward, as (to the best of my knowledge) we don't have unit tests that run specifically on Windows within Circle CI right now. The HashiCorp team can discuss that with whomever takes on this enhancement.

jkirschner-hashicorp avatar Aug 09 '21 22:08 jkirschner-hashicorp

Hi there, I bumped into this issue and I would like to contribute. I am a huge Hashicorp fan and shamelessly wannabe Hashicorpian too 😄

That said, I wanted to try something "odd", considering that this issue requires sharing a design beforehand. I tried following the RFC template and I came up with this public gist:

https://gist.github.com/deblasis/b6dd645a8bd5e2b3c9b95bb30bbb238c

Any feedback is more than welcome.

Cheers

deblasis avatar Dec 01 '21 13:12 deblasis

Hi @deblasis,

Thank you so much for writing this up - and for taking the initiative to find and apply our RFC template!

The engineering team will take a look soon and provide feedback directly on the Gist (likely in the next 1-2 weeks). In the meantime, I might provide some feedback on UX considerations (also on the Gist).

jkirschner-hashicorp avatar Dec 07 '21 04:12 jkirschner-hashicorp

Hi @jkirschner-hashicorp, you are welcome! I tried, hopefully, it's easier this way.

Thank you for the update. It sounds fantastic! Take your time. Best

deblasis avatar Dec 07 '21 14:12 deblasis

The team has blocked off some time next week to go through your RFC! We appreciate this initiative and will get back to you with valuable feedback

kisunji avatar Dec 15 '21 18:12 kisunji

any update or blockers on moving this forward?

rismoney avatar May 29 '22 22:05 rismoney

Hi @rismoney, it is completely my fault I guess. The team provided feedback, we are aligned with the design and there are no blockers. I just had to write the relevant code but in the meantime, I started working on a big project and I didn't manage to follow up with my OSS commitments. Sorry about that. I booked some time to look at this over the weekend. Stay tuned!

deblasis avatar May 31 '22 08:05 deblasis

Hi folks, I have some news, I have updated the Gist (PRD) for the issue and you can see the diff here.

I have also written some code in #13388 and I would like some feedback and some direction regarding your preferred way of mocking syscalls. I could attempt doing it "my own way", but I'd rather stay consistent if you guys have a pattern that's already used elsewhere and would like to be replicated here.

deblasis avatar Jun 07 '22 17:06 deblasis

Hey everyone, @deblasis has a PR implementing Windows service health checks: https://github.com/hashicorp/consul/pull/13388

If anyone would like to try it out and give feedback, that would be much appreciated.

kisunji avatar Sep 16 '22 19:09 kisunji

@rismoney : This has been closed by #13388 thanks to the hard work of @deblasis!

The functionality will first be available in the 1.14.0 release.

jkirschner-hashicorp avatar Oct 17 '22 22:10 jkirschner-hashicorp

amazing work, from original documentation to merging!!! kudos and @deblasis would be an asset to any org!

rismoney avatar Oct 18 '22 02:10 rismoney

Thank you @rismoney ! In hindsight, it took me way too long to stay on top of this given work commitments and I hate unfinished business but I am very happy to see my code finally merged upstream again. Always nice to be working with you guys, I always learn something new🙏

Farewell, until next time :)

deblasis avatar Oct 18 '22 07:10 deblasis