consul icon indicating copy to clipboard operation
consul copied to clipboard

Consul Windows service is failing during restart

Open ksandrmatveyev opened this issue 4 years ago • 6 comments

Overview of the Issue

Consul agent is running on Windows Server 2016 as Windows service. After configuration changes the service needs to be restarted and if I do it natively via Services -> Consul -> restart it is failed.

Reproduction Steps

Steps to reproduce this issue, eg:

  1. Run consul agent as windows service

$serviceName = 'consul'
$bin_dir = "C:\consul\bin"
$config_dir = "C:\consul\config"
$log_dir = "C:\consul\log"
$data_dir = "C:\consul\data"

$params = @{
  Name = "$serviceName"
  BinaryPathName = "$($bin_dir)\consul.exe agent -data-dir=$($data_dir) -config-dir=$($config_dir) -log-file=$($log_dir)\ -log-rotate-bytes=100000000 -log-rotate-duration=24h -log-rotate-max-files=7"
  DisplayName = "$serviceName"
  StartupType = "Automatic"
  Description = "Hashicorp Consul Windows Service"
}
New-Service @params

Write-Host "Installed service: $serviceName"
Start-Service "$serviceName"
  1. Make a change to a config or just restart Consul service:
Restart-Service consul
  1. View error
Restart-Service : Failed to start service 'consul (consul)'.
At line:1 char:1
+ Restart-Service consul
+ ~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OpenError: (System.ServiceProcess.ServiceController:ServiceController) [Restart-Service]
   , ServiceCommandException
    + FullyQualifiedErrorId : StartServiceFailed,Microsoft.PowerShell.Commands.RestartServiceCommand

Consul info for both Client and Server

Client info
PS C:\Windows\system32> C:\consul\bin\consul.exe info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 1
        services = 2
build:
        prerelease =
        revision = 12f574c9
        version = 1.8.1
consul:
        acl = disabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 52
        max_procs = 2
        os = windows
        version = go1.14.6
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 6
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5356
        members = 4
        query_queue = 0
        query_time = 1
Server info
consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 12f574c9
        version = 1.8.1
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = fakeip:8300
        server = true
raft:
        applied_index = 3755454
        commit_index = 3755454
        fsm_pending = 0
        last_contact = 83.075522ms
        last_log_index = 3755454
        last_log_term = 9
        last_snapshot_index = 3744132
        last_snapshot_term = 9
        latest_configuration = [{Suffrage:Voter ID:45f0fead-cc21-aa1e-f4f0-801c495a3ee2 Address:fakeip:8300} {Suffrage:Voter ID:272743f3-df2c-5e6f-edba-ccca1a6acc21 Address:fakeip:8300} {Suffrage:Voter ID:591f1237-7faf-1d64-d6cd-095eae492760 Address:fakeip:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 9
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 89
        max_procs = 2
        os = linux
        version = go1.14.6
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 6
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5356
        members = 4
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1472
        members = 3
        query_queue = 0
        query_time = 1

Operating system and Environment details

Client: Windows Server 2019 10.0.17763.0 Server: Fedora CoreOS 5.12.7-300.fc34.x86_64

Log Fragments

interesting thing is that I have 2 log files, seems that consul try to start before the first it is stopped fully.

one log file, called consul-1627373028824598600.log

2021-07-27T08:53:26.165+0100 [INFO]  agent.client: shutting down client
2021-07-27T08:53:26.165+0100 [ERROR] agent: Error starting agent: error="Failed to start Consul client: Failed to start lan serf: Failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on "0.0.0.0" port 8301: listen tcp 0.0.0.0:8301: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted."
2021-07-27T08:53:26.166+0100 [INFO]  agent: Exit code: code=1

another one, called consul-1627373037004623800.log

2021-07-27T08:53:03.673+0100 [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2021-07-27T08:53:04.443+0100 [INFO]  agent: Synced node info
2021-07-27T08:53:04.470+0100 [INFO]  agent: Synced service: service=kitchen
2021-07-27T08:53:04.478+0100 [INFO]  agent: Synced service: service=kitchen-from-api
2021-07-27T08:53:24.323+0100 [INFO]  agent: Caught: signal=interrupt
2021-07-27T08:53:24.330+0100 [INFO]  agent: Gracefully shutting down agent...
2021-07-27T08:53:24.330+0100 [INFO]  agent.client: client starting leave
2021-07-27T08:53:24.502+0100 [INFO]  agent.client.serf.lan: serf: EventMemberLeave: fake_fqdn fakeip
2021-07-27T08:53:27.700+0100 [INFO]  agent: Graceful exit completed
2021-07-27T08:53:27.700+0100 [INFO]  agent: Requesting shutdown
2021-07-27T08:53:27.700+0100 [INFO]  agent.client: shutting down client
2021-07-27T08:53:27.700+0100 [INFO]  agent.client.manager: shutting down
2021-07-27T08:53:27.701+0100 [INFO]  agent: consul client down
2021-07-27T08:53:27.701+0100 [INFO]  agent: shutdown complete
2021-07-27T08:53:27.702+0100 [INFO]  agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=tcp
2021-07-27T08:53:27.702+0100 [INFO]  agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=udp
2021-07-27T08:53:27.702+0100 [INFO]  agent: Stopping server: protocol=HTTP address=127.0.0.1:8500 network=tcp
2021-07-27T08:53:27.702+0100 [INFO]  agent: Waiting for endpoints to shut down
2021-07-27T08:53:27.702+0100 [INFO]  agent: Endpoints down
2021-07-27T08:53:27.702+0100 [INFO]  agent: Exit code: code=0

Consul Client config

{
  "node_name": "fake_fqdn",
  "client_addr": "127.0.0.1",
  "advertise_addr": "fake_ip",
  "datacenter": "eu-west-1",
  "retry_join": ["provider=aws region=eu-west-1 tag_key=fake_tag_key tag_value=fake_tag_value"],
  "leave_on_terminate": true,
  "disable_update_check": true,
  "log_level": "INFO",
  "telemetry": {
    "prometheus_retention_time": "10m",
    "disable_hostname": true
  },
  "acl": {
    "tokens": {
      "agent": "fake_token"
    }
  }
}

ksandrmatveyev avatar Jul 27 '21 08:07 ksandrmatveyev

It is worth to mention that issue happens very often (it happened 4 times from 5 restarts).

ksandrmatveyev avatar Jul 27 '21 08:07 ksandrmatveyev

Hi @ksandrmatveyev,

Thanks for the detailed info - I'll see if I can replicate this on my machine and go from there.

In the meantime, have you found a workaround that is acceptable for now (even if not ideal)? Ideas that come to mind:

  • Instead of using Restart-Service, does it work reliably if you do Stop-Service, then Start-Sleep (maybe for 5-10 seconds, based on your logs), then Start-Service?
  • Restart the Windows server running the client when the configuration changes? (much slower restarts than the above workaround, but doesn't depend on figuring out the right sleep duration to use)

I wouldn't expect delayed start to have an effect on a service restart. My understanding is that "delayed start" only affects service start order on boot, not the time delay between Stop-Service and Start-Service (when using Restart-Service).

jkirschner-hashicorp avatar Jul 29 '21 00:07 jkirschner-hashicorp

Yes, forgot to update. Workaround is pretty easy:

Restart-Service consul
Start-Sleep 5
Start-Service consul

Unfortunately, Server restart doesn't help.

P.S. About "Delayed Start", just mentioned in case of any proposals, and seems that I confuse people, sorry. Has been removed

ksandrmatveyev avatar Aug 01 '21 20:08 ksandrmatveyev

I'm glad you have a suitable workaround for now. Thanks for the update!

jkirschner-hashicorp avatar Aug 04 '21 20:08 jkirschner-hashicorp

Hi, Are there any updates on this? @jkirschner-hashicorp image image

Error in Event Log:

Event 7034
The consul service terminated unexpectedly.  It has done this 2 time(s).
C:\Windows\system32>sc query consul

SERVICE_NAME: consul
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

I'm presuming that Consul service is trying to start while it is not stopped completely.

ksandrmatveyev avatar Nov 16 '22 15:11 ksandrmatveyev

Apparently this is still an issue with Consul v1.19

SDedik avatar Mar 07 '25 05:03 SDedik