Consul Windows service is failing during restart
Overview of the Issue
Consul agent is running on Windows Server 2016 as Windows service. After configuration changes the service needs to be restarted and if I do it natively via Services -> Consul -> restart it is failed.
Reproduction Steps
Steps to reproduce this issue, eg:
- Run consul agent as windows service
$serviceName = 'consul'
$bin_dir = "C:\consul\bin"
$config_dir = "C:\consul\config"
$log_dir = "C:\consul\log"
$data_dir = "C:\consul\data"
$params = @{
Name = "$serviceName"
BinaryPathName = "$($bin_dir)\consul.exe agent -data-dir=$($data_dir) -config-dir=$($config_dir) -log-file=$($log_dir)\ -log-rotate-bytes=100000000 -log-rotate-duration=24h -log-rotate-max-files=7"
DisplayName = "$serviceName"
StartupType = "Automatic"
Description = "Hashicorp Consul Windows Service"
}
New-Service @params
Write-Host "Installed service: $serviceName"
Start-Service "$serviceName"
- Make a change to a config or just restart Consul service:
Restart-Service consul
- View error
Restart-Service : Failed to start service 'consul (consul)'.
At line:1 char:1
+ Restart-Service consul
+ ~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OpenError: (System.ServiceProcess.ServiceController:ServiceController) [Restart-Service]
, ServiceCommandException
+ FullyQualifiedErrorId : StartServiceFailed,Microsoft.PowerShell.Commands.RestartServiceCommand
Consul info for both Client and Server
Client info
PS C:\Windows\system32> C:\consul\bin\consul.exe info
agent:
check_monitors = 0
check_ttls = 0
checks = 1
services = 2
build:
prerelease =
revision = 12f574c9
version = 1.8.1
consul:
acl = disabled
known_servers = 3
server = false
runtime:
arch = amd64
cpu_count = 2
goroutines = 52
max_procs = 2
os = windows
version = go1.14.6
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 6
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 5356
members = 4
query_queue = 0
query_time = 1
Server info
consul info
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 0
build:
prerelease =
revision = 12f574c9
version = 1.8.1
consul:
acl = enabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = fakeip:8300
server = true
raft:
applied_index = 3755454
commit_index = 3755454
fsm_pending = 0
last_contact = 83.075522ms
last_log_index = 3755454
last_log_term = 9
last_snapshot_index = 3744132
last_snapshot_term = 9
latest_configuration = [{Suffrage:Voter ID:45f0fead-cc21-aa1e-f4f0-801c495a3ee2 Address:fakeip:8300} {Suffrage:Voter ID:272743f3-df2c-5e6f-edba-ccca1a6acc21 Address:fakeip:8300} {Suffrage:Voter ID:591f1237-7faf-1d64-d6cd-095eae492760 Address:fakeip:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 9
runtime:
arch = amd64
cpu_count = 2
goroutines = 89
max_procs = 2
os = linux
version = go1.14.6
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 6
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 5356
members = 4
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1472
members = 3
query_queue = 0
query_time = 1
Operating system and Environment details
Client: Windows Server 2019 10.0.17763.0 Server: Fedora CoreOS 5.12.7-300.fc34.x86_64
Log Fragments
interesting thing is that I have 2 log files, seems that consul try to start before the first it is stopped fully.
one log file, called consul-1627373028824598600.log
2021-07-27T08:53:26.165+0100 [INFO] agent.client: shutting down client
2021-07-27T08:53:26.165+0100 [ERROR] agent: Error starting agent: error="Failed to start Consul client: Failed to start lan serf: Failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on "0.0.0.0" port 8301: listen tcp 0.0.0.0:8301: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted."
2021-07-27T08:53:26.166+0100 [INFO] agent: Exit code: code=1
another one, called consul-1627373037004623800.log
2021-07-27T08:53:03.673+0100 [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2021-07-27T08:53:04.443+0100 [INFO] agent: Synced node info
2021-07-27T08:53:04.470+0100 [INFO] agent: Synced service: service=kitchen
2021-07-27T08:53:04.478+0100 [INFO] agent: Synced service: service=kitchen-from-api
2021-07-27T08:53:24.323+0100 [INFO] agent: Caught: signal=interrupt
2021-07-27T08:53:24.330+0100 [INFO] agent: Gracefully shutting down agent...
2021-07-27T08:53:24.330+0100 [INFO] agent.client: client starting leave
2021-07-27T08:53:24.502+0100 [INFO] agent.client.serf.lan: serf: EventMemberLeave: fake_fqdn fakeip
2021-07-27T08:53:27.700+0100 [INFO] agent: Graceful exit completed
2021-07-27T08:53:27.700+0100 [INFO] agent: Requesting shutdown
2021-07-27T08:53:27.700+0100 [INFO] agent.client: shutting down client
2021-07-27T08:53:27.700+0100 [INFO] agent.client.manager: shutting down
2021-07-27T08:53:27.701+0100 [INFO] agent: consul client down
2021-07-27T08:53:27.701+0100 [INFO] agent: shutdown complete
2021-07-27T08:53:27.702+0100 [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=tcp
2021-07-27T08:53:27.702+0100 [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=udp
2021-07-27T08:53:27.702+0100 [INFO] agent: Stopping server: protocol=HTTP address=127.0.0.1:8500 network=tcp
2021-07-27T08:53:27.702+0100 [INFO] agent: Waiting for endpoints to shut down
2021-07-27T08:53:27.702+0100 [INFO] agent: Endpoints down
2021-07-27T08:53:27.702+0100 [INFO] agent: Exit code: code=0
Consul Client config
{
"node_name": "fake_fqdn",
"client_addr": "127.0.0.1",
"advertise_addr": "fake_ip",
"datacenter": "eu-west-1",
"retry_join": ["provider=aws region=eu-west-1 tag_key=fake_tag_key tag_value=fake_tag_value"],
"leave_on_terminate": true,
"disable_update_check": true,
"log_level": "INFO",
"telemetry": {
"prometheus_retention_time": "10m",
"disable_hostname": true
},
"acl": {
"tokens": {
"agent": "fake_token"
}
}
}
It is worth to mention that issue happens very often (it happened 4 times from 5 restarts).
Hi @ksandrmatveyev,
Thanks for the detailed info - I'll see if I can replicate this on my machine and go from there.
In the meantime, have you found a workaround that is acceptable for now (even if not ideal)? Ideas that come to mind:
- Instead of using
Restart-Service, does it work reliably if you doStop-Service, thenStart-Sleep(maybe for 5-10 seconds, based on your logs), thenStart-Service? - Restart the Windows server running the client when the configuration changes? (much slower restarts than the above workaround, but doesn't depend on figuring out the right sleep duration to use)
I wouldn't expect delayed start to have an effect on a service restart. My understanding is that "delayed start" only affects service start order on boot, not the time delay between Stop-Service and Start-Service (when using Restart-Service).
Yes, forgot to update. Workaround is pretty easy:
Restart-Service consul
Start-Sleep 5
Start-Service consul
Unfortunately, Server restart doesn't help.
P.S. About "Delayed Start", just mentioned in case of any proposals, and seems that I confuse people, sorry. Has been removed
I'm glad you have a suitable workaround for now. Thanks for the update!
Hi,
Are there any updates on this?
@jkirschner-hashicorp

Error in Event Log:
Event 7034
The consul service terminated unexpectedly. It has done this 2 time(s).
C:\Windows\system32>sc query consul
SERVICE_NAME: consul
TYPE : 10 WIN32_OWN_PROCESS
STATE : 4 RUNNING
(STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
I'm presuming that Consul service is trying to start while it is not stopped completely.
Apparently this is still an issue with Consul v1.19