Failed to start up the cluster when upgrading to 1.13.0
Overview of the Issue
Upgrading from 1.12.3 to 1.13.0 leads to a failure to start the cluster.
Reproduction Steps
Steps to reproduce this issue:
- We have a 3-node cluster
- Upgrade nodes to 1.13.0
- The nodes won't start up due to an error (see log fragments below)
Consul info for both Client and Server
Client info
$ consul info
Error querying agent: Get "https://localhost:8501/v1/agent/self": dial tcp [::1]:8501: connect: connection refused
Server info
$ consul info
Error querying agent: Get "https://localhost:8501/v1/agent/self": dial tcp [::1]:8501: connect: connection refused
Operating system and Environment details
Centos 7 LXC running on Proxmox.
Linux consul-01.mysite 5.11.22-5-pve #1 SMP PVE 5.11.22-10 (Tue, 28 Sep 2021 08:15:41 +0200) x86_64 x86_64 x86_64 GNU/Linux
Log Fragments
Aug 10 07:40:59 consul-01.mysite consul[1683882]: ==> Starting Consul agent...
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Version: '1.13.0'
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Build Date: '2022-08-09 18:28:11 +0000 UTC'
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Node ID: '1aaaf50c-d91e-8d64-ec70-ed9d68341ff3'
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Node name: 'consul-01'
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Datacenter: 'mysite' (Segment: '<all>')
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Server: true (Bootstrap: false)
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: 8501, gRPC: 8502, DNS: 8600)
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Cluster Addr: 192.168.10.11 (LAN: 8301, WAN: 8302)
Aug 10 07:40:59 consul-01.mysite consul[1683882]: Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: true
Aug 10 07:40:59 consul-01.mysite consul[1683882]: ==> Log data will now stream in as it occurs:
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN] agent: bootstrap_expect > 0: expecting 3 servers
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN] agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.489Z [INFO] agent.server.raft: starting restore from snapshot: id=1651-7004807-1660117093745 last-index=7004807 last-term=1651 size-in-bytes=149375
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.489Z [INFO] agent.server.raft: snapshot restore progress: id=1651-7004807-1660117093745 last-index=7004807 last-term=1651 size-in-bytes=149375 read-bytes=70 percent-complete=0.05%
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.490Z [ERROR] agent.server.raft: failed to restore snapshot: id=1651-6988421-1660047942935 last-index=6988421 last-term=1651 size-in-bytes=149004 error="object missing primary index"
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.490Z [INFO] agent.server: shutting down server
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.491Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"
@tuxillo, I couldn't reproduce the error, probably because I have a very small snapshot from v1.12.3:
2022-08-10T22:31:56.355-0400 [INFO] agent.server.raft: starting restore from snapshot: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564
2022-08-10T22:31:56.360-0400 [INFO] agent.server.raft: snapshot restore progress: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564 read-bytes=8564 percent-complete=100.00%
2022-08-10T22:31:56.360-0400 [INFO] agent.server.raft: restored from snapshot: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564
- Will you be able to restart the server using the old version of Consul, v1.12.3?
- What is the output of consul snapshot inspect <snap-file>? The snapshot file should be located in the Consul data dir, e.g., /path-to-data-dir/raft/snapshots/2-22-1660185101472/state.bin
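For reference, a rough sketch of finding that file on disk (the paths below are examples only; substitute your configured data_dir):
$ ls /path-to-data-dir/raft/snapshots/
# each snapshot is a <term>-<index>-<timestamp> directory containing a state.bin
$ consul snapshot inspect /path-to-data-dir/raft/snapshots/2-22-1660185101472/state.bin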
Seeing the same (repeatably). Downgrade to 1.12.3 works fine.
# consul snapshot inspect /opt/consul/raft/snapshots/1598-16767572-1660192717741/state.bin
ID 1598-16767572-1660192717741
Size 5882700
Index 16767572
Term 1598
Version 1
Type Count Size
---- ---- ----
KVS 3331 5.4MB
Register 173 220.3KB
ACLToken 25 10.7KB
ACLPolicy 10 3.2KB
Index 81 2.8KB
CoordinateBatchUpdate 14 2.3KB
Tombstone 11 1.7KB
ConnectCA 1 1.2KB
ConnectCAProviderState 1 1.2KB
Session 3 586B
ConfigEntry 1 353B
FederationState 2 306B
SystemMetadata 3 215B
ConnectCAConfig 1 205B
Autopilot 1 199B
ServiceVirtualIP 2 114B
FreeVirtualIP 1 33B
ChunkingState 1 12B
---- ---- ----
Total 5.6MB
With the perhaps slightly more useful log message: [ERROR] agent.server.raft: failed to restore snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 error="object missing primary index"
Full startup logs below.
Aug 11 07:52:03 consulserver.example.com systemd[1]: Started Consul Agent.
Aug 11 07:52:03 consulserver.example.com consul[268773]: ==> Starting Consul agent...
Aug 11 07:52:03 consulserver.example.com consul[268773]: Version: '1.13.0'
Aug 11 07:52:03 consulserver.example.com consul[268773]: Build Date: '2022-08-09 18:28:11 +0000 UTC'
Aug 11 07:52:03 consulserver.example.com consul[268773]: Node ID: '8bcdd91a-4810-60ab-b2a3-f127b32dfc69'
Aug 11 07:52:03 consulserver.example.com consul[268773]: Node name: 'consulserver'
Aug 11 07:52:03 consulserver.example.com consul[268773]: Datacenter: 'dc1' (Segment: '<all>')
Aug 11 07:52:03 consulserver.example.com consul[268773]: Server: true (Bootstrap: false)
Aug 11 07:52:03 consulserver.example.com consul[268773]: Client Addr: [127.0.0.1] (HTTP: -1, HTTPS: 8500, gRPC: 8502, DNS: 8600)
Aug 11 07:52:03 consulserver.example.com consul[268773]: Cluster Addr: 81.187.154.149 (LAN: 8301, WAN: 8302)
Aug 11 07:52:03 consulserver.example.com consul[268773]: Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: true
Aug 11 07:52:03 consulserver.example.com consul[268773]: ==> Log data will now stream in as it occurs:
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.648Z [WARN] agent: bootstrap_expect > 0: expecting 3 servers
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.662Z [WARN] agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.670Z [INFO] agent.server.raft: starting restore from snapshot: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.679Z [INFO] agent.server.raft: snapshot restore progress: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700 read-bytes=62 percent-complete=0.00%
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.680Z [ERROR] agent.server.raft: failed to restore snapshot: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700 error="object missing primary index"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.680Z [INFO] agent.server.raft: starting restore from snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO] agent.server.raft: snapshot restore progress: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 read-bytes=62 percent-complete=0.00%
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [ERROR] agent.server.raft: failed to restore snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 error="object missing primary index"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO] agent.server: shutting down server
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO] agent: Exit code: code=1
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Failed with result 'exit-code'.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Scheduled restart job, restart counter is at 6.
Aug 11 07:52:03 consulserver.example.com systemd[1]: Stopped Consul Agent.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Start request repeated too quickly.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Failed with result 'exit-code'.
Aug 11 07:52:03 consulserver.example.com systemd[1]: Failed to start Consul Agent.
Just re-checked and no additional logs are available with level set to debug.
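(For anyone else re-checking: a sketch of how the level was raised, assuming a standard -config-dir setup; the drop-in path below is only an example.)
# Consul loads every *.hcl file in its -config-dir, so a one-line drop-in works
$ echo 'log_level = "DEBUG"' | sudo tee /etc/consul.d/zz-debug.hcl
$ sudo systemctl restart consul
$ journalctl -u consul.service -f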
I can confirm that we had to downgrade to 1.12.3 to bring back the cluster. In another cluster of ours, we left two nodes on 1.12.3 and one on 1.13.0, and the 1.13.0 node doesn't boot.
Here's the inspect of the snapshot that fails to restore:
$ consul snapshot inspect /var/consul/raft/snapshots/1651-6973752-1659985741788/state.bin
ID 1651-6973752-1659985741788
Size 148533
Index 6973752
Term 1651
Version 1
Type Count Size
---- ---- ----
Register 72 58.6KB
KVS 5 50.1KB
ACLToken 33 14.5KB
ACLPolicy 23 8.7KB
ConfigEntry 13 4KB
ConnectCA 1 2.7KB
ConnectCAProviderState 1 1.9KB
CoordinateBatchUpdate 11 1.8KB
Index 30 1KB
Session 3 612B
FederationState 3 441B
SystemMetadata 3 211B
Autopilot 1 199B
ConnectCAConfig 1 180B
Tombstone 1 89B
ServiceVirtualIP 1 70B
FreeVirtualIP 1 33B
ChunkingState 1 12B
---- ---- ----
Total 145.1KB
Same problem on a 3-node test cluster running on AlmaLinux 8. Downgraded to 1.12.3 to get the agent to start.
FYI - Same here on a single-node dev cluster. Snapshot info:
ID 32-2392224-1659978506821
Size 73385
Index 2392224
Term 32
Version 1
Type Count Size
---- ---- ----
Register 56 60KB
ConfigEntry 13 4.2KB
Index 48 2.1KB
KVS 12 1.6KB
ConnectCA 1 1.2KB
ConnectCAProviderState 1 1.2KB
ServiceVirtualIP 8 471B
SystemMetadata 3 215B
ConnectCAConfig 1 205B
Autopilot 1 199B
CoordinateBatchUpdate 1 174B
FederationState 1 139B
FreeVirtualIP 1 33B
ChunkingState 1 12B
---- ---- ----
Total 71.7KB
@MikeN123, thanks so much for sharing; it's really useful. To help troubleshoot this issue, do you mind sharing the agent configuration and the steps to reproduce?
Here's mine
data_dir = "/opt/consul/data"
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
advertise_addr = "10.99.5.18"
ports {
dns = 8600
http = 8500
https = 8501
grpc = 8502
serf_lan = 8301
serf_wan = 8302
server = 8300
sidecar_min_port = 21000
sidecar_max_port = 21255
expose_min_port = 21500
expose_max_port = 21755
}
retry_join = [
"nomad1.example.org",
"nomad2.example.org",
"nomad3.example.org",
]
server = true
bootstrap_expect = 3
performance {
raft_multiplier = 1
}
encrypt = "<hidden>"
ui_config {
enabled = true
}
recursors = [
"10.99.5.1",
]
telemetry {
prometheus_retention_time = "1h"
}
connect {
enabled = true
}
log_level = "INFO"
node_meta {
}
Started by this systemd unit
[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/opt/consul/etc/consul.hcl
[Service]
Type=notify
EnvironmentFile=-/opt/consul/etc/consul.env
User=consul
Group=consul
ExecStart=/opt/consul/bin/consul agent -config-dir=/opt/consul/etc/
ExecReload=/bin/kill --signal HUP $MAINPID
SuccessExitStatus=1
Restart=on-failure
RestartSec=2
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Nothing special to reproduce: just run a Consul cluster on 1.12.3 (installed and bootstrapped with 1.12.3), swap the binary for the 1.13.0 version, and restart the service.
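Roughly, the swap looks like this (binary paths and version names are placeholders; adjust to your layout):
$ sudo systemctl stop consul
$ sudo cp consul_1.13.0 /opt/consul/bin/consul    # copy the 1.12.3 binary back instead to downgrade
$ sudo systemctl start consul
$ journalctl -u consul.service -e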
Config here (1.12.3 works fine, 1.13.0 does not start) - almost completely default except for enabling Connect.
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
ui = true
server = true
bootstrap_expect = 1
connect {
enabled = true
}
ports {
grpc = 8502
}
bind_addr = "{{ GetInterfaceIP \"ens192\" }}"
Steps to reproduce: no idea; I just upgraded the node, which has been running across several versions and never had issues with upgrades.
Maybe it's related to Connect, as all installs above seem to have that in common?
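(For what it's worth, a rough way to check whether an agent has Connect proxies registered; the address, scheme, and missing ACL token below are assumptions, so adjust for your setup.)
$ curl -s http://127.0.0.1:8500/v1/agent/services | jq '[.[] | select(.Kind == "connect-proxy")] | length'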
We have found the root cause and are working on a fix. Thanks to all for helping troubleshoot the issue!
Thank you everyone for the reports. We narrowed down the issue to a breaking change that has been addressed in 1.13.1.
This bug in 1.13.0 affects those upgrading from Consul 1.11+ AND using Connect service mesh (i.e., with Connect proxies registered). Everyone is advised to upgrade to 1.13.1 instead.