
Failed to start up the cluster when upgrading to 1.13.0

tuxillo opened this issue 2 years ago

Overview of the Issue

Upgrading from 1.12.3 to 1.13.0 leads to a failure to start the cluster.

Reproduction Steps

Steps to reproduce this issue:

  1. We have a 3-node cluster
  2. Upgrade nodes to 1.13.0
  3. The nodes won't start up due to an error (see log fragments below)

Consul info for both Client and Server

Client info
$ consul info
Error querying agent: Get "https://localhost:8501/v1/agent/self": dial tcp [::1]:8501: connect: connection refused
Server info
$ consul info
Error querying agent: Get "https://localhost:8501/v1/agent/self": dial tcp [::1]:8501: connect: connection refused

Operating system and Environment details

Centos 7 LXC running on Proxmox.

Linux consul-01.mysite 5.11.22-5-pve #1 SMP PVE 5.11.22-10 (Tue, 28 Sep 2021 08:15:41 +0200) x86_64 x86_64 x86_64 GNU/Linux

Log Fragments

Aug 10 07:40:59 consul-01.mysite consul[1683882]: ==> Starting Consul agent...
Aug 10 07:40:59 consul-01.mysite consul[1683882]:            Version: '1.13.0'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:         Build Date: '2022-08-09 18:28:11 +0000 UTC'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:            Node ID: '1aaaf50c-d91e-8d64-ec70-ed9d68341ff3'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:          Node name: 'consul-01'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:         Datacenter: 'mysite' (Segment: '<all>')
Aug 10 07:40:59 consul-01.mysite consul[1683882]:             Server: true (Bootstrap: false)
Aug 10 07:40:59 consul-01.mysite consul[1683882]:        Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: 8501, gRPC: 8502, DNS: 8600)
Aug 10 07:40:59 consul-01.mysite consul[1683882]:       Cluster Addr: 192.168.10.11 (LAN: 8301, WAN: 8302)
Aug 10 07:40:59 consul-01.mysite consul[1683882]:            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: true
Aug 10 07:40:59 consul-01.mysite consul[1683882]: ==> Log data will now stream in as it occurs:
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.489Z [INFO]  agent.server.raft: starting restore from snapshot: id=1651-7004807-1660117093745 last-index=7004807 last-term=1651 size-in-bytes=149375
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.489Z [INFO]  agent.server.raft: snapshot restore progress: id=1651-7004807-1660117093745 last-index=7004807 last-term=1651 size-in-bytes=149375 read-bytes=70 percent-complete=0.05%
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.490Z [ERROR] agent.server.raft: failed to restore snapshot: id=1651-6988421-1660047942935 last-index=6988421 last-term=1651 size-in-bytes=149004 error="object missing primary index"
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.490Z [INFO]  agent.server: shutting down server
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.491Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"

tuxillo avatar Aug 10 '22 09:08 tuxillo

@tuxillo, I couldn't reproduce the error, probably because I have a very small snapshot from v1.12.3:

2022-08-10T22:31:56.355-0400 [INFO]  agent.server.raft: starting restore from snapshot: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564
2022-08-10T22:31:56.360-0400 [INFO]  agent.server.raft: snapshot restore progress: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564 read-bytes=8564 percent-complete=100.00%
2022-08-10T22:31:56.360-0400 [INFO]  agent.server.raft: restored from snapshot: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564
  • Are you able to restart the server using the old version of Consul, v1.12.3?

  • What is the output of consul snapshot inspect <snap-file>? The snapshot file should be located in the consul data dir, e.g., /path-to-data-dir/raft/snapshots/2-22-1660185101472/state.bin
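Since each server may have several snapshots on disk, it can help to script finding the most recent one before inspecting it. A minimal sketch, assuming the directory naming convention visible above (`<term>-<index>-<timestamp>`, e.g. `2-22-1660185101472`); the helper name is hypothetical:

```python
import os

def latest_snapshot(snapshots_dir):
    """Return the path to state.bin of the highest-index snapshot.

    Snapshot directories are named <term>-<index>-<timestamp>;
    we compare by the index component and skip anything that
    doesn't match that pattern (e.g. temp dirs).
    """
    best, best_index = None, -1
    for name in os.listdir(snapshots_dir):
        parts = name.split("-")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue
        index = int(parts[1])
        if index > best_index:
            best_index = index
            best = os.path.join(snapshots_dir, name, "state.bin")
    return best
```

The returned path can then be fed to `consul snapshot inspect`.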

huikang avatar Aug 11 '22 02:08 huikang

Seeing the same (repeatably). Downgrade to 1.12.3 works fine.

# consul snapshot inspect /opt/consul/raft/snapshots/1598-16767572-1660192717741/state.bin
 ID           1598-16767572-1660192717741
 Size         5882700
 Index        16767572
 Term         1598
 Version      1

 Type                        Count      Size
 ----                        ----       ----
 KVS                         3331       5.4MB
 Register                    173        220.3KB
 ACLToken                    25         10.7KB
 ACLPolicy                   10         3.2KB
 Index                       81         2.8KB
 CoordinateBatchUpdate       14         2.3KB
 Tombstone                   11         1.7KB
 ConnectCA                   1          1.2KB
 ConnectCAProviderState      1          1.2KB
 Session                     3          586B
 ConfigEntry                 1          353B
 FederationState             2          306B
 SystemMetadata              3          215B
 ConnectCAConfig             1          205B
 Autopilot                   1          199B
 ServiceVirtualIP            2          114B
 FreeVirtualIP               1          33B
 ChunkingState               1          12B
 ----                        ----       ----
 Total                                  5.6MB
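For anyone comparing snapshots across versions, the Type/Count/Size table above is straightforward to parse. A rough sketch, assuming the human-readable output format shown here (which could change between releases):

```python
def parse_inspect_types(output):
    """Parse the Type/Count/Size table of `consul snapshot inspect`
    output into {type: (count, size_string)}."""
    rows = {}
    in_table = False
    for line in output.splitlines():
        cols = line.split()
        if cols[:2] == ["Type", "Count"]:
            in_table = True       # table header reached
            continue
        if not in_table or not cols or cols[0] == "----":
            continue              # skip preamble and separator rows
        if cols[0] == "Total":
            break                 # end of the table
        if len(cols) == 3:
            rows[cols[0]] = (int(cols[1]), cols[2])
    return rows
```

This makes it easy to diff, say, the KVS or ConnectCA entries between a snapshot that restores and one that fails.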

With the perhaps slightly more useful log message: [ERROR] agent.server.raft: failed to restore snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 error="object missing primary index"

Full startup logs below.

Aug 11 07:52:03 consulserver.example.com systemd[1]: Started Consul Agent.
Aug 11 07:52:03 consulserver.example.com consul[268773]: ==> Starting Consul agent...
Aug 11 07:52:03 consulserver.example.com consul[268773]:            Version: '1.13.0'
Aug 11 07:52:03 consulserver.example.com consul[268773]:         Build Date: '2022-08-09 18:28:11 +0000 UTC'
Aug 11 07:52:03 consulserver.example.com consul[268773]:            Node ID: '8bcdd91a-4810-60ab-b2a3-f127b32dfc69'
Aug 11 07:52:03 consulserver.example.com consul[268773]:          Node name: 'consulserver'
Aug 11 07:52:03 consulserver.example.com consul[268773]:         Datacenter: 'dc1' (Segment: '<all>')
Aug 11 07:52:03 consulserver.example.com consul[268773]:             Server: true (Bootstrap: false)
Aug 11 07:52:03 consulserver.example.com consul[268773]:        Client Addr: [127.0.0.1] (HTTP: -1, HTTPS: 8500, gRPC: 8502, DNS: 8600)
Aug 11 07:52:03 consulserver.example.com consul[268773]:       Cluster Addr: 81.187.154.149 (LAN: 8301, WAN: 8302)
Aug 11 07:52:03 consulserver.example.com consul[268773]:            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: true
Aug 11 07:52:03 consulserver.example.com consul[268773]: ==> Log data will now stream in as it occurs:
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.648Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.662Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.670Z [INFO]  agent.server.raft: starting restore from snapshot: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.679Z [INFO]  agent.server.raft: snapshot restore progress: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700 read-bytes=62 percent-complete=0.00%
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.680Z [ERROR] agent.server.raft: failed to restore snapshot: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700 error="object missing primary index"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.680Z [INFO]  agent.server.raft: starting restore from snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO]  agent.server.raft: snapshot restore progress: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 read-bytes=62 percent-complete=0.00%
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [ERROR] agent.server.raft: failed to restore snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 error="object missing primary index"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO]  agent.server: shutting down server
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO]  agent: Exit code: code=1
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Failed with result 'exit-code'.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Scheduled restart job, restart counter is at 6.
Aug 11 07:52:03 consulserver.example.com systemd[1]: Stopped Consul Agent.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Start request repeated too quickly.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Failed with result 'exit-code'.
Aug 11 07:52:03 consulserver.example.com systemd[1]: Failed to start Consul Agent.

Just re-checked and no additional logs are available with level set to debug.

optiz0r avatar Aug 11 '22 07:08 optiz0r

I can confirm that we had to downgrade to 1.12.3 to bring back the cluster. In another cluster we left 2 nodes on 1.12.3 and one on 1.13.0, and the 1.13.0 node doesn't boot.

Here's the inspect of the snapshot that fails to restore:

$ consul snapshot inspect /var/consul/raft/snapshots/1651-6973752-1659985741788/state.bin
 ID           1651-6973752-1659985741788
 Size         148533
 Index        6973752
 Term         1651
 Version      1

 Type                        Count      Size
 ----                        ----       ----
 Register                    72         58.6KB
 KVS                         5          50.1KB
 ACLToken                    33         14.5KB
 ACLPolicy                   23         8.7KB
 ConfigEntry                 13         4KB
 ConnectCA                   1          2.7KB
 ConnectCAProviderState      1          1.9KB
 CoordinateBatchUpdate       11         1.8KB
 Index                       30         1KB
 Session                     3          612B
 FederationState             3          441B
 SystemMetadata              3          211B
 Autopilot                   1          199B
 ConnectCAConfig             1          180B
 Tombstone                   1          89B
 ServiceVirtualIP            1          70B
 FreeVirtualIP               1          33B
 ChunkingState               1          12B
 ----                        ----       ----
 Total                                  145.1KB

tuxillo avatar Aug 11 '22 08:08 tuxillo

Same problem on a 3-node test cluster running on AlmaLinux 8. Downgraded to 1.12.3 to get the agents to start.

dani avatar Aug 11 '22 08:08 dani

FYI - Same here on a single-node dev cluster. Snapshot info:

 ID           32-2392224-1659978506821
 Size         73385
 Index        2392224
 Term         32
 Version      1

 Type                        Count      Size
 ----                        ----       ----
 Register                    56         60KB
 ConfigEntry                 13         4.2KB
 Index                       48         2.1KB
 KVS                         12         1.6KB
 ConnectCA                   1          1.2KB
 ConnectCAProviderState      1          1.2KB
 ServiceVirtualIP            8          471B
 SystemMetadata              3          215B
 ConnectCAConfig             1          205B
 Autopilot                   1          199B
 CoordinateBatchUpdate       1          174B
 FederationState             1          139B
 FreeVirtualIP               1          33B
 ChunkingState               1          12B
 ----                        ----       ----
 Total                                  71.7KB

MikeN123 avatar Aug 11 '22 13:08 MikeN123

@MikeN123, thanks so much for sharing; it's really useful. To help troubleshoot this issue, do you mind sharing the agent configuration and the steps to reproduce?

huikang avatar Aug 11 '22 13:08 huikang

Here's mine

data_dir = "/opt/consul/data"
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"

advertise_addr = "10.99.5.18"

ports {
  dns = 8600
  http = 8500
  https = 8501
  grpc = 8502
  serf_lan = 8301
  serf_wan = 8302
  server = 8300
  sidecar_min_port = 21000
  sidecar_max_port = 21255
  expose_min_port = 21500
  expose_max_port = 21755
}

retry_join = [
  "nomad1.example.org",
  "nomad2.example.org",
  "nomad3.example.org",
]

server = true
bootstrap_expect = 3
performance {
  raft_multiplier = 1
}

encrypt = "<hidden>"

ui_config {
  enabled = true
}

recursors = [
  "10.99.5.1",
]

telemetry {
  prometheus_retention_time = "1h"
}

connect {
  enabled = true
}

log_level = "INFO"
  
node_meta {
}

Started by this systemd unit

[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/opt/consul/etc/consul.hcl

[Service]
Type=notify
EnvironmentFile=-/opt/consul/etc/consul.env
User=consul
Group=consul
ExecStart=/opt/consul/bin/consul agent -config-dir=/opt/consul/etc/
ExecReload=/bin/kill --signal HUP $MAINPID
SuccessExitStatus=1
Restart=on-failure
RestartSec=2
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Nothing special to reproduce: just run a Consul cluster on 1.12.3 (installed and bootstrapped with 1.12.3), swap the binary for the 1.13.0 version, and restart the service.

dani avatar Aug 11 '22 13:08 dani

Config here (1.12.3 works fine, 1.13.0 does not start) - almost completely default, except for enabling Connect.

data_dir = "/opt/consul"
client_addr = "0.0.0.0"
ui = true
server = true

bootstrap_expect = 1

connect {
  enabled = true
}

ports {
  grpc = 8502
}
bind_addr = "{{ GetInterfaceIP \"ens192\" }}"

Steps to reproduce - no idea, I just upgraded the node, which has been running for several versions and never had issues with upgrades before.

Maybe it's related to Connect, as all installs above seem to have that in common?

MikeN123 avatar Aug 11 '22 13:08 MikeN123

We have found the root cause and are working on a fix. Thanks to everyone for helping troubleshoot the issue!

huikang avatar Aug 11 '22 14:08 huikang

Thank you everyone for the reports. We narrowed down the issue to a breaking change that has been addressed in 1.13.1.

This bug in 1.13.0 affects those upgrading from Consul 1.11+ AND using Connect service mesh (i.e. Connect proxies registered). Everyone is advised to upgrade to 1.13.1 instead.
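For anyone gating upgrades in automation, the affected condition described above can be encoded as a simple guard. A hypothetical sketch (version parsing only; detecting whether Connect proxies are actually registered is out of scope here):

```python
def upgrade_hits_snapshot_bug(from_version, to_version, uses_connect):
    """True if this upgrade path matches the 1.13.0 snapshot-restore
    bug: upgrading from Consul 1.11+ to exactly 1.13.0 while using
    Connect service mesh."""
    def parse(v):
        # "1.12.3" -> (1, 12, 3); tuples compare component-wise
        return tuple(int(x) for x in v.split("."))
    return (uses_connect
            and parse(from_version) >= (1, 11, 0)
            and parse(to_version) == (1, 13, 0))
```

With this guard, an upgrade pipeline would skip 1.13.0 and go straight to 1.13.1 as advised.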

kisunji avatar Aug 12 '22 13:08 kisunji