
When using integrated storage, sometimes nodes fail to join the cluster

dpedu opened this issue 2 years ago • 9 comments

Describe the bug

Sometimes, when I launch a new vault instance, it doesn't join the cluster. If I restart vault on the new instance, it still doesn't join. If I stop vault on the new instance, delete its raft database files, and start vault again, it sometimes joins. If I delete the instance entirely and create a new one with a new node_id and IP, it sometimes joins. The chance that a node succeeds in joining seems to decrease over time; originally, joining worked 100% of the time in this cluster.

I have a 5-node vault cluster running in Amazon AWS. I am using integrated storage and AWS tag-based discovery. I have autopilot configured with -cleanup-dead-servers=true -dead-server-last-contact-threshold=30 -min-quorum=3 -server-stabilization-time=30.
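
For reference, that autopilot configuration is applied with the set-config command; a rough sketch, assuming it was set via the CLI rather than the HTTP API (the flag values simply mirror the ones above):

$ vault operator raft autopilot set-config \
    -cleanup-dead-servers=true \
    -dead-server-last-contact-threshold=30 \
    -min-quorum=3 \
    -server-stabilization-time=30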

My workflow is AMI-based: to update anything on these systems, such as OS packages, I replace the instance entirely. This means a vault node is removed from the cluster, and eventually another one joins.

I have a script that submits the unseal keys to vault after the process starts and is listening on port 8200.
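
Roughly, such a script looks like the following (a simplified sketch, not the exact script; the key file path is hypothetical and key handling is intentionally naive):

#!/usr/bin/env bash
# Simplified sketch of an unseal helper. UNSEAL_KEYS_FILE is a hypothetical path;
# in practice the key shares would come from a secrets store, not a flat file.
export VAULT_ADDR="http://127.0.0.1:8200"
UNSEAL_KEYS_FILE="/etc/vault/unseal-keys"

# Wait until the local vault process is listening on port 8200.
until curl -sf -o /dev/null "${VAULT_ADDR}/v1/sys/seal-status"; do
    sleep 2
done

# Submit unseal key shares one at a time until the threshold is reached.
while read -r key; do
    vault operator unseal "$key"
done < "${UNSEAL_KEYS_FILE}"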

To Reproduce

I don't know :(

Expected behavior

Vault instances always join the cluster successfully when asked to do so.

Environment:

  • Vault Server Version: 1.12.0
  • Vault CLI Version: 1.12.0
  • Server Operating System/Architecture: Ubuntu 20.04.5 x64

Vault server configuration file(s):

storage "raft" {
    path = "/var/vault/data"
    node_id = "10.144.82.96"
    performance_multiplier = 1
    retry_join {
        auto_join = "provider=aws region=us-east-1 tag_key=vault_cluster tag_value=ci-vault"
        auto_join_scheme = "http"
    }
}
disable_mlock = true

api_addr = "http://10.144.82.96:8200"
cluster_addr = "http://10.144.82.96:8201"

listener "tcp" {
    address = "0.0.0.0:8200"
    tls_disable = 1
}

default_lease_ttl="24h"
max_lease_ttl="24h"

Additional context

The new node that won't join logs:

Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.568Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.571Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.573Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.575Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.577Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.577Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.580Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.580Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.584Z [INFO]  core: security barrier not initialized
Nov 17 03:51:42 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:42.584Z [INFO]  core: waiting for raft retry join process to complete
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.172Z [INFO]  core: security barrier not initialized
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [DEBUG] discover: Using provider "aws"
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [INFO] discover-aws: Address type  is not supported. Valid values are {private_v4,public_v4,public_v6}. Falling back to 'private_v4'
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [DEBUG] discover-aws: Using region=us-east-1 tag_key=vault_cluster tag_value=ci-vault addr_type=private_v4
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [DEBUG] discover-aws: No static credentials
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [DEBUG] discover-aws: Using environment variables, shared credentials or instance role
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [INFO] discover-aws: Region is us-east-1
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [DEBUG] discover-aws: Creating session...
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.173Z [INFO]  core: [INFO] discover-aws: Filter instances with vault_cluster=ci-vault
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Found 5 reservations
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Reservation r-09675c763f2c305e2 has 1 instances
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Found instance i-04646df72f0665815
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [INFO] discover-aws: Instance i-04646df72f0665815 has private ip 10.144.87.99
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Reservation r-03a4fc0a51fd9c3fe has 1 instances
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Found instance i-09c43933f82975ca0
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [INFO] discover-aws: Instance i-09c43933f82975ca0 has private ip 10.144.87.88
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Reservation r-054869488fd0fad3f has 1 instances
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Found instance i-008623a62f59f1528
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [INFO] discover-aws: Instance i-008623a62f59f1528 has private ip 10.144.87.60
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Reservation r-05dab5f8bf40ebbc8 has 1 instances
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Found instance i-083728cc811051025
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [INFO] discover-aws: Instance i-083728cc811051025 has private ip 10.144.80.255
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Reservation r-0630bec202ad952cf has 1 instances
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Found instance i-0c40b56578b0abe85
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [INFO] discover-aws: Instance i-0c40b56578b0abe85 has private ip 10.144.82.96
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: [DEBUG] discover-aws: Found ip addresses: [10.144.87.99 10.144.87.88 10.144.87.60 10.144.80.255 10.144.82.96]
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.87.99:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.87.88:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.87.60:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.80.255:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.234Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.82.96:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.248Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=0.0.0.0:8201
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.248Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.251Z [INFO]  storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:3000000000, ElectionTimeout:3000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:500000000, LocalID:\"10.144.82.96\", NotifyCh:(chan<- bool)(0xc000e70620), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.interceptLogger)(0xc00061bd70), NoSnapshotRestoreOnStart:true, skipStartup:false}"
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.255Z [INFO]  storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:10.144.87.99 Address:10.144.87.99:8201} {Suffrage:Voter ID:10.144.80.255 Address:10.144.80.255:8201} {Suffrage:Voter ID:10.144.87.60 Address:10.144.87.60:8201} {Suffrage:Voter ID:10.144.87.88 Address:10.144.87.88:8201} {Suffrage:Nonvoter ID:10.144.82.96 Address:10.144.82.96:8201}]"
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.255Z [INFO]  core: successfully joined the raft cluster: leader_addr=http://10.144.87.99:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.255Z [INFO]  core: security barrier not initialized
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.255Z [ERROR] core: failed to get raft challenge: leader_addr=http://10.144.82.96:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   error=
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | error during raft bootstrap init call: Error making API request.
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   |
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | URL: PUT http://10.144.82.96:8200/v1/sys/storage/raft/bootstrap/challenge
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | Code: 503. Errors:
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   |
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | * Vault is sealed
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.257Z [INFO]  storage.raft: entering follower state: follower="Node at 10.144.82.96:8201 [Follower]" leader-address= leader-id=
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.258Z [INFO]  core: security barrier not initialized
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.349Z [WARN]  storage.raft: failed to get previous log: previous-index=4082508 last-index=1 error="log not found"
Nov 17 03:51:54 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:54.270Z [INFO]  core: security barrier not initialized
Nov 17 03:51:54 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:54.982Z [INFO]  storage.raft.snapshot: creating new snapshot: path=/var/vault/data/raft/snapshots/13-4082511-1668657114982.tmp
Nov 17 03:52:04 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:04.283Z [INFO]  core: security barrier not initialized
Nov 17 03:52:04 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:04.979Z [INFO]  storage.raft: snapshot network transfer progress: read-bytes=724123648 percent-complete="55.66%"
Nov 17 03:52:14 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:14.294Z [INFO]  core: security barrier not initialized
Nov 17 03:52:14 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:14.980Z [INFO]  storage.raft: snapshot network transfer progress: read-bytes=1107230720 percent-complete="85.11%"
Nov 17 03:52:20 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:20.547Z [INFO]  storage.raft: copied to local snapshot: bytes=1300945608
Nov 17 03:52:20 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:20.548Z [INFO]  storage.raft.fsm: installing snapshot to FSM
Nov 17 03:52:20 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:20.549Z [INFO]  storage.raft.fsm: snapshot installed
Nov 17 03:52:20 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:20.677Z [INFO]  storage.raft: snapshot restore progress: id=13-4082511-1668657114982 last-index=4082511 last-term=13 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
Nov 17 03:52:20 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:20.677Z [INFO]  storage.raft: Installed remote snapshot
Nov 17 03:52:21 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:21.275Z [WARN]  core: cluster listener is already started
Nov 17 03:52:21 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:21.275Z [INFO]  core: vault is unsealed
Nov 17 03:52:21 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:21.275Z [INFO]  core: entering standby mode
Nov 17 03:52:21 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:21.825Z [WARN]  storage.raft: not part of stable configuration, aborting election
Nov 17 03:52:24 ci-vault-20221117035056245390 systemd[1]: Started vault server.
Nov 17 03:54:35 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:54:35.039Z [INFO]  storage.raft: starting snapshot up to: index=4082511
Nov 17 03:54:35 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:54:35.040Z [INFO]  storage.raft: snapshot complete up to: index=4082511

Leader node's logs during the join:

Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.246Z [INFO]  storage.raft: updating configuration: command=AddNonvoter server-id=10.144.82.96 server-addr=10.144.82.96:8201 servers="[{Suffrage:Voter ID:10.144.87.99 Address:10.144.87.99:8201} {Suffrage:Voter ID:10.144.80.255 Address:10.144.80.255:8201} {Suffrage:Voter ID:10.144.87.60 Address:10.144.87.60:8201} {Suffrage:Voter ID:10.144.87.88 Address:10.144.87.88:8201} {Suffrage:Nonvoter ID:10.144.82.96 Address:10.144.82.96:8201}]"
Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.248Z [INFO]  storage.raft: added peer, starting replication: peer=10.144.82.96
Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.248Z [ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter 10.144.82.96 10.144.82.96:8201}" error="dial tcp 10.144.82.96:8201: connect: connection refused"
Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.253Z [INFO]  system: follower node answered the raft bootstrap challenge: follower_server_id=10.144.82.96
Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.359Z [WARN]  storage.raft: appendEntries rejected, sending older logs: peer="{Nonvoter 10.144.82.96 10.144.82.96:8201}" next=2
Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.359Z [ERROR] storage.raft: failed to get log: index=1 error="log not found"
Nov 17 03:52:20 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:52:20.462Z [INFO]  storage.raft.autopilot: Attempting removal of failed server node: id=10.144.82.96 name=10.144.82.96 address=
Nov 17 03:52:20 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:52:20.463Z [INFO]  storage.raft: removing dead server from raft configuration: id=10.144.82.96
Nov 17 03:52:20 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:52:20.463Z [INFO]  storage.raft: updating configuration: command=RemoveServer server-id=10.144.82.96 server-addr= servers="[{Suffrage:Voter ID:10.144.87.99 Address:10.144.87.99:8201} {Suffrage:Voter ID:10.144.80.255 Address:10.144.80.255:8201} {Suffrage:Voter ID:10.144.87.60 Address:10.144.87.60:8201} {Suffrage:Voter ID:10.144.87.88 Address:10.144.87.88:8201}]"
Nov 17 03:52:20 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:52:20.466Z [INFO]  storage.raft: removed peer, stopping replication: peer=10.144.82.96 last-index=4082738

These two messages look relevant:

Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.248Z [ERROR] storage.raft: failed to appendEntries to: peer="{Nonvoter 10.144.82.96 10.144.82.96:8201}" error="dial tcp 10.144.82.96:8201: connect: connection refused"

And

Nov 17 03:51:44 ci-vault-20221117013305501659 vault[1372]: 2022-11-17T03:51:44.359Z [ERROR] storage.raft: failed to get log: index=1 error="log not found"

I cannot explain the "connection refused" beyond the fact that vault is not (yet?) listening on the port.

This looks somewhat similar to #8489.

This also seems odd, but I'm not sure:

Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.255Z [ERROR] core: failed to get raft challenge: leader_addr=http://10.144.82.96:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   error=
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | error during raft bootstrap init call: Error making API request.
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   |
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | URL: PUT http://10.144.82.96:8200/v1/sys/storage/raft/bootstrap/challenge
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | Code: 503. Errors:
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   |
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   | * Vault is sealed
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]:   

It's trying to fetch the challenge from itself. Is that normal? I don't see the "failed to get raft challenge" message on successful joins, so I think this may be the right direction. The documentation says "Once this is invoked, the joining node will receive a challenge from the Raft's leader node", which sounds to me like the new node is supposed to be reaching out to other nodes?
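
One way to see why that self-request fails is to probe the health endpoint of each discovered address; a rough diagnostic sketch (the IPs are the ones from the discovery log above), where the brand-new node should be the only one answering with a sealed/uninitialized status:

for ip in 10.144.87.99 10.144.87.88 10.144.87.60 10.144.80.255 10.144.82.96; do
    # /v1/sys/health returns 200 for an initialized, unsealed node (with
    # standbyok=true for standbys); a sealed or uninitialized node returns 503 or 501.
    code=$(curl -s -o /dev/null -w '%{http_code}' "http://${ip}:8200/v1/sys/health?standbyok=true")
    echo "${ip} -> HTTP ${code}"
done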

Furthermore, this is the only location I see the raft/bootstrap/challenge path mentioned: https://github.com/hashicorp/vault/blob/42a8cc118970a9be807c350eb7e179bcc1b24afd/vault/raft.go#L921 https://github.com/hashicorp/vault/blob/42a8cc118970a9be807c350eb7e179bcc1b24afd/vault/raft.go#L922

It looks to me like the intent here is for the new node to PUT to some existing node. But in my logs, the node is PUTting to itself?

It does indeed consider itself as a potential leader node:

Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.87.99:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.87.88:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.87.60:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.233Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.80.255:8200
Nov 17 03:51:44 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:51:44.234Z [INFO]  core: attempting to join possible raft leader node: leader_addr=http://10.144.82.96:8200

https://github.com/hashicorp/vault/blob/42a8cc118970a9be807c350eb7e179bcc1b24afd/vault/raft.go#L874

It looks like it is querying ALL discovered nodes for the raft challenge, in parallel, here:

https://github.com/hashicorp/vault/blob/42a8cc118970a9be807c350eb7e179bcc1b24afd/vault/raft.go#L1123

But then, only the first (or an arbitrary one?) of those parallel requests is actually used: the select{} block here reads the channel once and either propagates the error up (in my case, the result of the HTTP 503 from above) or answers the challenge if it successfully retrieved one.

https://github.com/hashicorp/vault/blob/42a8cc118970a9be807c350eb7e179bcc1b24afd/vault/raft.go#L1140

Does this simply need a check to omit a node's own address from the list of possible leaders the raft challenge is retrieved from? Or should the select{} block above be put into a loop, so that it reads the results of all attempts to retrieve the challenge instead of just the first?

dpedu avatar Nov 17 '22 04:11 dpedu

I've taken a stab at the latter fix I describe above. It appears to have completely fixed this problem in my cluster.

#18005

dpedu avatar Nov 17 '22 06:11 dpedu

I suspect this is your real problem:

Nov 17 03:52:21 ci-vault-20221117035056245390 vault[1204]: 2022-11-17T03:52:21.825Z [WARN]  storage.raft: not part of stable configuration, aborting election

What does your autopilot configuration look like, specifically the cleanup dead servers threshold?

ncabatoff avatar Nov 17 '22 13:11 ncabatoff

@ncabatoff

I have a 5-node vault cluster running in Amazon AWS. I am using integrated storage and AWS tag-based discovery. I have autopilot configured with -cleanup-dead-servers=true -dead-server-last-contact-threshold=30 -min-quorum=3 -server-stabilization-time=30.

The other 4 nodes in the cluster are stable at the time this happens and vault is serving traffic normally.

dpedu avatar Nov 17 '22 15:11 dpedu

With your configuration, if a node ever takes more than 30s to complete the join, pull the logs/snapshot, and then send an echo heartbeat to the leader, the leader will remove the new node from the raft configuration, resulting in the error above.

ncabatoff avatar Nov 17 '22 15:11 ncabatoff

Couldn't a node send heartbeats during the transfer to avoid this problem? As vault's database grows, this will take longer. Having a "tipping point" like this seems like a flaw.

dpedu avatar Nov 17 '22 15:11 dpedu

The problem is that until the node has pulled the raft logs or snapshot, it doesn't have the raft configuration, so it doesn't know where to send the heartbeat. Why do you have dead server cleanup set so aggressively? The default is 24h for a reason, i.e. the rather drastic consequences of removing a node that wasn't actually dead.

ncabatoff avatar Nov 17 '22 16:11 ncabatoff

I am using aggressive dead server cleanup to avoid breaking quorum. Here is a test I just performed using the 5-node (min-quorum=3) cluster described in the original post to demonstrate:

  1. I reconfigured autopilot with -cleanup-dead-servers=true -dead-server-last-contact-threshold=3600 -min-quorum=3 -server-stabilization-time=30. One hour before dead servers are cleaned up.
  2. I terminated two vault instances. Vault remains healthy because 3 servers are still up and healthy. The raft peer list still shows 5 members due to the increased threshold (see the list-peers sketch after this list).
  3. I waited for 2 new instances to come up, join the cluster, stabilize, etc. Now there are 7 servers in the peer list: 5 healthy, 2 offline.
  4. I terminated 2 more instances; quorum was lost and the vault cluster became unhealthy. I see messages in the log about "Election timeout reached, restarting election". I can't see the list of nodes, but we can reason that there would be 9 known nodes now: 4 offline, 3 that were healthy, and 2 more that will be attempting to join soon.
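
(The peer list referred to in these steps can be inspected with the raft list-peers command; a minimal sketch:)

$ vault operator raft list-peers    # prints Node, Address, State, and Voter for each raft peer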

It appears that dead but not-yet-removed servers are factored into deciding whether we have quorum or not.

As I am running vault on a cloud platform and I do not have control over the underlying hardware, I cannot control how frequently vault nodes are removed and replaced. I can only control how many need to be present and healthy (healthy from the cloud platform's perspective) at one time.

I think you are right about the cause of this problem. Setting a higher but not absurdly high dead-server-last-contact-threshold seemed to make the issue go away for now; I used 90 seconds. However, this is still a balance where both ends of the scale are disastrous: too low and bootstrapping fails as you described; too high and quorum is easily lost.

Couldn't a node send heartbeats during the transfer to avoid this problem?

The problem is that until the node has pulled the raft logs or snapshot, it doesn't have the raft configuration, so it doesn't know where to send the heartbeat.

I'm curious, since it is able to reach other nodes for the purpose of pulling raft logs or snapshots, how is it that it doesn't know where?

dpedu avatar Nov 17 '22 18:11 dpedu

As I am running vault on a cloud platform and I do not have control over the underlying hardware, I cannot control how frequently vault nodes are removed and replaced. I can only control how many need to be present and healthy (healthy from the cloud platform's perspective) at one time.

I don't understand this; can you elaborate? The only folks I've seen running into problems like this one are those who've decided they wanted to use autoscaling.

I'm curious, since it is able to reach other nodes for the purpose of pulling raft logs or snapshots, how is it that it doesn't know where?

Two different layers. All the Raft RPCs are handled in https://github.com/hashicorp/raft, and these aren't really directly usable by Vault. The vault code (https://github.com/hashicorp/vault/blob/main/vault/request_forwarding_rpc.go#L160-L160) needs to know how to issue an RPC to the active node, but with raft still coming online locally, we don't really have that capacity yet. The biggest obstacle is probably the cert that we need to do mTLS, which lives in storage. I'm not saying it's not possible, theoretically, but it seems like it would be a really big and disruptive effort.

ncabatoff avatar Nov 17 '22 19:11 ncabatoff

I see.

As I am running vault on a cloud platform and I do not have control over the underlying hardware, I cannot control how frequently vault nodes are removed and replaced. I can only control how many need to be present and healthy (healthy from the cloud platform's perspective) at one time.

I don't understand this; can you elaborate? The only folks I've seen running into problems like this one are those who've decided they wanted to use autoscaling.

I'm not using autoscaling in the sense that I want my cluster to automatically change size based on load. No, I don't want that.

I am, however, using an autoscaling group to spawn the instances, and it is set to a static count of 5. The reason is that autoscaling groups automatically replace instances that suffer underlying hardware failures or fail user-specified health checks, and they handle adding/removing the servers from a load balancer, rather than requiring a human to intervene.

The alternative would be to manually launch one-off instances, which pretty much defeats the purpose of using a cloud platform, as there isn't a mechanism to automatically replace them.

Edit: that being said, this seems like an obstacle to those who do want to autoscale their cluster size as well?

dpedu avatar Nov 17 '22 20:11 dpedu

The only folks I've seen running into problems like this one are those who've decided they wanted to use autoscaling.

@ncabatoff I may have this issue. What are the best practices / configurations to use alongside autoscaling?

agill17 avatar Apr 12 '23 21:04 agill17

The documentation for the min_quorum option at https://developer.hashicorp.com/vault/api-docs/system/storage/raftautopilot#min_quorum sounds to me like it implies the option should be set to the total number of servers, not the desired quorum size, which is weird as it conflicts with the option's name. But the code in https://github.com/hashicorp/raft-autopilot/blob/3d695316d0527c0aa658a8141761b42c23a72ccf/reconcile.go#L301-L302 seems to support (if I'm reading it right) the idea that the option is misleadingly named.

In that case, min-quorum for this cluster needs to be 5, not 3, which might help protect the cluster against unwanted removals.
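
That would be something along these lines, if that reading is correct (a sketch, not tested against this cluster):

$ vault operator raft autopilot set-config -min-quorum=5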

maxb avatar Apr 13 '23 09:04 maxb

@dpedu What did you eventually do to fix that problem? We have the same issue... We have a very large kv. I already increased the cleanup threshold to 600 seconds, which allows 1 server to reboot and come back, but as soon as a second one goes offline, the cluster is dead (we have a 3-node cluster, n=3).

koen-venly avatar May 19 '23 08:05 koen-venly

@koen-venly I suggest it would be best if you started a topic in https://discuss.hashicorp.com to talk through your problem in a separate conversation - even though your problem sounds somewhat related, merely rebooting a server shouldn't involve leaving and joining the cluster at all, so I wonder if you actually have a different problem or misconfiguration.

maxb avatar May 19 '23 08:05 maxb

The error messages I get are exactly the same... But OK, I'll check https://discuss.hashicorp.com/

koen-venly avatar May 19 '23 09:05 koen-venly

@koen-venly I believe the original cause of the problem I was having was that the server stabilization time was set too low, as ncabatoff pointed out. My database is somewhat large, so it makes sense that there was not enough time to transfer it to new nodes.

My final tuning for my 5-node cluster ended up like this, and has been stable since last year:

$ vault operator raft autopilot get-config
Key                                   Value
---                                   -----
Cleanup Dead Servers                  true
Last Contact Threshold                10s
Dead Server Last Contact Threshold    1m30s
Server Stabilization Time             1m30s
Min Quorum                            3
Max Trailing Logs                     1000

Also, I believe that if you replace nodes too quickly - at a pace faster than dead nodes are removed from quorum calculations - you'll see the same problem. The rate of node replacement needs to be slower than the Dead Server Last Contact Threshold plus the Last Contact Threshold combined.

dpedu avatar Jun 29 '23 19:06 dpedu

@dpedu thanks for your answer. In the meantime, because I have had nothing but bad experiences with pure storage backends, we shifted away from raft. We now use an RDS PostgreSQL backend. The Vault service runs on 3 EC2 machines. A bonus is that the EC2 instances can now have much lower specs, and in my opinion the data is safer than on the raft storage backend. (Everything above is encrypted, of course.)

koen-venly avatar Jun 30 '23 16:06 koen-venly