Huge regression: Upgrading an HA DynamoDB-backed Vault cluster to 1.11.0 and upwards breaks a functional cluster
Describe the bug
A fully functional HA Vault cluster whose backend is DynamoDB can be updated from 1.9.4 up to 1.10.7 without any problem. Upgrading one more time, to 1.11.0, makes the cluster non-functional.
To Reproduce
Steps to reproduce the behavior:
- Run
curl http://127.0.0.1:8200/v1/sys/leader
{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"","leader_cluster_address":"","performance_standby":false,"performance_standby_last_remote_wal":0}
- Run
curl -sk --header "X-Vault-Token: XXXX" https://vault.internal.mycluster.io/v1/sys/ha-status | jq -rS .
{
  "errors": [
    "local node not active but active cluster node not found"
  ]
}
Expected behavior
With version 1.10.7 and below we get:
{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"https://vault.internal.mycluster.io","leader_cluster_address":"https://vault.internal.mycluster.io:444","performance_standby":false,"performance_standby_last_remote_wal":0}
and
{
  "auth": null,
  "data": {
    "nodes": [
      {
        "active_node": true,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-14-90",
        "last_echo": null
      },
      {
        "active_node": false,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-32-220",
        "last_echo": "2022-10-29T09:30:43.564133091Z"
      }
    ]
  },
  "lease_duration": 0,
  "lease_id": "",
  "nodes": [
    {
      "active_node": true,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-14-90",
      "last_echo": null
    },
    {
      "active_node": false,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-32-220",
      "last_echo": "2022-10-29T09:30:43.564133091Z"
    }
  ],
  "renewable": false,
  "request_id": "5de2e940-d6de-9b06-a6da-2293c5aa0e79",
  "warnings": null,
  "wrap_info": null
}
Environment:
- Vault Server Version (retrieve with vault status): 1.11.0
- Vault CLI Version (retrieve with vault version): 1.11.0
- Server Operating System/Architecture: Amazon Linux or any Linux-based system
Vault server configuration file(s):
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = true
}
# Mandatory for enabling HA enabled DynamoDB storage backend
api_addr = "https://vault.internal.mycluster.io"
storage "dynamodb" {
ha_enabled = "true"
table = "mycluster-vault-data"
region = "eu-west-1"
}
seal "awskms" {
kms_key_id = "XXXXXXXXXXXX"
}
ui = true
Additional context
The functional cluster was originally running Vault 1.9.4 and I tried to upgrade it directly to 1.12.0, which failed with the behaviour described above. I therefore decided to roll back to 1.9.4 and to upgrade one version at a time until it breaks. The following versions were tested successfully:
- 1.9.5, 1.9.6, 1.9.7, 1.9.8, 1.9.9, 1.9.10
- 1.10.0, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6, 1.10.7
1.11.0 and upwards just breaks the cluster.
It is to be noted also that the Vault cluster, as it resides within an AWS infrastructure, sits behind a Network Load Balancer, properly configured as can be seen in the attached files.
Putting the Vault processes in debug/trace mode did not help in finding any valuable information which could solve the issue.
I could not see anything in the changelog of 1.11.0 which could help either.

Doing a git diff tag-v1.10.7..tag-v1.11.0 -- vault/ha.go shows the following code, which seems to be the source of the issue,
with the corresponding commit 106f548a41c5aa3a7310e1b53f2196f989913eff.
Commenting out the return line makes the cluster functional again.
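For readers without the diff at hand, the added check looks roughly as follows (a sketch reconstructed from the condition quoted below and the observed behaviour; the exact comment and return values may differ from the verbatim diff):

// vault/ha.go, inside Core.Leader() on the standby path (sketch, not
// verbatim): after reading the leader advertisement adv, the new code
// bails out when the advertised addresses match our own.
if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr {
	// Interpreted as "the advertisement is about ourselves": report that
	// no leader was found. Behind a load balancer every node advertises
	// the same addresses, so every standby takes this branch.
	return false, "", "", nil
}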
@ncabatoff any insights on this please?
Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says this means we are the active node, but that does not seem to be actually checked, because behind a load balancer like in AWS the addresses being compared are "statically" defined in Route53 and only the load balancer distributes the requests evenly across all nodes :-(
This explains why the test is always true in AWS behind an LB, and therefore why the Vault HA cluster is not functional any more.
$ dig +short vault.internal.mycluster.io
10.0.15.206
10.0.36.214
10.0.19.203
and therefore
RedirectAddr:https://vault.internal.mycluster.io ClusterAddr:https://vault.internal.mycluster.io:444
in the received JSON advertisement structure, and the same "addresses" are in the Core structure.
As a first suggestion, a more complete test which might solve the issue could do a
url.ParseRequestURI on the string: if this fails, the comparison can go on as is; otherwise
net.LookupIP on the URL's host can be checked for the number of records it contains ...
My very basic 2 cents.
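Something along these lines (a rough, untested sketch of that suggestion; the helper name is made up, and u.Hostname() is used rather than u.Host to strip the port before the DNS lookup):

import (
	"net"
	"net/url"
)

// addrLooksShared is a hypothetical helper (not in the Vault codebase):
// it reports whether addr is a URL whose host resolves to several IPs,
// i.e. most likely a load-balancer/Route53 alias shared by every node
// rather than a node-specific address.
func addrLooksShared(addr string) bool {
	u, err := url.ParseRequestURI(addr)
	if err != nil {
		// Not parseable as a URL: the comparison can go on as is.
		return false
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return false
	}
	return len(ips) > 1
}

// The test in vault/ha.go would then only trust the address equality when
// the advertised addresses are not shared across nodes, e.g. (sketch):
//
//	if !addrLooksShared(adv.ClusterAddr) && !addrLooksShared(adv.RedirectAddr) &&
//		adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr {
//		return false, "", "", nil
//	}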
@ccapurso many thanks for categorising this
AFAICT I do not think that having DynamoDB as the Vault backend has anything to do with the issue. I am pretty sure I can reproduce this with the Raft backend. Will try to demonstrate this if time permits.
Hi @obourdon, we saw the same issue today on our cluster running 1.11.3 with a DynamoDB backend, in AWS behind an NLB.
Do you know how to reproduce this bug? We have been running 1.11.X for the past month and only noticed it today.
@ArchiFleKs on my side, with a working HA cluster (Vault <= 1.10.7), I just replace the vault binary on any one node with any version >= 1.11.0, relaunch the associated Linux service, then run
curl http://127.0.0.1:8200/v1/sys/leader on the modified node, and I get the answer without any info (aka empty strings in leader_address and leader_cluster_address of the JSON response).
Note that other nodes are still functional (at least they are in my setup).
I pretty much also guess that this might change if the node you replace the Vault binary on is the current leader/active node. Any Vault call which is targeted at the now modified node will fail (in my case 1 out of 3 times, because the cluster has 3 nodes and the LB dispatches requests evenly).
The situation can be completely reversed, even with an entirely failing cluster, by replacing Vault with any version <= 1.10.7 on all failing nodes (at least it was on my setup, where I now have everything back to normal).
Thanks for your inputs. Currently in the process of downgrading; what about 1.10.8, released 2 days ago?
@ArchiFleKs it seems like it should work perfectly, like 1.10.7 does (at least it does not include the failing 1.11.0 code).
I do not see any activity on this huge regression issue. Can someone please take care of this?
Still no feedback on this????
Hi, @obourdon. Thank you for your patience.
The DynamoDB storage backend is community-supported and thus in-depth knowledge is a bit limited on the Vault team. You mentioned potentially attempting to reproduce this issue using Integrated Storage (raft). Were you able to accomplish that?
@obourdon I think the reason this change is impacting you is that every node has the same cluster_address, https://vault.internal.mycluster.io:444. Do you know why that might be? I recognize that cluster_addr isn't needed for DynamoDB-based clusters, but I believe it should default to a unique value for each node if you don't configure it, specifically listenerIP:apiPort+1 (e.g. a node whose listener is 10.0.14.90:8200 would advertise https://10.0.14.90:8201).
> Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says this means we are the active node, but that does not seem to be actually checked
We know we're not the active node because at the top of the function we return if !c.standby. An OSS Vault cluster node is either active or standby, so if it's not standby it must be active.
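In other words, the flow is roughly as follows (a sketch of the logic just described, not the verbatim vault/ha.go source):

// Sketch of Core.Leader() control flow (not verbatim).
func (c *Core) Leader() (isLeader bool, leaderAddr, clusterAddr string, err error) {
	if !c.standby {
		// Not standby implies we are the active node.
		return true, c.redirectAddr, c.ClusterAddr(), nil
	}
	// Standby path: read the stored leader advertisement; the 1.11.0
	// change discussed above returns early here when the advertised
	// addresses match our own.
	// ... (fetch advertisement, compare addresses, etc.)
	return false, "", "", nil
}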
I encountered the same issue, but we are using a specific cluster_addr per node, and cannot reproduce right now.
If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures, if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster, I would only worry about it if it persists or recurs frequently.
I must admit that I do not get some of the points described above.
AFAIU the only way to describe the cluster address when behind a load balancer using an AWS ASG is to specify the LB address, as you do not know in advance what your instances' IPs will be, and if one instance fails and a new one is respawned by the ASG, you should not have to change your Vault cluster configuration files.
I indeed do not think that the issue is related to the fact that the backend is based on DynamoDB, judging by the location of the failing code, and I still do not understand why this change was made in the first place and what it is supposed to fix.
@ArchiFleKs
> I encountered the same issue, but we are using a specific cluster_addr per node, and cannot reproduce right now

do you mean that you have the same issue even though you have a different cluster_addr in each Vault cluster node config file? Your sentence is quite confusing because of the "cannot reproduce right now" at the end of it.
@obourdon Yes, I encountered the same issue at the same time you reported this bug, but since then I have kept Vault > 1.10 running in our dev environment and I was not able to reproduce the issue, nor did I encounter it a second time.
And yes, we are using a different cluster_addr, which is auto-computed when a new instance boots up on the ASG, so our configuration looks roughly like this
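(a sketch with placeholder values; the actual private IP is injected at boot by the instance's startup script):

# Sketch (placeholder values): api_addr points at the LB, while
# cluster_addr is this node's own private IP, computed at boot.
api_addr     = "https://vault.internal.mycluster.io"
cluster_addr = "https://10.0.14.90:8201"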
> If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures, if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster, I would only worry about it if it persists or recurs frequently.
@ncabatoff you're right, that's what led me to believe it might not have been the same issue despite using the same backend (DynamoDB). And it might have been just a transient issue, because I was not able to reproduce it in the end.
@ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue.
It would be very nice if Vault could support the same GetPrivateIP Golang templating configuration option as Consul does, but it seems that this idea was rejected :-(
@ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent and definitive Vault cluster failure, we would greatly appreciate it.
still no activity on this one
> It would be very nice if Vault could support the same GetPrivateIP Golang templating configuration option as Consul does, but it seems that this idea was rejected :-(
We wound up doing it in #9109.
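That is, the address fields can be given as go-sockaddr templates, so each node derives its own address at startup; a sketch (assuming both address fields accept the template and that the node's private IP is the one to advertise):

# Sketch: per-node addresses derived from the instance's private IP via
# go-sockaddr templating (the capability referenced in #9109).
api_addr     = "https://{{ GetPrivateIP }}:8200"
cluster_addr = "https://{{ GetPrivateIP }}:8201"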
> @ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent and definitive Vault cluster failure, we would greatly appreciate it
It was to prevent attempting to connect to ourself, which was causing problems in some tests.
@ncabatoff could you be a bit more specific and point us to the failing tests, please, and to how we could run them locally on our development machines? I am really wondering if this fix is necessary, as it induces a huge regression, and maybe the tests mentioned are the ones which should be corrected (it sometimes happens on my side that the tests I wrote are bad/badly designed in the first place).
Any update on this?
I missed the release window on this one; please note that the fix won't be in the next releases, but rather the ones next month.