Huge regression: upgrading an HA Vault cluster with a DynamoDB backend to 1.11.0 or later breaks a functional cluster

Open obourdon opened this issue 3 years ago • 21 comments

Describe the bug
A fully functional HA Vault cluster whose backend is DynamoDB can be upgraded from 1.9.4 up to 1.10.7 without any problem. Upgrading one step further, to 1.11.0, makes the cluster non-functional.

To Reproduce
Steps to reproduce the behavior:

  1. Run curl http://127.0.0.1:8200/v1/sys/leader
{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"","leader_cluster_address":"","performance_standby":false,"performance_standby_last_remote_wal":0}
  2. Run curl -sk --header "X-Vault-Token: XXXX" https://vault.internal.mycluster.io/v1/sys/ha-status | jq -rS .
{
  "errors": [
    "local node not active but active cluster node not found"
  ]
}

Expected behavior
With version 1.10.7 and below we get:

{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"https://vault.internal.mycluster.io","leader_cluster_address":"https://vault.internal.mycluster.io:444","performance_standby":false,"performance_standby_last_remote_wal":0}

and

{
  "auth": null,
  "data": {
    "nodes": [
      {
        "active_node": true,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-14-90",
        "last_echo": null
      },
      {
        "active_node": false,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-32-220",
        "last_echo": "2022-10-29T09:30:43.564133091Z"
      }
    ]
  },
  "lease_duration": 0,
  "lease_id": "",
  "nodes": [
    {
      "active_node": true,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-14-90",
      "last_echo": null
    },
    {
      "active_node": false,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-32-220",
      "last_echo": "2022-10-29T09:30:43.564133091Z"
    }
  ],
  "renewable": false,
  "request_id": "5de2e940-d6de-9b06-a6da-2293c5aa0e79",
  "warnings": null,
  "wrap_info": null
}

Environment:

  • Vault Server Version (retrieve with vault status): 1.11.0
  • Vault CLI Version (retrieve with vault version): 1.11.0
  • Server Operating System/Architecture: Amazon Linux or any Linux based system

Vault server configuration file(s):

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = true
}

# Mandatory for enabling HA enabled DynamoDB storage backend
api_addr = "https://vault.internal.mycluster.io"

storage "dynamodb" {
  ha_enabled = "true"
  table      = "mycluster-vault-data"
  region     = "eu-west-1"
}

seal "awskms" {
  kms_key_id = "XXXXXXXXXXXX"
}

ui = true

Additional context
The functional cluster was originally running Vault 1.9.4 and I tried to upgrade it directly to 1.12.0, which failed with the behaviour described above. I therefore decided to roll back to 1.9.4 and upgrade one version at a time until it broke. The following versions were tested successfully:

  • 1.9.5, 1.9.6, 1.9.7, 1.9.8, 1.9.9, 1.9.10
  • 1.10.0, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6, 1.10.7

1.11.0 and upwards just break the cluster.

Note also that the Vault cluster, as it resides within an AWS infrastructure, sits behind a properly configured Network Load Balancer, as can be seen in the attached files.

Putting the Vault processes in debug/trace mode did not surface any valuable information that could help solve the issue.

I could not see anything in the 1.11.0 changelog that could help either.

obourdon avatar Oct 29 '22 09:10 obourdon

(Attached screenshots: "Screenshot 2022-10-29 at 11 37 10", "Screenshot 2022-10-29 at 11 37 23")

obourdon avatar Oct 29 '22 09:10 obourdon

Doing a git diff tag-v1.10.7..tag-v1.11.0 -- vault/ha.go shows the change which seems to be the source of the issue,

with the corresponding commit 106f548a41c5aa3a7310e1b53f2196f989913eff

Commenting out the return line just makes the cluster functional again

@ncabatoff any insights on this, please?

obourdon avatar Oct 31 '22 13:10 obourdon

Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says it applies if we are the active node, but that does not seem to be really checked. Behind a load balancer like in AWS, the addresses being compared are "statically" defined in Route53, and only the load balancer distributes the requests evenly across all nodes :-(

This explains why the test is always true in AWS behind an LB, and therefore why the Vault HA cluster is not functional any more.
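To make this concrete, here is a minimal, self-contained Go sketch (not the actual vault/ha.go code; only the condition and the field names come from the quoted diff, the rest is illustrative) of why every standby behind the NLB stops reporting a leader:

package main

import "fmt"

// Minimal stand-ins for the Vault types involved. Field and method names are
// taken from the condition quoted above; everything else is illustrative.
type advertisement struct {
    RedirectAddr string
    ClusterAddr  string
}

type core struct {
    redirectAddr string
    clusterAddr  string
}

func (c *core) ClusterAddr() string { return c.clusterAddr }

// leader mimics what a standby appears to do in 1.11.0 when it resolves the
// active node from the stored advertisement: if the advertised addresses match
// its own, it reports no leader at all.
func (c *core) leader(adv advertisement) (leaderAddr, clusterAddr string) {
    if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr {
        return "", "" // behind an NLB every node takes this branch
    }
    return adv.RedirectAddr, adv.ClusterAddr
}

func main() {
    lb := "https://vault.internal.mycluster.io"
    c := &core{redirectAddr: lb, clusterAddr: lb + ":444"}
    adv := advertisement{RedirectAddr: lb, ClusterAddr: lb + ":444"}
    fmt.Println(c.leader(adv)) // two empty strings: the empty leader_address symptom
}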

obourdon avatar Oct 31 '22 14:10 obourdon

$ dig +short vault.internal.mycluster.io
10.0.15.206
10.0.36.214
10.0.19.203

and therefore

RedirectAddr:https://vault.internal.mycluster.io ClusterAddr:https://vault.internal.mycluster.io:444

appear in the received JSON advertisement structure, and the same "addresses" are in the Core structure.

obourdon avatar Oct 31 '22 14:10 obourdon

As a first suggestion for a more complete test that might solve the issue: run url.ParseRequestURI on the address string. If parsing fails, the comparison can go on as it does today; otherwise net.LookupIP(u.Host) can be checked for the number of records it returns, as sketched below.
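A rough, untested sketch of that idea (the helper name sharedAddr and everything around it are mine, not Vault's):

package main

import (
    "fmt"
    "net"
    "net/url"
)

// sharedAddr reports whether addr looks like a name that several nodes could
// legitimately share (e.g. a load-balancer / round-robin DNS entry): if the
// host part resolves to more than one A record, the "same address as mine"
// comparison should be treated as inconclusive instead of short-circuiting.
func sharedAddr(addr string) bool {
    u, err := url.ParseRequestURI(addr)
    if err != nil {
        return false // not a URL: keep the existing plain string comparison
    }
    host, _, err := net.SplitHostPort(u.Host)
    if err != nil {
        host = u.Host // no explicit port in the URL
    }
    ips, err := net.LookupIP(host)
    return err == nil && len(ips) > 1
}

func main() {
    // With the Route53 entry shown above this resolves to 3 addresses, so the
    // equality test in ha.go would no longer treat "same LB hostname" as "same node".
    fmt.Println(sharedAddr("https://vault.internal.mycluster.io:444"))
}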

My very basic 2 cents

obourdon avatar Oct 31 '22 14:10 obourdon

@ccapurso many thanks for categorising this

obourdon avatar Oct 31 '22 17:10 obourdon

AFAICT I do not think that having DynamoDB as the Vault backend has anything to do with the issue. I am pretty sure I could reproduce this with the Raft backend. Will try to demonstrate this if time permits.

obourdon avatar Nov 01 '22 17:11 obourdon

Hi @obourdon, we saw the same issue today on our cluster running 1.11.3 with a DynamoDB backend on AWS behind an NLB.

Do you know how to reproduce this bug? We have been running 1.11.x for the past month and only noticed it today.

ArchiFleKs avatar Nov 03 '22 13:11 ArchiFleKs

@ArchiFleKs on my side, with a working HA cluster (Vault <= 1.10.7), I just replace the vault binary on any one node with any version >= 1.11.0 and relaunch the associated Linux service, then run curl http://127.0.0.1:8200/v1/sys/leader on the modified node, and I get an answer without any info (i.e. empty strings in leader_address and leader_cluster_address of the JSON response). Note that the other nodes are still functional (at least they are in my setup).

I also guess that this might change if the node you replace the Vault binary on is the current leader/active node. Any Vault call targeted at the now-modified node will fail (in my case 1 out of 3 times, because the cluster has 3 nodes and the LB dispatches requests evenly).

The situation can be completely reversed, even with an entirely failing cluster, by replacing Vault with any version <= 1.10.7 on all failing nodes (at least it was on my setup, where I now have everything back to normal).

obourdon avatar Nov 03 '22 16:11 obourdon

Thanks for your inputs. Currently in the process of downgrading; what about 1.10.8, released 2 days ago?

ArchiFleKs avatar Nov 03 '22 22:11 ArchiFleKs

@ArchiFleKs it seems like it should work perfectly, like 1.10.7 (at least it does not include the failing 1.11.0 code)

obourdon avatar Nov 04 '22 10:11 obourdon

I do not see any activity on this huge regression issue. Can someone please take care of it?

obourdon avatar Nov 16 '22 15:11 obourdon

Still no feedback on this ????

obourdon avatar Nov 28 '22 09:11 obourdon

Hi, @obourdon. Thank you for your patience.

The DynamoDB storage backend is community-supported and thus in-depth knowledge is a bit limited on the Vault team. You mentioned potentially attempting to reproduce this issue using Integrated Storage (raft). Were you able to accomplish that?

ccapurso avatar Dec 06 '22 20:12 ccapurso

@obourdon I think the reason this change is impacting you is that every node has the same cluster_address, https://vault.internal.mycluster.io:444. Do you know why that might be? I recognize that cluster_addr isn't needed for DynamoDB-based clusters, but I believe it should default to a unique value for each node if you don't configure it, specifically listenerIP:apiPort+1.
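To illustrate that default (my own sketch of what listenerIP:apiPort+1 means, not Vault's actual code), each node would end up with a unique cluster address instead of the shared LB hostname:

package main

import (
    "fmt"
    "net"
    "strconv"
)

// defaultClusterAddr sketches the per-node default described above: the node's
// listener IP combined with the API port + 1.
func defaultClusterAddr(nodeIP, apiListenAddr string) (string, error) {
    _, portStr, err := net.SplitHostPort(apiListenAddr)
    if err != nil {
        return "", err
    }
    port, err := strconv.Atoi(portStr)
    if err != nil {
        return "", err
    }
    return "https://" + net.JoinHostPort(nodeIP, strconv.Itoa(port+1)), nil
}

func main() {
    // With the listener from the reported config (0.0.0.0:8200) and the first
    // node's IP, the default would be unique per node:
    fmt.Println(defaultClusterAddr("10.0.14.90", "0.0.0.0:8200")) // https://10.0.14.90:8201 <nil>
}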

Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says it applies if we are the active node, but that does not seem to be really checked

We know we're not the active node because at the top of the function we return if !c.standby. An OSS Vault cluster node is either active or standby, so if it's not standby it must be active.

ncabatoff avatar Dec 12 '22 14:12 ncabatoff

I encountered the same issue, but we are using a specific cluster_addr per node and I cannot reproduce it right now.

ArchiFleKs avatar Dec 16 '22 09:12 ArchiFleKs

I encountered the same issue, but we are using a specific cluster_addr per node and I cannot reproduce it right now.

If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures; if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster; I would only worry about it if it persists or recurs frequently.

ncabatoff avatar Dec 16 '22 15:12 ncabatoff

I must admit that I do not get some of the points described above

AFAIU the only way to describe the cluster address when behind a load balancer with an AWS ASG is to specify the LB address, as you do not know your instances' IPs in advance, and if one instance fails and a new one is respawned by the ASG you should not have to change your Vault cluster configuration files.

Looking at the location of the failing code, I indeed do not think that the issue is related to the backend being DynamoDB, and I still do not understand why this change was made in the first place and what it is supposed to fix.

@ArchiFleKs

I encountered the same issue, but we are using a specific cluster_addr per node and I cannot reproduce it right now.

Do you mean that you have the same issue even though you have a different cluster_addr in each Vault cluster node's config file? Your sentence is quite confusing because of the "cannot reproduce right now" at the end of it.

obourdon avatar Dec 19 '22 17:12 obourdon

@obourdon Yes I encountered the same issue at the same time you reported this bug, but since then I kept Vault > 1.10 running in our dev environment and I was not able to reproduce the issue nor did I encounter it a second time.

And yes, we are using a different cluster_addr, which is auto-computed when a new instance boots up in the ASG, so our configuration looks like this

ArchiFleKs avatar Dec 19 '22 17:12 ArchiFleKs

If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures; if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster; I would only worry about it if it persists or recurs frequently.

@ncabatoff you're right, that's what led me to believe it might not have been the same issue despite using the same backend (DynamoDB). And it might have been just a transient issue, because I was not able to reproduce it in the end.

ArchiFleKs avatar Dec 19 '22 17:12 ArchiFleKs

@ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue.

It would be very nice if Vault could support the same GetPrivateIP Go templating configuration option as Consul does, but it seems that this idea was rejected :-(

@ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent Vault cluster failure, we would greatly appreciate it.

obourdon avatar Dec 20 '22 09:12 obourdon

Still no activity on this one.

obourdon avatar Jan 02 '23 20:01 obourdon

@ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue.

It would be very nice if Vault could support the same GetPrivateIP Go templating configuration option as Consul does, but it seems that this idea was rejected :-(

We wound up doing it in #9109.

@ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent Vault cluster failure, we would greatly appreciate it.

It was to prevent attempting to connect to ourself, which was causing problems in some tests.

ncabatoff avatar Jan 03 '23 13:01 ncabatoff

@ncabatoff could you be a bit more specific, please, and point us to the failing tests and how to potentially run them locally on our development machines? I am really wondering whether this fix is necessary, as it induces a huge regression, and maybe the tests mentioned are the ones that should be corrected (it sometimes happens on my side that the tests I wrote were badly designed in the first place).

obourdon avatar Jan 19 '23 10:01 obourdon

Any update on this?

obourdon avatar Mar 06 '23 09:03 obourdon

I missed the release window on this one; please note that the fix won't be in the next releases, but rather in the ones next month.

ncabatoff avatar Mar 24 '23 15:03 ncabatoff