
Standby Status 474 after Update to 1.19.0

Open usernamemikem opened this issue 9 months ago • 12 comments

Hi,

OS Ubuntu 24.04. AWS EC2 Cluster, 3 Vault servers, two on standby, 5 Consul storage servers.

Updated Vault 1.18.5 to 1.19.0. The standby servers now come up with HTTP status 474 instead of 429.

Purged 1.19.0, reinstalled 1.18.5, and the status went back to 429.

I was able to reproduce this on our DR cluster.

Thanks.

Mike

usernamemikem avatar Mar 11 '25 02:03 usernamemikem

Is anyone going to bother to look at this error?

usernamemikem avatar Mar 25 '25 14:03 usernamemikem

Can you please provide more information about your setup and the specific errors you're getting? For example, from the bug report template:

Environment:

  • Vault Server Version (retrieve with vault status):
  • Vault CLI Version (retrieve with vault version):
  • Server Operating System/Architecture:

Vault server configuration file(s):

# Paste your Vault config here.
# Be sure to scrub any sensitive values

Thanks!

heatherezell avatar Mar 26 '25 22:03 heatherezell

Hi,

As stated above, environment: OS Ubuntu 24.04. AWS EC2 Cluster, 3 Vault servers, two on standby, 5 Consul storage servers.

From the standby server, the current status is 429.

Before Upgrade:

vault version
Vault v1.18.0 (77f26ba561a4b6b1ccd5071b8624cefef7a72e84), built 2024-10-08T09:12:52Z

vault status

Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    5
Threshold       3
Version         1.18.0
Build Date      2024-10-08T09:12:52Z
Storage Type    consul
Cluster Name    vault-cluster-6bd3c00d
Cluster ID      8440b133-5d4f-8e77-78ac-502a5e87df30
HA Enabled      true
HA Cluster      https://vault-dr.cloud.triciti.com:8201
HA Mode         active
Active Since    2025-03-27T17:42:13.759605605Z

After Upgrade:

The standby now shows status 474.

vault version
Vault v1.19.0 (7eeafb6160d60ede73c1d95566b0c8ea54f3cb5a), built 2025-03-04T12:36:40Z

Status is the same, the cluster is up and running.

Error from Standby:

2025-03-27T18:00:15.784303+00:00 ip-10-61-2-11 consul[552]: 2025-03-27T18:00:15.783Z [INFO] agent: Synced check: check=vault:vault-dr.cloud.triciti.com:8200:vault-sealed-check
2025-03-27T18:01:17.769923+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:01:17.769Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=0.0.0.0:8201
2025-03-27T18:01:17.770307+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:01:17.769Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2025-03-27T18:01:17.770355+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:01:17.769Z [INFO] core: vault is unsealed
2025-03-27T18:01:17.770396+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:01:17.769Z [WARN] service_registration.consul: concurrent initialize state change notify dropped
2025-03-27T18:01:17.770438+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:01:17.769Z [INFO] core: entering standby mode
2025-03-27T18:01:17.784105+00:00 ip-10-61-2-11 consul[552]: 2025-03-27T18:01:17.783Z [INFO] agent: Synced check: check=vault:vault-dr.cloud.triciti.com:8200:vault-sealed-check
2025-03-27T18:01:38.026468+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:01:38.025Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.61.2.43:8201: i/o timeout""
2025-03-27T18:01:38.026628+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:01:38.026Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2025-03-27T18:02:12.863686+00:00 ip-10-61-2-11 consul[552]: 2025-03-27T18:02:12.862Z [INFO] agent: Synced service: service=vault:vault-dr.cloud.triciti.com:8200
2025-03-27T18:02:12.873369+00:00 ip-10-61-2-11 consul[552]: 2025-03-27T18:02:12.873Z [INFO] agent: Synced check: check=vault:vault-dr.cloud.triciti.com:8200:vault-sealed-check
2025-03-27T18:02:40.140231+00:00 ip-10-61-2-11 vault[568]: 2025-03-27T18:02:40.139Z [INFO] http: TLS handshake error from 127.0.0.1:42484: remote error: tls: bad certificate

If I remove version 1.19.0 and go back to 1.18, Standby status goes back to 429.

Please let me know if you need any additional information.

Thanks for your help.

Mike

usernamemikem avatar Mar 27 '25 18:03 usernamemikem

I fixed the bad certificate error, but same results.

vault[1349]: 2025-03-27T18:14:32.984Z [INFO] core: entering standby mode
2025-03-27T18:14:32.998921+00:00 ip-10-61-2-11 consul[552]: 2025-03-27T18:14:32.998Z [INFO] agent: Synced check: check=vault:vault-dr.cloud.triciti.com:8200:vault-sealed-check
2025-03-27T18:14:53.249455+00:00 ip-10-61-2-11 vault[1349]: 2025-03-27T18:14:53.248Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.61.2.43:8201: i/o timeout""
2025-03-27T18:14:53.252042+00:00 ip-10-61-2-11 vault[1349]: 2025-03-27T18:14:53.249Z [ERROR] core: forward request error: error="error during forwarding RPC request"

Health checks failed with these codes: [474]

Thanks.

Mike

usernamemikem avatar Mar 27 '25 18:03 usernamemikem

Thanks! 474 indicates the standby node can't talk to the active node. When the cluster is running, is the Vault port for your active node (10.61.2.43:8201) accessible?
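For reference, a quick sketch of how the /sys/health status codes map out. 474 is described above; the others are the endpoint's long-standing defaults (this is an illustrative summary, not authoritative documentation):

```shell
# Sketch mapping Vault /sys/health default status codes to meaning.
# 474 ("standby that cannot reach the active node") is taken from this
# thread; the rest are the health endpoint's documented defaults.
vault_health_meaning() {
  case "$1" in
    200) echo "initialized, unsealed, active" ;;
    429) echo "unsealed, standby" ;;
    472) echo "disaster recovery secondary, active" ;;
    473) echo "performance standby" ;;
    474) echo "standby, cannot communicate with the active node" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unrecognized status code: $1" ;;
  esac
}

vault_health_meaning 429   # healthy standby
vault_health_meaning 474   # standby cut off from the active node
```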

heatherezell avatar Mar 27 '25 20:03 heatherezell

Hi.

No, but all previous versions worked. The main URL is an internal load balancer with only port 8200 open; in all the years we've used it, it has never forwarded port 8201. That IP, 10.61.2.43:8201, is the load balancer. The actual servers have port 8201 open to each other, but not to the load balancer.

What changed? What is different in v1.19.0? My other standby, using the exact same security policies, subnets, and open ports, is working just fine.

It looks like v1.19.0 is now using the load balancer address instead of the actual server addresses that previous versions used.

Thanks.

Mike

usernamemikem avatar Mar 27 '25 21:03 usernamemikem

Thanks for that information! I'm wondering if this PR changed the functionality: https://github.com/hashicorp/vault/pull/28991 I'll check with our engineering teams.
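In the meantime, one thing that may be worth trying is pinning the addresses each node advertises, so standbys forward to the active node's real address rather than anything derived from the load balancer. A minimal sketch with placeholder addresses (not a confirmed fix):

```hcl
# Per-node Vault config sketch; the IPs here are placeholders.
# api_addr is the address other nodes use for redirects/forwarding;
# cluster_addr is this node's own cluster (port 8201) address.
api_addr     = "https://10.61.2.11:8200"
cluster_addr = "https://10.61.2.11:8201"
```

Each node would set these to its own reachable addresses, not the load balancer's.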

heatherezell avatar Mar 27 '25 21:03 heatherezell

[Image: screenshot of the three servers' versions and status]

server_1 is running v1.19.0; server_2 and server_0 are running v1.18.0.

You can see that the v1.18.0 standby runs fine on the same system without error.

Thanks.

Mike

usernamemikem avatar Mar 27 '25 21:03 usernamemikem

This is interesting: here is the successful log from v1.18.0, yet it shows the same port 8201 error.

Consul shows three synced checks here, while the log above only shows one.

2025-03-27T21:31:14.976817+00:00 ip-10-61-3-12 vault[571]: 2025-03-27T21:31:14.976Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=0.0.0.0:8201
2025-03-27T21:31:14.976924+00:00 ip-10-61-3-12 vault[571]: 2025-03-27T21:31:14.976Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2025-03-27T21:31:14.976960+00:00 ip-10-61-3-12 vault[571]: 2025-03-27T21:31:14.976Z [INFO] core: vault is unsealed
2025-03-27T21:31:14.976994+00:00 ip-10-61-3-12 vault[571]: 2025-03-27T21:31:14.976Z [WARN] service_registration.consul: concurrent initialize state change notify dropped
2025-03-27T21:31:14.977027+00:00 ip-10-61-3-12 vault[571]: 2025-03-27T21:31:14.976Z [INFO] core: entering standby mode
2025-03-27T21:31:14.991522+00:00 ip-10-61-3-12 consul[554]: 2025-03-27T21:31:14.990Z [INFO] agent: Synced check: check=vault:vault-dr.cloud.triciti.com:8200:vault-sealed-check
2025-03-27T21:31:29.661197+00:00 ip-10-61-3-12 consul[554]: 2025-03-27T21:31:29.660Z [INFO] agent: Synced service: service=vault:vault-dr.cloud.triciti.com:8200
2025-03-27T21:31:29.673019+00:00 ip-10-61-3-12 consul[554]: 2025-03-27T21:31:29.672Z [INFO] agent: Synced check: check=vault:vault-dr.cloud.triciti.com:8200:vault-sealed-check
2025-03-27T21:31:35.232203+00:00 ip-10-61-3-12 vault[571]: 2025-03-27T21:31:35.231Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.61.3.62:8201: i/o timeout""
2025-03-27T21:31:35.232553+00:00 ip-10-61-3-12 vault[571]: 2025-03-27T21:31:35.231Z [ERROR] core: forward request error: error="error during forwarding RPC request"

Thanks again.

Mike

usernamemikem avatar Mar 27 '25 21:03 usernamemikem

Oh, that is interesting! I know the status codes changed in 1.19, so I'm going to follow up and make sure that it's doing what is intended, and that it's thoroughly documented. Really appreciate your patience!

heatherezell avatar Mar 27 '25 22:03 heatherezell

Hi,

When I enter the keys to unseal the standby node, it shows:

Error: This is a standby Vault node but can't communicate with the active node via request forwarding. Sign in at the active node to use the Vault UI.

Thanks.

Mike

usernamemikem avatar Mar 28 '25 04:03 usernamemikem

Vault 1.19.3 still returns 474. Is there a plan to fall back to the original behavior, or any workaround?

satyamz avatar May 16 '25 14:05 satyamz

Hi, the issue still exists with the new release, v1.20.0. I believe I now know why: the status code starts as 429, then changes to 474 after the following error.

vault[575]: 2025-07-09T19:07:46.017Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: tls: failed to verify certificate: x509: certificate is valid for *.cloud.triciti.com, cloud.triciti.com, not fw-8d1bc8b5-81e2-3194-e92c-f54472f05794""

You must have updated the error output, because before all it said was "bad certificate".

Now it says the cert is valid for the correct domain, but not for *fw-8d1bc8b5-81e2-3194-e92c-f54472f05794*.

What is fw-8d1bc8b5-81e2-3194-e92c-f54472f05794?

It's not the cert identifier or ARN, and it's not the name of the internal load balancer.

It seems to be looking at the wrong item/variable to evaluate. The new error is quite explicit.
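For the record, here is the kind of check I use to see which names a cert actually covers, demonstrated on a throwaway self-signed cert with our SAN layout (against a live node you would point openssl s_client at host:8201 instead):

```shell
# Generate a throwaway cert with the same SAN layout as ours, then
# print its Subject Alternative Names. Against a live node you'd run:
#   echo | openssl s_client -connect 10.61.2.43:8201 2>/dev/null \
#     | openssl x509 -noout -ext subjectAltName
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/vault-test.key -out /tmp/vault-test.crt -days 1 \
  -subj "/CN=cloud.triciti.com" \
  -addext "subjectAltName=DNS:*.cloud.triciti.com,DNS:cloud.triciti.com" \
  2>/dev/null

# Prints the SANs, e.g. DNS:*.cloud.triciti.com, DNS:cloud.triciti.com
openssl x509 -in /tmp/vault-test.crt -noout -ext subjectAltName
```

(Requires OpenSSL 1.1.1+ for -addext and -ext.)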

Your developers must know what they changed. Unless you hired developers from Microsoft.

Thanks.

Mike

usernamemikem avatar Jul 09 '25 19:07 usernamemikem

Thank you for the additional information! That's super helpful. I'll see what I can find out - thanks again! :)
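In the meantime, if load balancer health checks are the immediate pain: /sys/health accepts a standbyok query parameter that makes a healthy, unsealed standby answer 200 instead of 429. Whether it also covers the new 474 state I would need to confirm. A sketch (VAULT_ADDR is a placeholder for your node's API address):

```shell
# Build a health-check URL using the standbyok parameter, so healthy
# standbys report 200 rather than a 4xx code. VAULT_ADDR is a placeholder.
VAULT_ADDR="${VAULT_ADDR:-https://127.0.0.1:8200}"
HEALTH_URL="$VAULT_ADDR/v1/sys/health?standbyok=true"
echo "health check URL: $HEALTH_URL"
# Against a live node:
#   curl -sk -o /dev/null -w '%{http_code}\n' "$HEALTH_URL"
```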

heatherezell avatar Jul 09 '25 20:07 heatherezell

@heatherezell - We're seeing this issue as well. Health checks fail on the 474 status code and our quorum becomes unavailable.

joshcruz67 avatar Aug 13 '25 14:08 joshcruz67