Vault is giving an error during request forwarding.
Environment: DEV/PROD
- Vault Version: 1.11.0
- Operating System/Architecture: AWS EC2 machine (Linux)
Vault Config File:
"storage": {
"dynamodb": {
"ha_enabled": "true",
"region": "${AWS_REGION}",
"table": "${DYNAMO_DB_TABLE_NAME}"
}
},
"listener": {
"tcp": {
"address": "0.0.0.0:8200",
"tls_disable": "true"
}
},
"seal": {
"awskms": {
"region":"${AWS_REGION}",
"kms_key_id":"${KMS_KEY_ID}"
}
},
"ui": true
}
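Note that the config above sets no api_addr or cluster_addr. When cluster_addr is unset, Vault derives the cluster (request-forwarding) port by adding 1 to the api_addr port, which would be consistent with the dial to port 444 in the logs below if the advertised API address resolves to port 443. A minimal sketch of the two keys for the vault-a node, assuming the default ports and the hostnames from this report:
{
  "api_addr": "http://vault-a.com:8200",
  "cluster_addr": "https://vault-a.com:8201"
}
The cluster_addr must be reachable from the standby node for forwarding to work; the cluster port speaks its own TLS even when the listener has tls_disable set.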
Startup Log Output:
2022-11-16T13:55:38.294Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.294Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2022-11-16T13:55:38.357Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.357Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2022-11-16T13:55:38.375Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.375Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2022-11-16T13:55:38.394Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.394Z [ERROR] core: forward request error: error="error during forwarding RPC request"
Expected Behavior:
- We have a setup of two Vault nodes, vault-a and vault-b, each resolvable by its own DNS name (vault-a.com and vault-b.com). We expected them to forward requests to each other without any errors.
Actual Behavior:
- Every time we send a request to fetch data from Vault, the error above appears in Vault, and the service that calls Vault internally gets a 500 (Internal Server Error). (Screenshot posted below.)
Steps to Reproduce:
- We have two nodes, vault-a and vault-b, with the same configuration and the same database.
- One node is active and the other is in standby.
- We placed a load balancer in front that forwards requests to each node in round robin.
- Whenever any request comes in, we get the RPC error above in the console of the standby node.
- We provide vault_api_addr through our environment variables, which are DNS names, e.g. vault-a.com and vault-b.com (see the sketch below).
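For reference, a minimal sketch of how those addresses can be supplied through the environment on the vault-a node; VAULT_API_ADDR and VAULT_CLUSTER_ADDR are the variables Vault reads, and the hostname and ports here are assumptions based on this report:
#!/bin/sh
# Address advertised to clients for redirects (assumed hostname).
export VAULT_API_ADDR="http://vault-a.com:8200"
# Address other nodes dial for request forwarding; when unset it
# defaults to the api_addr port + 1, so it must be reachable
# between the nodes.
export VAULT_CLUSTER_ADDR="https://vault-a.com:8201"
vault server -config=/vault/config/vault.json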
Important Factoids:
- Here is our Dockerfile as well:
FROM vault:1.11.0
WORKDIR /vault
ENV DYNAMO_DB_TABLE_NAME=vault
ENV RUNTIME_DEPS=gettext
ENV BUILD_DEPS="build_deps moreutils"
RUN apk add --update $RUNTIME_DEPS && \
    apk add --virtual $BUILD_DEPS && \
    cp /usr/bin/envsubst /usr/local/bin/envsubst && \
    cp /usr/bin/sponge /usr/local/bin/sponge && \
    apk del $BUILD_DEPS
COPY vault.json ./config/
COPY docker-entrypoint.sh .
ENTRYPOINT ["/vault/docker-entrypoint.sh"]
CMD ["vault","server","-config=/vault/config/vault.json"]
- Here is our docker-entrypoint.sh file:
#!/bin/sh
envsubst '${DYNAMO_DB_TABLE_NAME} ${KMS_KEY_ID}' < /vault/config/vault.json | sponge /vault/config/vault.json
exec "$@"
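As a side note, the substitution can be sanity-checked outside the container; a small sketch, with placeholder values for the two variables:
#!/bin/sh
export DYNAMO_DB_TABLE_NAME=vault
export KMS_KEY_ID=alias/example-key   # placeholder, not a real key id
# Render the template to stdout; only the two listed variables are
# substituted, everything else is passed through untouched.
envsubst '${DYNAMO_DB_TABLE_NAME} ${KMS_KEY_ID}' < vault.json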
- Here is the error screenshot from the service that calls Vault internally.
- Here is the error screenshot from our Vault standby node for each request.
Thanks for filing this ticket. Would you please provide us with the complete view of your Vault config for both Vault servers? Also, would you please make sure that there is no networking issue? For example, is it possible to telnet from the vault-b host to vault-a? Also, it would be helpful to see the output of vault status. Thank you!
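A sketch of the kind of connectivity check meant here, assuming the default ports and the hostnames from this report (nc used as a stand-in for telnet):
#!/bin/sh
# From the vault-b host: the cluster port (API port + 1) carries the
# forwarded RPCs, so it is the one to verify first.
nc -vz vault-a.com 8201
# The API port the load balancer targets.
nc -vz vault-a.com 8200
# And the HA view from this node (HA Mode: active/standby).
vault status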
I have a similar issue (maybe the root cause is the same).
My setup: a Kubernetes cluster (hosted on DigitalOcean), with the Vault backend (KMS and storage) hosted in GCP (KMS, bucket). HA enabled, 3 replicas. Vault installed from the official Helm chart. No customization or specific config (auto-unseal with KMS only).
All replicas are up and running. One is active, the rest are standby nodes.
vault status command output from the active node:
Key Value
--- -----
Recovery Seal Type shamir
Initialized true
Sealed false
Total Recovery Shares 5
Threshold 3
Version 1.12.0
Build Date 2022-10-10T18:14:33Z
Storage Type gcs
Cluster Name vault-cluster-a071b22d
Cluster ID 9cfee336-9083-4beb-8f5e-8fd048c4f144
HA Enabled true
HA Cluster https://vault-1.vault-internal:8201
HA Mode active
Active Since 2022-11-28T19:56:23.329704191Z
vault status command output from the standby node:
Key Value
--- -----
Recovery Seal Type shamir
Initialized true
Sealed false
Total Recovery Shares 5
Threshold 3
Version 1.12.0
Build Date 2022-10-10T18:14:33Z
Storage Type gcs
Cluster Name vault-cluster-a071b22d
Cluster ID 9cfee336-9083-4beb-8f5e-8fd048c4f144
HA Enabled true
HA Cluster https://vault-1.vault-internal:8201
HA Mode standby
Active Node Address http://10.244.0.91:8200
Everything looks good except the logs on the standby nodes (both of them):
2022-11-28T20:08:02.969Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Canceled desc = context canceled"
2022-11-28T20:08:02.969Z [ERROR] core: forward request error: error="error during forwarding RPC request"
I checked connectivity between pods using pod IPs and hostnames; everything is OK.
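For completeness, a sketch of that pod-to-pod check; the pod name is assumed from the Helm chart defaults, and the active-node address is the one reported in the status output above:
#!/bin/sh
# Hit the active node's health endpoint from a standby pod
# (10.244.0.91:8200 is the Active Node Address reported above).
kubectl exec vault-0 -- wget -qO- http://10.244.0.91:8200/v1/sys/health
# Probe the cluster port used for forwarding, assuming the image's
# nc build supports -z (connect scan without sending data).
kubectl exec vault-0 -- nc -vz vault-1.vault-internal 8201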
- Below are the screenshots of the Vault configuration for both nodes.
- Below are screenshots of vault status for both nodes: vault-a (active) and vault-b (standby).
- Yes, I have verified that we are able to telnet from both nodes to each other using their IPs.
We are running 5 Vault nodes in HA mode in Kubernetes and are seeing these errors every time a node is shut down.
We have the same errors as described in the comment above, and the status output is also similar.
@hsimon-hashicorp @hghaf099, waiting for feedback.
@ddjain @wim-de-groot, can you please share your Vault configs?
We had the same issue, and I'm wondering whether it might be related to the server-to-server configuration.
Thanks in advance!
This issue appears to be duplicated in #19342, and I provided an answer there.
We are also facing this issue from time to time when the nodes are restarted. The solution that worked for us was to increase the capacity of the DynamoDB table; for some reason, after a restart Vault congests DynamoDB and the status check times out.
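A sketch of that capacity bump via the AWS CLI, assuming a provisioned-throughput table named vault; the unit values are purely illustrative:
#!/bin/sh
# Raise read/write capacity on the DynamoDB table backing Vault HA.
aws dynamodb update-table \
  --table-name vault \
  --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100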
This issue appears to be duplicated in #19342, and I provided an answer there.
Agreed. Unless there's a specific need for this issue to remain open, I'd like to close it. Max's answer in the other issue should help people get their troubleshooting underway.
Closing as per previous comment.