
Vault is giving an error during request forwarding.


Environment: DEV/PROD

  • Vault Version: 1.11.0
  • Operating System/Architecture: AWS EC2 machine (Linux)

Vault Config File:

    "storage": {
        "dynamodb": {
            "ha_enabled": "true",
            "region": "${AWS_REGION}",
            "table": "${DYNAMO_DB_TABLE_NAME}"
        }
    },
    "listener": {
        "tcp": {
            "address": "0.0.0.0:8200",
            "tls_disable": "true"
        }
    },
    "seal": {
        "awskms": {
            "region":"${AWS_REGION}",
            "kms_key_id":"${KMS_KEY_ID}"
        }
    },
    "ui": true
}

Startup Log Output:

2022-11-16T13:55:38.294Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.294Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2022-11-16T13:55:38.357Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.357Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2022-11-16T13:55:38.375Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.375Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2022-11-16T13:55:38.394Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.253:444: i/o timeout\""
2022-11-16T13:55:38.394Z [ERROR] core: forward request error: error="error during forwarding RPC request"

Expected Behavior:

  • We have a setup of 2 Vault nodes, vault-a and vault-b, each resolvable by its own DNS name (vault-a.com and vault-b.com). We expected them to handle request forwarding between each other without any errors.

Actual Behavior:

  • Every time we send a request to read data from Vault, the error above appears in Vault, and the service that calls Vault internally returns a 500 (Internal Server Error). (Screenshot posted below.)

Steps to Reproduce:

  • We have 2 nodes, vault-a and vault-b, with the same configuration and the same storage backend.

  • One node is active and the other is in standby.

  • We placed a load balancer in front that forwards requests to the nodes in round robin.

  • Whenever any request comes in, we see the RPC error above in the console of the standby node.

  • We provide vault_api_addr through our environment variables as DNS names, e.g. vault-a.com and vault-b.com (see the sketch after this list).
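
For illustration, a minimal sketch of how such addresses can be supplied via environment variables before starting the server (the hostnames and ports here are placeholders, not our exact startup script; VAULT_API_ADDR and VAULT_CLUSTER_ADDR are the environment-variable equivalents of api_addr and cluster_addr):

    # Illustrative only: per-node addresses used for client redirection and
    # server-to-server request forwarding in HA mode.
    export VAULT_API_ADDR="http://vault-a.com:8200"      # address clients and other nodes use to reach this node's API
    export VAULT_CLUSTER_ADDR="https://vault-a.com:8201" # cluster port (API port + 1 by default), used for request forwarding
    vault server -config=/vault/config/vault.json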

Important Factoids:

  • Here is our Dockerfile as well.

FROM vault:1.11.0
WORKDIR /vault

ENV DYNAMO_DB_TABLE_NAME=vault
ENV RUNTIME_DEPS=gettext
ENV BUILD_DEPS="build_deps moreutils"

RUN apk add --update $RUNTIME_DEPS && \
    apk add --virtual $BUILD_DEPS && \
    cp /usr/bin/envsubst /usr/local/bin/envsubst && \
    cp /usr/bin/sponge /usr/local/bin/sponge && \
    apk del $BUILD_DEPS

COPY vault.json ./config/
COPY docker-entrypoint.sh .
ENTRYPOINT ["/vault/docker-entrypoint.sh"]
CMD ["vault","server","-config=/vault/config/vault.json"]

  • Here is our entrypoint.sh file

#!/bin/sh
envsubst '${DYNAMO_DB_TABLE_NAME} ${KMS_KEY_ID}' < /vault/config/vault.json | sponge /vault/config/vault.json
exec "$@"
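
For completeness, a hypothetical example of how the container would be run so that envsubst has values to substitute (the image name, key ID, and region are placeholders):

    # Hypothetical invocation; image name, KMS key ID, and region are placeholders.
    docker run -d --name vault \
      -e AWS_REGION="us-east-1" \
      -e DYNAMO_DB_TABLE_NAME="vault" \
      -e KMS_KEY_ID="<kms-key-id>" \
      -p 8200:8200 \
      our-vault-image:latest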

  • Here is the error screenshot we got in the service that calls Vault internally.

DeepinScreenshot_select-area_20221117135438

  • Here is the error screenshot we got on our Vault standby node for each request.

DeepinScreenshot_select-area_20221117173720

References:

ddjain avatar Nov 17 '22 12:11 ddjain

Thanks for filing this ticket. Could you please provide the complete Vault config for both Vault servers? Also, could you please make sure that there is no networking issue? For example, is it possible to telnet from the vault-b host to vault-a? It would also be helpful to see the output of vault status. Thank you!
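
Something along these lines from the vault-b host would confirm connectivity (a sketch assuming the default cluster port 8201 and that nc or telnet is available):

    # From the vault-b host: can we reach vault-a's API port and cluster (forwarding) port?
    nc -zv vault-a.com 8200
    nc -zv vault-a.com 8201   # request forwarding between nodes uses the cluster port

    # And on each node:
    vault status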

hghaf099 avatar Nov 17 '22 15:11 hghaf099

I have a similar issue (maybe the root cause is the same).

My setup: Kubernetes cluster (hosted in DigitalOcean), Vault backend (KMS and storage) hosted in GCP (KMS, bucket). HA enabled - 3 replicas. Vault installed from the official Helm chart. No customization or specific config (auto-unseal with KMS only).

All replicas are up and running. One is active, the rest are standby nodes.

vault status command output from active node:

Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.12.0
Build Date               2022-10-10T18:14:33Z
Storage Type             gcs
Cluster Name             vault-cluster-a071b22d
Cluster ID               9cfee336-9083-4beb-8f5e-8fd048c4f144
HA Enabled               true
HA Cluster               https://vault-1.vault-internal:8201
HA Mode                  active
Active Since             2022-11-28T19:56:23.329704191Z

vault status command output from standby node:

Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.12.0
Build Date               2022-10-10T18:14:33Z
Storage Type             gcs
Cluster Name             vault-cluster-a071b22d
Cluster ID               9cfee336-9083-4beb-8f5e-8fd048c4f144
HA Enabled               true
HA Cluster               https://vault-1.vault-internal:8201
HA Mode                  standby
Active Node Address      http://10.244.0.91:8200

Everything looks good except the logs on standby nodes (both):

2022-11-28T20:08:02.969Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Canceled desc = context canceled"
2022-11-28T20:08:02.969Z [ERROR] core: forward request error: error="error during forwarding RPC request"

I checked connectivity between pods using pod IPs and hostnames - everything is OK.
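
(For reference, the kind of check I ran, sketched with illustrative pod and service names, assuming nc is available in the image:)

    # From a standby pod, probe the active node's API and cluster ports.
    kubectl exec -n vault vault-0 -- nc -zv vault-1.vault-internal 8200
    kubectl exec -n vault vault-0 -- nc -zv vault-1.vault-internal 8201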

mirkoszy avatar Nov 28 '22 20:11 mirkoszy

  • Below are the screenshots of the Vault configuration for both nodes.

DeepinScreenshot_select-area_20221202135749

DeepinScreenshot_select-area_20221202135629

  • Below are screenshots of vault status for both nodes, vault-a (active) and vault-b (standby).

DeepinScreenshot_select-area_20221202134937

DeepinScreenshot_select-area_20221202134756

  • Yes, I have verified that we are able to telnet from both nodes to each other using their IPs.

ddjain avatar Dec 02 '22 08:12 ddjain

We are running 5 nodes of Vault in HA mode in k8s and are seeing these errors every time a node is shut down.

rcomanne avatar Dec 05 '22 09:12 rcomanne

> I have a similar issue (maybe the root cause is the same) […]

We have the same errors as described in this comment, and the status output is also similar.

wim-de-groot avatar Dec 05 '22 10:12 wim-de-groot

@hsimon-hashicorp @hghaf099 waiting for feedback.

ddjain avatar Dec 26 '22 10:12 ddjain

ddjain, wim-de-groot, could you please share your Vault configs?

We had the same issue, and I'm wondering whether it might be related to the server-to-server configuration.

Thanks in advance!

t0klian avatar Apr 12 '23 17:04 t0klian

This issue appears to be duplicated in #19342, and I provided an answer there.

maxb avatar Apr 12 '23 19:04 maxb

We are also facing this issue from time to time when the nodes are restarted. The solution that worked for us was to increase the capacity of the DynamoDB table; for some reason, after a restart Vault congests DynamoDB and the status check times out.
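
For illustration, the capacity bump can be applied with something like the following (table name and numbers are placeholders; not applicable if the table uses on-demand capacity):

    # Placeholder values; adjust to your table and workload.
    aws dynamodb update-table \
      --table-name vault \
      --provisioned-throughput ReadCapacityUnits=50,WriteCapacityUnits=50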

ap-vishal avatar Apr 26 '23 15:04 ap-vishal

This issue appears to be duplicated in #19342, and I provided an answer there.

Agreed. Unless there's a specific need for this issue to remain open, I'd like to close it. Max's answer in the other issue should help people get their troubleshooting underway.

heatherezell avatar Apr 26 '23 21:04 heatherezell

Closing as per previous comment.

heatherezell avatar Mar 21 '24 21:03 heatherezell