[ERROR] nomad.client: Vault token creation for alloc failed: alloc_id=XXX error="Connection to Vault has not been established"

Open allantaylor8907 opened this issue 3 years ago • 2 comments

Nomad version

Server cluster versions

Amazon Linux 1: Nomad 1.0.0, Vault 1.5.4, Consul 1.9.0

Nomad worker cluster versions

Ubuntu 20.04: Nomad 1.1.3, Vault 1.8.1, Consul 1.10.1

Vault cluster

Amazon Linux 1: Vault 1.5.4, Consul 1.9.0, Nomad 1.0.0 (not running)

Issue

Nomad has no connection to Vault

    2022-02-24T16:55:11.441Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
    2022-02-24T16:55:11.441Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.17.4.134:4647 [Candidate]" term=4749
    2022-02-24T16:55:11.463Z [INFO]  nomad.raft: election won: tally=2
    2022-02-24T16:55:11.463Z [INFO]  nomad.raft: entering leader state: leader="Node at 172.17.4.134:4647 [Leader]"
    2022-02-24T16:55:11.463Z [INFO]  nomad.raft: added peer, starting replication: peer=172.17.4.41:4647
    2022-02-24T16:55:11.463Z [INFO]  nomad.raft: added peer, starting replication: peer=172.17.4.74:4647
    2022-02-24T16:55:11.464Z [INFO]  nomad: cluster leadership acquired
    2022-02-24T16:55:11.467Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 172.17.4.41:4647 172.17.4.41:4647}"
    2022-02-24T16:55:11.511Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 172.17.4.74:4647 172.17.4.74:4647}"
    2022-02-24T16:55:11.606Z [ERROR] nomad.fsm: deregistering job failed: job=plugin-aws-ebs-nodes error="DeleteJob failed: deleting job from plugin: plugin missing: aws-ebs0 <nil>"
    2022-02-24T16:55:14.946Z [ERROR] nomad.client: Vault token creation for alloc failed: alloc_id=00d08572-5d41-ff55-ddbd-9b973c09075e error="Connection to Vault has not been established"
    2022-02-24T16:55:21.695Z [WARN]  nomad.vault: failed to contact Vault API: retry=30s error="Get "https://active.vault.service.consul:8200/v1/sys/health?drsecondarycode=299&performancestandbycode=299&sealedcode=299&standbycode=299&uninitcode=299": Forbidden"
    2022-02-24T16:56:02.426Z [WARN]  nomad.vault: failed to contact Vault API: retry=30s error="Get "https://active.vault.service.consul:8200/v1/sys/health?drsecondarycode=299&performancestandbycode=299&sealedcode=299&standbycode=299&uninitcode=299": Forbidden"

Reproduction steps

I followed the policy and role creation steps detailed HERE and received these errors when running a job.

Expected Result

The job runs normally.

Actual Result

The errors shown in the logs above:

[ERROR] nomad.client: Vault token creation for alloc failed: alloc_id=XXX error="Connection to Vault has not been established"

Nomad Config

datacenter = "us-east-1f"
name       = "REDACTED"
region     = "us-east-1"
bind_addr  = "0.0.0.0"

advertise {
  http = "REDACTED"
  rpc  = "REDACTED"
  serf = "REDACTED"
}



server {
  enabled = true
  bootstrap_expect = 3
}

consul {
  address = "127.0.0.1:8500"
}

vault {
  enabled = true
  address = "https://active.vault.service.consul:8200"
  task_token_ttl = "1h"
  create_from_role = "nomad-cluster"
  token = "REDACTED"
}

I have tried the root token with the same results. The Vault token appears to work with the Vault CLI on the Nomad servers, and I can use that token to create additional tokens, but I cannot get Nomad itself to work with it. I am currently updating the versions on all the systems.

allantaylor8907 avatar Feb 24 '22 18:02 allantaylor8907

Hi @allantaylor8907 👋

Are you using TLS with Vault? Your vault config is using https, but it seems to be missing the mTLS CA, key, and cert files (https://www.nomadproject.io/docs/configuration/vault#ca_file).
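
For reference, here is a minimal sketch of a `vault` block with the TLS settings filled in, reusing the address and role from the config above. The three file paths are hypothetical placeholders, not values from this cluster, and `cert_file` and `key_file` should only matter if the Vault listener is set up to require client certificates.

```hcl
vault {
  enabled          = true
  address          = "https://active.vault.service.consul:8200"
  task_token_ttl   = "1h"
  create_from_role = "nomad-cluster"
  token            = "REDACTED"

  # CA used to verify Vault's server certificate (hypothetical path).
  ca_file = "/etc/nomad.d/tls/vault-ca.pem"

  # Client certificate and key for mTLS (hypothetical paths); only relevant
  # if Vault is configured to require client certificates.
  cert_file = "/etc/nomad.d/tls/nomad-client.pem"
  key_file  = "/etc/nomad.d/tls/nomad-client-key.pem"
}
```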

lgfa29 avatar Feb 25 '22 16:02 lgfa29

I'm experiencing the same issue with Nomad 1.3.5 and Vault 1.11.3.

Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:     2022-10-19T12:51:45.595Z [ERROR] nomad.client: Vault token creation for alloc failed: alloc_id=08358105-749e-a001-3306-626916518553
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:   error=
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:   | failed to create an alloc vault token: Error making API request.
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:   |
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:   | URL: POST https://vault-server-us-west-2-88002.example.io:8200/v1/auth/token/create/nomad-server-aws-us-west-2-001
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:   | Code: 403. Errors:
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:   |
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:   | * permission denied
Oct 19 12:51:45 ip-10-11-94-193 nomad[32370]:  

The problem goes away if we manually create a new token, update the config, and then send a SIGHUP signal to the Nomad server.

Is there anything we should look at? @tgross @lgfa29

Nomad server vault block:

vault {
  enabled = true
  address = "https://vault-server-us-west-2-88002.example.io:8200"

  ca_file         = "/opt/vault/tls/ca.crt"
  cert_file       = "/opt/vault/tls/tls.crt"
  key_file        = "/opt/vault/tls/tls.key"
  tls_server_name = "vault"

  allow_unauthenticated = true
  create_from_role      = "nomad-server-aws-us-west-2-001"
  token                 = "hvs.AAANNNZZZZZZZZZZ"
}

Vault policy:

path "secrets/data/nomad-server/aws/us-west-2/001/*" {
  capabilities = ["create", "read", "update"]
}

path "secrets/data/nomad/*" {
  capabilities = ["create", "read", "update"]
}

# Allow creating tokens under "nomad-server-aws-us-west-2-001" role. The role name should be
# updated if "nomad-server-aws-us-west-2-001" is not used.
path "auth/token/create/nomad-server-aws-us-west-2-001" {
  capabilities = ["update"]
}

# Allow looking up "nomad-server-aws-us-west-2-001" role. The role name should be updated if
# "nomad-server-aws-us-west-2-001" is not used.
path "auth/token/roles/nomad-server-aws-us-west-2-001" {
  capabilities = ["read"]
}

# Allow looking up the token passed to Nomad to validate the token has the
# proper capabilities. This is provided by the "default" policy.
path "auth/token/lookup-self" {
  capabilities = ["read"]
}

# Allow looking up incoming tokens to validate they have permissions to access
# the tokens they are requesting. This is only required if
# `allow_unauthenticated` is set to false.
path "auth/token/lookup" {
  capabilities = ["update"]
}

# Allow revoking tokens that should no longer exist. This allows revoking
# tokens for dead tasks.
path "auth/token/revoke-accessor" {
  capabilities = ["update"]
}

# Allow checking the capabilities of our own token. This is used to validate the
# token upon startup.
path "sys/capabilities-self" {
  capabilities = ["update"]
}

# Allow our own token to be renewed.
path "auth/token/renew-self" {
  capabilities = ["update"]
}

kholisrag avatar Oct 19 '22 12:10 kholisrag

Hi @kholisrag (or anyone else who finds this when searching for the error message), I got the same error on a cluster with mismatching Nomad/Vault versions. Updating everything to Vault 1.15.2 and Nomad 1.6.2/1.6.3 worked.

TimoWilken avatar Nov 15 '23 16:11 TimoWilken

I'm reviewing open Vault issues following the new Vault workload identity work (ref https://github.com/hashicorp/nomad/issues/15617). It looks like the original reported issue here was pretty clearly the TLS configuration and we never got around to mopping this up. I'm going to close this out but if someone from the community encounters similar problems we'll be more than happy to revisit in a new issue.

tgross avatar Dec 01 '23 21:12 tgross