
Vault 'default' name is not set on server

rwenz3l opened this issue 1 year ago • 8 comments

Nomad version

Output from nomad version

client+server:

$ nomad version
Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328f97d19d2d2d9761251b895b06d82eb1a1

Operating system and Environment details

3 VMs for the servers, many client nodes for the jobs. All running Rocky Linux release 9.3

Issue

I recently started looking into Vault integrations. While this worked in the past, I noticed in a recent test on the newer version that I get an error when scheduling jobs:

Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
        * Vault "default" not enabled but used in the job)

The error comes from here:

https://github.com/hashicorp/nomad/blob/1e04fc461394d96bd4aab0e50cfa80048e1b5fd0/nomad/job_endpoint_hook_vault.go#L38

The name option is documented here, and the docs say it should be omitted for non-enterprise setups: https://developer.hashicorp.com/nomad/docs/configuration/vault#parameters-for-nomad-clients-and-servers

The job spec mentions it here: https://developer.hashicorp.com/nomad/docs/job-specification/vault#cluster

My Vault config on the servers looked like this:

vault {
  enabled          = true
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}

config of the job/task:

      vault {
        # Attach our default policies to the task,
        # so it is able to retrieve secrets from vault.
        policies = ["nomad-cluster-access-kv"]
      }
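
(For reference, the job-spec docs linked above also describe a cluster attribute in Nomad 1.7+; a minimal sketch that names the cluster explicitly, which per the docs should be equivalent to omitting it, would be:)

      vault {
        # Explicitly reference the "default" Vault cluster;
        # per the docs this matches the implicit default.
        cluster  = "default"
        policies = ["nomad-cluster-access-kv"]
      }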

I noticed there has been some work done on this, e.g. here: https://github.com/hashicorp/nomad/commit/1ef99f05364b7d3739befa6a789f0d55b2314dcf

and I think there might be a bug in the initialization of the "default" value: it's either not set or not read.

Reproduction steps

I think one might be able to reproduce this by setting up a 1.7.3 cluster and simply integrating Vault. If I add name = "default" to the vault block on both server and client, it works (see the sketch below). If I don't, I get the error message above.
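
The working server-side config is just the original one plus the explicit name (address redacted as above):

vault {
  enabled          = true
  name             = "default" # explicit workaround for the error above
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}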

Expected Result

The "default" cluster is available by default.

Actual Result

Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
        * Vault "default" not enabled but used in the job)

rwenz3l avatar Feb 07 '24 14:02 rwenz3l

Hi @rwenz3l 👋

I have not been able to reproduce this problem 🤔

Would you be able to share the Vault configuration as returned by the /v1/agent/self API endpoint? You will need to query this on each of your servers.
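
Something like the following should work; the .config.Vaults path is my guess at where the block lands in the response, and jq is only used for filtering:

$ nomad operator api /v1/agent/self | jq '.config.Vaults'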

Could you also make sure all three servers are running Nomad v1.7.3?

Thanks!

lgfa29 avatar Feb 07 '24 22:02 lgfa29

Sure:

ctrl1
    "Vaults": [
      {
        "Addr": "https://vault.*******.com",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": null,
        "Enabled": true,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster-access-auth",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],
ctrl2
    "Vaults": [
      {
        "Addr": "https://vault.********.com",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": null,
        "Enabled": true,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster-access-auth",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],
ctrl3
    "Vaults": [
      {
        "Addr": "https://vault.**********.com",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": null,
        "Enabled": true,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster-access-auth",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],

I will continue my work on the Vault integration and gather more info as I go.

rwenz3l avatar Feb 08 '24 13:02 rwenz3l

Thanks for the extra information @rwenz3l.

All three configurations look right: "Name": "default" and "Enabled": true. Is the cluster a fresh install, or did you upgrade the servers from a previous version of Nomad?

As an aside, since you mentioned you're just starting to look into the Vault integration, I would suggest following the new workflow released in Nomad 1.7, as it will become the only supported option in the future. Here's a tutorial that covers it: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-acl
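
Roughly, the server side of that workflow ends up looking like the sketch below; the address and TTL are placeholders, and jwt_auth_backend_path is shown with its default value:

vault {
  enabled = true
  address = "https://vault.example.com"

  # Path in Vault where the JWT auth method for Nomad workload
  # identities is mounted ("jwt-nomad" is the default).
  jwt_auth_backend_path = "jwt-nomad"

  # Default workload identity that tasks use to authenticate to Vault.
  default_identity {
    aud = ["vault.io"]
    ttl = "1h"
  }
}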

lgfa29 avatar Feb 08 '24 20:02 lgfa29

We've been running this Nomad cluster since 1.3 or so, IIRC; we usually update to the latest major/minor shortly after release.

We definitely plan to use the new workload identities with this; I initially configured the Vault integration before workload identity existed. It was working fine back then, so I guess something before 1.7.x changed how this key/value is handled. From my limited view, it feels like the default value is not read properly if the key is missing from the nomad.hcl configuration. No need to invest too much time; I would advise setting name = "default" in the Nomad config in case someone else sees this error. If I find more info, I will update here.

rwenz3l avatar Feb 08 '24 20:02 rwenz3l

I had the same error after configuring the Vault integration following the new 1.7 workflow.

After a few tries, I realized this was caused by a syntax error: I was missing a comma inside the vault block of the Nomad config file (which, in my case, is written in JSON).

I would expect Nomad not to start at all with a syntax error in the JSON config file, but apparently it only made the Vault integration not work. It might be something similar in your case.

Tirieru avatar Mar 06 '24 22:03 Tirieru

Thanks for the extra info @Tirieru. Improving agent configuration validation is something that's been on our plate for a bit now (https://github.com/hashicorp/nomad/pull/11819).

Would you be able to share the exact invalid configuration that caused this error? I have not been able to reproduce it yet.

Thanks!

lgfa29 avatar Mar 20 '24 23:03 lgfa29

This is what the Nomad server configuration looked like while the error was happening:

{
  "name": "nomad-1",
  "data_dir": "/opt/nomad/data",
  "bind_addr": "<HOST_ADDRESS>",
  "datacenter": "dc1",
  "ports": {
    "http": 4646,
    "rpc": 4647,
    "serf": 4648
  },
  "addresses": {
    "http": "0.0.0.0",
    "rpc": "0.0.0.0",
    "serf": "0.0.0.0"
  },
  "advertise": {
    "http": "<HOST_ADDRESS>",
    "rpc": "<HOST_ADDRESS>",
    "serf": "<HOST_ADDRESS>"
  },
  "acl": {
    "enabled": true
  },
  "server": {
    "enabled": true,
    "rejoin_after_leave": true,
    "raft_protocol": 3,
    "encrypt": "<ENCRYPT_KEY>",
    "bootstrap_expect": 1,
    "job_gc_interval": "1h",
    "job_gc_threshold": "24h",
    "deployment_gc_threshold": "120h",
    "heartbeat_grace": "60s"
  },
  "limits": {
    "http_max_conns_per_client": 300,
    "rpc_max_conns_per_client": 300
  },
  "vault": {
    "token": "<VAULT_TOKEN>",
    "create_from_role": "nomad-cluster",
    "default_identity": {
      "aud": ["<VAULT_AUD>"],
      "ttl": ""
    }
    "address": "<VAULT_ADDRESS>",
    "enabled": true  
  },
  "log_level": "INFO"
}

Adding the missing comma on line 45 of the file (after the closing brace of the default_identity object) fixed the issue.
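
For comparison, the corrected vault block, with the comma added after the default_identity object:

  "vault": {
    "token": "<VAULT_TOKEN>",
    "create_from_role": "nomad-cluster",
    "default_identity": {
      "aud": ["<VAULT_AUD>"],
      "ttl": ""
    },
    "address": "<VAULT_ADDRESS>",
    "enabled": true
  },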

Tirieru avatar Mar 21 '24 02:03 Tirieru

Thank you @Tirieru!

Yes, I can verify that the invalid JSON does cause the same error message, but unlike in @rwenz3l's case, the /v1/agent/self API does return the default Vault configuration as disabled:

    "Vaults": [
      {
        "Addr": "https://vault.service.consul:8200",
        "AllowUnauthenticated": true,
        "ConnectionRetryIntv": 30000000000,
        "DefaultIdentity": {
          "Audience": [
            "vault.io"
          ],
          "Env": null,
          "File": null,
          "TTL": null
        },
        "Enabled": null,
        "JWTAuthBackendPath": "jwt-nomad",
        "Name": "default",
        "Namespace": "",
        "Role": "nomad-cluster",
        "TLSCaFile": "",
        "TLSCaPath": "",
        "TLSCertFile": "",
        "TLSKeyFile": "",
        "TLSServerName": "",
        "TLSSkipVerify": null,
        "TaskTokenTTL": "",
        "Token": "<redacted>"
      }
    ],

I'm not sure why this configuration is accepted though. I think the root cause is that Nomad agent configuration is still parsed with the old HCLv1 syntax, which has a less strict JSON parser.
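
To make the failure mode concrete, here is the relevant fragment from the config above again; it is invalid JSON, yet the agent started, and the two attributes after the missing comma ("address" and "enabled") appear to have been silently dropped, which matches the "Enabled": null and the default Addr seen in /v1/agent/self:

    "default_identity": {
      "aud": ["<VAULT_AUD>"],
      "ttl": ""
    }
    "address": "<VAULT_ADDRESS>",
    "enabled": true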

lgfa29 avatar Mar 21 '24 23:03 lgfa29