Vault 'default' name is not set on server
Nomad version
Output from nomad version
client+server:
$ nomad version
Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328f97d19d2d2d9761251b895b06d82eb1a1
Operating system and Environment details
3 VMs for the servers, many client nodes for the jobs. All running Rocky Linux release 9.3
Issue
I recently started looking into the Vault integration. While this worked in the past, I noticed during a recent test on the newer version that I get an error when scheduling jobs:
Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
* Vault "default" not enabled but used in the job)
The error comes from here:
https://github.com/hashicorp/nomad/blob/1e04fc461394d96bd4aab0e50cfa80048e1b5fd0/nomad/job_endpoint_hook_vault.go#L38
The name option is documented here, and the docs say it should be omitted for non-Enterprise setups: https://developer.hashicorp.com/nomad/docs/configuration/vault#parameters-for-nomad-clients-and-servers
The job spec mentions it here: https://developer.hashicorp.com/nomad/docs/job-specification/vault#cluster
My vault config on the servers looks like this:
vault {
  enabled          = true
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}
Config of the job/task:
vault {
  # Attach our default policies to the task,
  # so it is able to retrieve secrets from vault.
  policies = ["nomad-cluster-access-kv"]
}
I noticed there has been some work done on this, e.g. here: https://github.com/hashicorp/nomad/commit/1ef99f05364b7d3739befa6a789f0d55b2314dcf
I think there might be a bug in the initialization of the "default" value: it's either not set or not read.
Reproduction steps
I think one might be able to reproduce this by setting up a 1.7.3 cluster and simply integrating Vault.
If I add name = "default" to both the server and client configuration, it works. If I don't, I get the error message above.
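A minimal sketch of the workaround on the server side, assuming the rest of the block stays as shown earlier (the client-side vault block gets the same name line):

vault {
  enabled          = true
  name             = "default"   # workaround: set the cluster name explicitly
  token            = "{{ nomad_vault_token }}"
  address          = "https://vault.****.com"
  create_from_role = "nomad-cluster-access-auth"
}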
Expected Result
The "default" cluster is available by default.
Actual Result
Error submitting job: Unexpected response code: 500 (rpc error: 1 error occurred:
* Vault "default" not enabled but used in the job)
Hi @rwenz3l 👋
I have not been able to reproduce this problem 🤔
Would you be able to share the Vault configuration as returned by the /v1/agent/self API endpoint? You will need to query it on each of your servers.
Could you also make sure all three servers are running Nomad v1.7.3?
Thanks!
Sure:
ctrl1
"Vaults": [
{
"Addr": "https://vault.*******.com",
"AllowUnauthenticated": true,
"ConnectionRetryIntv": 30000000000,
"DefaultIdentity": null,
"Enabled": true,
"JWTAuthBackendPath": "jwt-nomad",
"Name": "default",
"Namespace": "",
"Role": "nomad-cluster-access-auth",
"TLSCaFile": "",
"TLSCaPath": "",
"TLSCertFile": "",
"TLSKeyFile": "",
"TLSServerName": "",
"TLSSkipVerify": null,
"TaskTokenTTL": "",
"Token": "<redacted>"
}
],
ctrl2
"Vaults": [
{
"Addr": "https://vault.********.com",
"AllowUnauthenticated": true,
"ConnectionRetryIntv": 30000000000,
"DefaultIdentity": null,
"Enabled": true,
"JWTAuthBackendPath": "jwt-nomad",
"Name": "default",
"Namespace": "",
"Role": "nomad-cluster-access-auth",
"TLSCaFile": "",
"TLSCaPath": "",
"TLSCertFile": "",
"TLSKeyFile": "",
"TLSServerName": "",
"TLSSkipVerify": null,
"TaskTokenTTL": "",
"Token": "<redacted>"
}
],
ctrl3
"Vaults": [
{
"Addr": "https://vault.**********.com",
"AllowUnauthenticated": true,
"ConnectionRetryIntv": 30000000000,
"DefaultIdentity": null,
"Enabled": true,
"JWTAuthBackendPath": "jwt-nomad",
"Name": "default",
"Namespace": "",
"Role": "nomad-cluster-access-auth",
"TLSCaFile": "",
"TLSCaPath": "",
"TLSCertFile": "",
"TLSKeyFile": "",
"TLSServerName": "",
"TLSSkipVerify": null,
"TaskTokenTTL": "",
"Token": "<redacted>"
}
],
I will continue my work on the vault integration and gather some more info with this.
Thanks for the extra information @rwenz3l.
All three configurations look right, with "Name": "default" and "Enabled": true. Is the cluster a fresh install, or have you upgraded the servers from a previous version of Nomad?
As an aside, you mentioned you're just starting to look into the Vault integration, so I would suggest following the new workflow released in Nomad 1.7, as it will become the only supported option in the future. Here's a tutorial that covers it: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-acl
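For orientation, a hedged sketch of what a server-side vault block can look like under the 1.7 workload-identity workflow; the address and audience are placeholders, and the parameter names are taken from the Nomad configuration docs and the agent/self output above rather than from this specific cluster:

vault {
  enabled               = true
  address               = "https://vault.example.com"   # placeholder
  jwt_auth_backend_path = "jwt-nomad"                    # same path as shown in /v1/agent/self above
  default_identity {
    aud = ["vault.io"]   # placeholder audience
    ttl = "1h"
  }
}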
We've been running this Nomad cluster since 1.3 or so, IIRC; we usually update to the latest major/minor shortly after release.
We definitely plan to use the new workload identities with this. I initially configured the Vault integration before workload identity existed, and it was working fine back then, so I guess something before 1.7.x may have affected this key/value. From my limited view, it feels like the default value is not read properly when the key is missing from the nomad.hcl configuration. No need to invest too much time; I would advise anyone who sees this error to set name = "default" in the Nomad config. If I find more info, I will update here.
I had the same error after configuring the Vault integration following the new 1.7 workflow.
After a few tries, I realized this was caused by a syntax error: I was missing a comma inside the vault block of the Nomad config file (which, in my case, is written in JSON).
I would expect Nomad not to start at all with a syntax error in the JSON config file, but apparently it only breaks the Vault integration? It might be something similar in your case.
Thanks for the extra info @Tirieru. Improving agent configuration validation is something that's been on our plate for a bit now (https://github.com/hashicorp/nomad/pull/11819).
Would you be able to share the exact invalid configuration that caused this error? I have not been able to reproduce it yet.
Thanks!
This is how the Nomad server configuration looked while the error was happening:
{
"name": "nomad-1",
"data_dir": "/opt/nomad/data",
"bind_addr": "<HOST_ADDRESS>",
"datacenter": "dc1",
"ports": {
"http": 4646,
"rpc": 4647,
"serf": 4648
},
"addresses": {
"http": "0.0.0.0",
"rpc": "0.0.0.0",
"serf": "0.0.0.0"
},
"advertise": {
"http": "<HOST_ADDRESS>",
"rpc": "<HOST_ADDRESS>",
"serf": "<HOST_ADDRESS>"
},
"acl": {
"enabled": true
},
"server": {
"enabled": true,
"rejoin_after_leave": true,
"raft_protocol": 3,
"encrypt": "<ENCRYPT_KEY>",
"bootstrap_expect": 1,
"job_gc_interval": "1h",
"job_gc_threshold": "24h",
"deployment_gc_threshold": "120h",
"heartbeat_grace": "60s"
},
"limits": {
"http_max_conns_per_client": 300,
"rpc_max_conns_per_client": 300
},
"vault": {
"token": "<VAULT_TOKEN>",
"create_from_role": "nomad-cluster",
"default_identity": {
"aud": ["<VAULT_AUD>"],
"ttl": ""
}
"address": "<VAULT_ADDRESS>",
"enabled": true
},
"log_level": "INFO"
}
Adding the missing comma on line 45 (after the closing brace of the default_identity block) fixed the issue.
Thank you @Tirieru!
Yes, I can verify that the invalid JSON does cause the same error message, but, unlike in @rwenz3l's case, the /v1/agent/self API returns the default Vault configuration as disabled:
"Vaults": [
{
"Addr": "https://vault.service.consul:8200",
"AllowUnauthenticated": true,
"ConnectionRetryIntv": 30000000000,
"DefaultIdentity": {
"Audience": [
"vault.io"
],
"Env": null,
"File": null,
"TTL": null
},
"Enabled": null,
"JWTAuthBackendPath": "jwt-nomad",
"Name": "default",
"Namespace": "",
"Role": "nomad-cluster",
"TLSCaFile": "",
"TLSCaPath": "",
"TLSCertFile": "",
"TLSKeyFile": "",
"TLSServerName": "",
"TLSSkipVerify": null,
"TaskTokenTTL": "",
"Token": "<redacted>"
}
],
I'm not sure why this configuration is accepted though. I think the root cause is that Nomad agent configuration is still parsed with the old HCLv1 syntax, which has a less strict JSON parser.