agent configuration reload API
Today if you want to pick up configuration changes in your Nomad clients or servers, you need to SIGHUP the agent. This is fine if you've got configuration management running that is replacing files and can reload the agent. But it's less useful if you want to force a client to fingerprint reloadable fingerprints, or if you want to stage updated configuration files across the cluster before reloading.
Let's add an Agent Reload API that allows cluster admins to reload the agent's configuration, identically to SIGHUP, by calling the Agent.Reload() method. We'll want to gate this operation behind ACLs, of course, most likely with agent = "write" permissions.
@tgross I don't know if anyone is working on this, but I have a use case for this and can start poking at it.
No one is working on it currently (internal ref NMD-370 SR-202), so you're welcome to take a crack at it if you have time. It's a fairly big lift but mostly just because new RPC endpoints have an unfortunately large amount of boilerplate... we're not using gRPC so there's no easy codegen.
There's an RPC endpoint checklist and this particular item has a couple of quirks:
- This endpoint won't update Raft, so we can skip anything in the FSM
- The endpoint is a "client RPC" that may have to reach out to a client node, so you end up with a handler in nomad/client_agent_endpoint.go and a handler in client/agent_endpoint.go.
- There's an HTTP handler in command/agent/agent_endpoint.go.
I would also assume that we'll want to have a CLI command to trigger this API. There's one interesting design point there: we have nomad node and nomad server subcommands, but no nomad agent subcommands (other than running the agent). Maybe we end up with a new subcommand on nomad config? In any case, the CLI is its own extra chunk of code to deal with, so it's probably a good idea to do that in a separate PR. But it would also be a good idea to have that separate PR in mind when designing the shape of the API.
Design-wise there's nothing too complicated about this, so if you want to make a PR we can talk through details like the exact URL there, or if you want to talk thru it here that's fine too.
Thanks for that detail @tgross. Maybe my use case does not fit here (or I am showing that I am not a developer and don't understand the inner workings of Nomad), but it would be more directly interacting with the locally running agent.
Really the goal of this was to have a method to reload the local agent without needing permissions on the host to trigger the reload. This would be driven by a Nomad token + permissions to trigger the reload.
So command wise I had been thinking:
nomad agent reload
For the API, I was thinking it could work like this:
curl -X PUT \
--header "X-Nomad-Token: TOKEN-WITH-AGENT-WRITE" \
http://localhost:4646/v1/agent/reload
This is more of a unique use case where files can be written, but the account that can write the files does not have permissions to reload. Typically, I would leverage a systemd unit file to watch for changes and trigger the reload, but I have bumped into a few scenarios where systemd was not used with Nomad.
Actually seems like consul has something similar with consul reload.
> more directly interacting with the locally running agent.
The HTTP API typically forwards any requests to the node that needs them via RPC. So a /v1/agent/reload targeting the local agent would be handled entirely on the locally running agent. Having the RPC forwarding lets us use the exact same API to target anywhere in the cluster. It looks like this is a bit of a mixed bag with the Agent API though, where some commands support forwarding and others do not.
For this API I'd really like to be able to support forwarding to remote nodes... it's a really handy capability to have whenever we've added it. That being said, having the capability work locally and then incrementally adding forwarding support later wouldn't be entirely unreasonable either. (I do want to cc @arodd here and get his thoughts on this, as he may have encountered customers looking for this.)
> Actually seems like consul has something similar with consul reload.
Yup. But in terms of the command line, I don't love having a top-level command for this rather than nesting it under another subcommand, just because the list of top-level commands gets crowded.
A wrinkle that occurred to me here is that there's an architectural challenge getting the RPCs to support config reload. You can sort of think of the Nomad agent as running 3 different top-level components: the "agent" (the HTTP API, config parsing, signal handling in command/agent/), the "server" (the control plane, schedulers, Raft, etc. in nomad/) and the "client" (spawns the workloads, in client/). The server and client don't "speak" HTTP, only RPC. But only the agent component has access to the config reload, and it makes a method call to the server and client to reload their state from a new configuration.
So if we were to have RPC forwarding support, we'd need to thread a means of communicating back from the RPC handler to the agent in order to trigger the config reload there, which would then in turn trigger the server/client component reloads. That's doable (a simple matter of code, as they say) but certainly complicates the implementation further. So that makes me lean more towards trying to break this feature up into chunks:
- API for config reload in the agent
- CLI for triggering that API
- RPC handlers for forwarding the request to other agents
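The "threading back" problem described above can be sketched roughly as follows. This is a simplified illustration, not Nomad's actual code: `Config`, `component`, `Agent`, and `rpcReloadHandler` are all hypothetical types, and the assumption is that the agent injects a callback into the RPC layer at startup so the handler never needs to import the agent package.

```go
package main

import (
	"errors"
	"fmt"
)

// Config stands in for the parsed agent configuration (hypothetical).
type Config struct {
	LogLevel string
}

// component models the server or client, which can only reload state from
// a configuration that something else has already parsed.
type component struct {
	name     string
	logLevel string
}

func (c *component) Reload(cfg *Config) error {
	if cfg == nil {
		return errors.New(c.name + ": nil config")
	}
	c.logLevel = cfg.LogLevel
	return nil
}

// Agent owns config parsing and fans a reload out to its components.
type Agent struct {
	load   func() (*Config, error) // re-reads configuration, e.g. from disk
	server *component
	client *component
}

func (a *Agent) Reload() error {
	cfg, err := a.load()
	if err != nil {
		return fmt.Errorf("loading config: %w", err)
	}
	for _, c := range []*component{a.server, a.client} {
		if c == nil {
			continue // an agent may run server-only or client-only
		}
		if err := c.Reload(cfg); err != nil {
			return err
		}
	}
	return nil
}

// rpcReloadHandler lives in the server/client layer. It cannot see the
// Agent type, so it only holds a callback that the agent supplies.
type rpcReloadHandler struct {
	reloadFn func() error
}

func (h *rpcReloadHandler) Reload() error {
	if h.reloadFn == nil {
		return errors.New("reload not supported on this node")
	}
	return h.reloadFn()
}

func main() {
	agent := &Agent{
		load:   func() (*Config, error) { return &Config{LogLevel: "DEBUG"}, nil },
		server: &component{name: "server", logLevel: "INFO"},
		client: &component{name: "client", logLevel: "INFO"},
	}
	// The agent hands its own Reload method to the RPC layer.
	handler := &rpcReloadHandler{reloadFn: agent.Reload}

	if err := handler.Reload(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(agent.server.logLevel, agent.client.logLevel)
}
```

The callback keeps the dependency pointing one way (agent knows about server/client, not the reverse), which is the extra plumbing that makes the forwarding chunk worth deferring.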
I will probably spin up another issue that is smaller in scope and blast radius, targeted at a local or single host reloading its configuration (that is also closer to where I am comfortable, given my knowledge of the inner workings of Nomad). I do think this is a valuable feature, but I may not be the right person to work on it at this time, and I do not want to bog anyone down hand-holding me through it.
That's fine if you want to address a smaller scope but let's keep the design discussion here so that we're not losing track of related work.
I'll write this up similar to a FR issue to keep it more in line. Also, @arodd if you have any thoughts on this. Hopefully this all makes sense and is clear 🤞. If I am completely off on this, feel free to let me know. This is more focused on those who run self-managed/hybrid environments where it is not simple enough to repave instances and have them load up new certificates on launch.
Proposal
Currently, dynamically reloading a Nomad agent's configuration requires sending a SIGHUP signal to the agent process on the host, as mentioned earlier on in this issue. This operation makes host-level access necessary, requiring either the same user account running the Nomad agent or an account with elevated permissions to trigger this. It creates a significant challenge for automated and secure workflows in tightly controlled environments.
This proposal introduces a new API endpoint and corresponding CLI command to trigger this reload logic via Nomad's API. It allows the reload to be performed by any process (like a Vault Agent, user, or automated service) that can authenticate with a properly scoped Nomad ACL token; ideally, Vault will also vend these short-lived tokens. This would allow operators to perform a reload without requiring host-level access or accounts.
Proposed CLI and API
Both examples assume ACLs are enabled and a token with agent write access is available.
CLI Command
nomad agent reload -token <token>
I thought of nomad config reload, but under agent might make more sense.
Vault does not have this functionality and, as mentioned earlier, Consul has consul reload, but I don't believe that is the direction we should go for the command.
API Endpoint
curl -X PUT \
--header "X-Nomad-Token: <token>" \
http://127.0.0.1:4646/v1/agent/reload
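For callers that would rather not shell out to curl, the same request could look roughly like this in Go. This is a sketch of the proposed endpoint only, not an existing Nomad API: `reloadAgent` and `stubAgent` are hypothetical names, the stub server merely stands in for an agent so the example runs without a cluster, and a 200 response on success is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// reloadAgent issues PUT /v1/agent/reload against addr, passing the ACL
// token in the X-Nomad-Token header. The endpoint is the one proposed in
// this issue; treating 200 as success is an assumption.
func reloadAgent(client *http.Client, addr, token string) error {
	req, err := http.NewRequest(http.MethodPut, addr+"/v1/agent/reload", nil)
	if err != nil {
		return err
	}
	req.Header.Set("X-Nomad-Token", token)

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("agent reload failed: %s", resp.Status)
	}
	return nil
}

// stubAgent is a hypothetical stand-in for a Nomad agent. It accepts the
// reload request only when the expected token is presented.
func stubAgent(wantToken string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut || r.URL.Path != "/v1/agent/reload" {
			http.NotFound(w, r)
			return
		}
		if r.Header.Get("X-Nomad-Token") != wantToken {
			http.Error(w, "Permission denied", http.StatusForbidden)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := stubAgent("example-token")
	defer srv.Close()

	if err := reloadAgent(srv.Client(), srv.URL, "example-token"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reload requested")
}
```

Against a real agent, `addr` would be something like http://127.0.0.1:4646 and the token would need agent write access.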
Attempted Solutions (without API)
The primary driver for this feature is to simplify mTLS certificate rotation in long-lived, self-managed cloud environments. I have been primarily working with the Vault agent to automate this process, but also tested Nomad using workload identity with Vault to perform this.
Attempt 1: Vault Agent
A Vault Agent can be run on the host to fetch and write new certificates.
To complete the rotation, this agent:
- writes the new certs to disk
- triggers the SIGHUP on the Nomad agent
- updates Nomad's config file with the new TLS file names
The second step forces a decision to be made, either run the Vault Agent as the same higher-privilege user as Nomad or grant it sudo-like permissions to send the signal. This goes against environments that force least privilege and is often prohibited in tightly-controlled environments.
Attempt 2: Nomad Job with Workload Identity
I attempted a different approach that did not require the Vault Agent on the host: running a Nomad system job that uses Workload Identity to fetch certificates from Vault. This job used a template block to write the new certificates to the host and to update the Nomad TLS config block.
Use-cases
- Provide an API/CLI workflow for reloading agents remotely and/or against a group of servers.
- Leverage Nomad to trigger the reload and not a user or service with permissions to trigger the reload.
- Step towards simplifying the mTLS update process. I would like to look at potentially just having Nomad work when updating certificates with the same file name and path, without requiring the tls block to be updated for it to reload mTLS.
- Maybe a Terraform action on this with the Nomad provider in the future or in an Ansible playbook?
I don't think that all adds much to the design other than context around the "why" questions (which is fine). This doesn't change the chunks that I've described if we want to allow this to be extended in the future. We also shouldn't do nomad agent reload because that would add a subcommand on the agent subcommand. So the implementation should be:
- PR 1: a new HTTP API endpoint /v1/agent/reload that requires agent:write and reloads the configuration of the agent that's hit
- PR 2: a CLI nomad config reload that uses this API endpoint
And the "future work" is:
- PR 3: an RPC endpoint that can handle the reload command to forward this API to other nodes, along with the query params we'd need on the HTTP API to configure that.
I haven't been involved in conversations relating to this just yet, but based on my experience configuring our other products while working in PS, the API approach allowing a local agent reload (or even a remote one via the HTTP agent API rather than RPC forwarding) is definitely something that would have been useful in certain scenarios we ran into that required unprivileged (system) configuration updates where the CLI/API keys would still be available.
Starting with the first two PRs, before better understanding the demand and the applied workflows that would require the remote RPC logic, makes sense to me.
I should have the first PR out today for the API. I tested the changes locally and on remote instances with client/server, covering both log level and TLS changes.