Nomad dies if HUP is sent during agent initialization
Nomad version
Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
Operating system and Environment details
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.2 LTS
Release:        16.04
Issue
We're provisioning Nomad with a Chef recipe that sends a reload on config updates and a restart on binary updates:
hashistack_bring_binary 'nomad' do
  url node['nomad']['download_url']
  unzip true
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end

template '/etc/nomad.d/agent.hcl' do
  source 'nomad.hcl.erb'
  owner 'nomad'
  group 'root'
  mode '0644'
  sensitive true
  notifies :reload_or_try_restart, 'systemd_unit[nomad.service]', :delayed
end

systemd_unit 'nomad.service' do
  content <<-EOU.gsub(/^\s+/, '')
    [Unit]
    Description=Nomad
    Documentation=https://nomadproject.io/docs/
    Requires=network-online.target
    After=network-online.target
    [Service]
    LimitNOFILE=65536
    Restart=on-failure
    ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
    ExecReload=/bin/kill -HUP $MAINPID
    RestartSec=5s
    [Install]
    WantedBy=multi-user.target
  EOU
  action [ :create, :enable ]
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end
Chef triggers a restart and, immediately after that, a reload. On a new machine where the Nomad data directory has not been initialized yet, if Nomad receives the HUP it just exits with 0:
# systemctl status nomad
● nomad.service - Nomad
Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2018-02-19 09:08:02 UTC; 38s ago
Docs: https://nomadproject.io/docs/
Main PID: 23627 (code=killed, signal=HUP)
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Started Nomad.
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]: Loaded configuration from /etc/nomad.d/agent.hcl
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]: ==> Starting Nomad agent...
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloading Nomad.
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloaded Nomad.
On machines where the data directory already exists, Nomad starts and reloads successfully.
Reproduction steps
Send a HUP signal while the Nomad agent is still initializing the node (the window is very short).
Hi, thanks for opening this issue. We've added this to our team's near-term roadmap.
I just ran into this issue as well -> a restart needed to update the Vault token (#4593), combined with a reload triggered when Nomad's certificates were updated, resulted in Nomad unexpectedly stopping.
We have hit this a few times and it's always an expensive one to hunt down :/
For us, it happens when we're refreshing trust anchors (root CAs) that are rendered w/ consul-template. Downstream triggers cause a nomad cert rotation at the same time we restart nomad in order to refresh nomad's trust anchors (so we have a restart + sighup shortly after).
I was recently looking in this area of the code base and it's a bit tricky of a situation. It takes a certain amount of time for the agent to get itself setup to a point where it's safe to handle the signal for SIGHUP, and if we were to accept the signal before we could handle it, we'd end up dropping the reload signal in the situation described here instead (which would be worse).
We probably can set up the signal handler almost immediately but have it emit messages that we pick up in a loop that isn't started until later. This would minimize but not entirely get rid of the race condition in signals on startup -- there's just a certain amount of work that needs to be done before we can even know if we're starting up an agent or not.
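Very roughly, that idea looks like the sketch below (made-up names and a toy main, not the actual agent code): register the signal mask on a buffered channel as the very first thing, then only start draining it once initialization is done.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Register interest in the signals immediately, before any expensive
	// setup. Once Notify is called, Go no longer applies its default
	// "exit on SIGHUP" behavior, and the buffer queues signals that arrive
	// while the agent is still initializing.
	signalCh := make(chan os.Signal, 4)
	signal.Notify(signalCh, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM)

	// Stand-in for the expensive agent setup (config parsing, client/server
	// construction). A SIGHUP arriving now just sits in signalCh.
	time.Sleep(5 * time.Second)

	// Only now does the real handler loop start and act on queued signals.
	for sig := range signalCh {
		switch sig {
		case syscall.SIGHUP:
			fmt.Println("reloading configuration")
		default:
			fmt.Println("shutting down on", sig)
			return
		}
	}
}
```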
Thanks for the feedback @tgross. The idea of sending the signal to a channel, but only starting the reader of the signals once Nomad is ready for them, sounds really good!
Any news? This solution seems good to me. Do you need any help with it?
@tgross I'm really disappointed that this issue has sat here for almost 5 years.
I use the latest Nomad v1.4.3 and am still experiencing a strange issue when the Nomad service receives a HUP signal during init. Now it won't crash, but it stays running without being able to receive jobs from the cluster.
Currently, Nomad is 1.x and, in my opinion, still not of production-ready quality. Features such as integrating properly with the systemd notification mechanism (https://github.com/hashicorp/nomad/issues/4466) or this one here should have been prioritized already. Please, if you don't want to resolve issues, at least clean them up.
Hi @elcomtik. Sorry for the frustrating experience. We do intend to fix this but have not prioritized it because we have not found it to be a widespread problem for users. Prioritizing work appropriately is always a challenge. In the past we have been more aggressive about closing "stale" issues, but we found that was a massive disservice to ourselves and users. A bug's age does not seem to be correlated to its relevance or priority.
Do you need any help with it?
A PR would be very welcome as we have no prior art for integrating with systemd to work from and testing system integrations + signals + race conditions is always a significant effort!
We've also encountered this issue on Ubuntu nodes using systemd, when using consul-template to refresh short-lived TLS certificates/keys on nodes. If a node reboots and the consul-template service starts with, or shortly after, the Nomad service, the SIGHUP terminates Nomad and the service doesn't show as failed in systemd. Consul also exhibits this behaviour.
As a workaround we've added ExecStartPre=/bin/sleep 30 to the consul-template.service systemd unit file so consul-template waits a bit for Nomad and Consul to have a chance to start up properly, since it generates templates on startup and then sends a SIGHUP to ask Nomad and Consul to reload their configs.
It seems weird to me that a SIGHUP would cause a graceful shutdown when Nomad is starting, but not when Nomad is actually running.
I've looked at this a couple of times (as I'm affected by it myself), and believe it is a combination of two factors:
- Go's default SIGHUP handling
- Not setting os/signal handlers early enough in service init
Unlike the standard Linux behavior for an uncaught SIGHUP, which terminates the program, Go simply exits:
A SIGHUP, SIGINT, or SIGTERM signal causes the program to exit.
This, along with where the signal handlers are set, notably after a lot of presumably time-consuming startup, means there is a significant window in which the default Go behavior is in effect.
I've not investigated os/signal, but from my runtime observations it does the equivalent of os.Exit(0), immediately exiting, which can sometimes result in odd behavior on subsequent Nomad runs (though I've never witnessed data corruption).
Ideally, the call to signal.Notify would happen much earlier in program startup, even if the handler loop still waited until the end.
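To make that window visible, here is a tiny standalone program (not Nomad code) you can poke with `kill -HUP`: before signal.Notify runs, the SIGHUP terminates the process; after it runs, the same signal is delivered on the channel instead.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	fmt.Println("pid:", os.Getpid())

	// Phase 1: no handler registered. Sending SIGHUP during this sleep
	// terminates the process (systemd reports code=killed, signal=HUP).
	time.Sleep(10 * time.Second)

	// Phase 2: handler registered. The same SIGHUP is now delivered on the
	// channel instead of killing the process.
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	fmt.Println("handler installed; send SIGHUP again")
	fmt.Println("got:", <-ch)
}
```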
Hi @bazaah!
Unlike the standard linux behavior for uncaught SIGHUP, which terminates the program, Go simply exits:
That's a bit of a red herring here. Go "exiting" is because it doesn't mask the signal so it just does the usual Unix behavior, which is to exit on signal. There's no difference between being "terminated" and "exiting" here.
This, along with where the signal handlers are set, notably after a lot of presumably time consuming startup means there is a significant window in which the default Go behavior is in effect.
Exactly. You're right the startup is a bit time consuming and the gap between the agent process starting and the signal handlers being registered is the root cause of the problem. Those signal handlers unfortunately rely on various bits of the agent to be constructed before they are safe to call, which is why this whole problem exists and we're not just registering the signal handlers first thing. For example, in order to reload the config on SIGHUP, we first need the original config to be parsed and the main server or client goroutines to be running. You also can't just say "oh well make that nil-safe, done" either -- that would create a new bug where signals that arrive before those bits of the agent are constructed will effectively get dropped. (And presumably that's a problem because the whole reason this ticket exists is that folks are trying to send signals very quickly after startup.)
The most reasonable way to do this that I can see is to split the behavior of the signal masks and the "real" signal handlers so that we can register the masks first thing after agent startup, and have them send messages on a channel that the signal handler functions use. That channel would have to be created very early and probably should be buffered, so that signals we're interested in get queued-up for consumption later on.
At least in my environment there are two places this occurred:
- At startup. My Nomad nodes are accompanied by Vault agents, and there's a Before=nomad.service clause so that the relevant ACLs and secrets are templated out before Nomad starts. Unfortunately this doesn't really work in practice because the agent process is (reasonably) considered started as soon as the process comes up. This leads to it reliably triggering a SIGHUP during Nomad's startup, killing the process. This is also where how Go handles uncaught SIGHUPs differs from my expectation: a Nomad agent dying from a SIGHUP is not considered a failure for the purposes of systemd's Restart=on-failure, which is odd and differs from sending a SIGHUP to e.g. grep.
- Multiple signals at the same time. Mostly from having reloads as part of multiple templates' commands. This doesn't happen anymore, and I'm not sure why. Either Nomad got better at handling multiple signals or consul-template is doing some behind-the-scenes dedup of commands to execute.
Both of these are handled by a systemd service / script that ensures Nomad is functional before triggering a reload, and then sleeps for a few seconds so as to debounce rapid reloads.
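The guard is roughly the following shape (a simplified sketch, not the exact script we run; the health endpoint and port are the stock agent defaults, and the timings are arbitrary):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"syscall"
	"time"
)

// Wait until the local Nomad agent answers its health endpoint, then send it
// a SIGHUP, with a short sleep to debounce rapid repeated reloads.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: reload-nomad <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid pid:", os.Args[1])
		os.Exit(1)
	}

	client := &http.Client{Timeout: 2 * time.Second}
	for {
		resp, err := client.Get("http://127.0.0.1:4646/v1/agent/health")
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			break
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}

	time.Sleep(5 * time.Second) // crude debounce
	if err := syscall.Kill(pid, syscall.SIGHUP); err != nil {
		fmt.Fprintln(os.Stderr, "kill:", err)
		os.Exit(1)
	}
}
```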
The most reasonable way to do this that I can see is to split the behavior of the signal masks and the "real" signal handlers so that we can register the masks first thing after agent startup, and have them send messages on a channel that the signal handler functions use. That channel would have to be created very early and probably should be buffered, so that signals we're interested in get queued-up for consumption later on.
Agreed. Admittedly I've got no idea if there are cgo bits that might interfere with this, but the naive approach is to lift signalCh to the start of Run and modify handleSignals() to take the channel as an arg:
func (c *Command) handleSignals(signalCh <-chan os.Signal) int {
WAIT:
	// use signalCh as normal...
}
I'm also not sure if handleSignals() is used elsewhere, but I doubt it, given its infinite-loop properties.
This may also be a symptom of an XY problem. It seems like the use-case for this is to be able to load short lived TLS certificates, and short-lived Vault tokens.
Would it be simpler to detect updates to the TLS cert/key file contents and allow a vault_token_file option where Nomad can load and watch for changes in the Vault token as generated by a Vault agent sink? This would solve the actual problem of "It's hard to use HashiCorp Nomad with secrets issued from HashiCorp Vault" (which, when you say it out loud, seems a little silly given they're both HashiCorp products) without having to solve the general case of "reload any piece of config dynamically super early in Nomad's startup".
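To sketch what that could look like agent-side (purely illustrative; vault_token_file and the reload hook are the proposal, not an existing Nomad option), it is essentially a file watcher that calls the existing reload path directly, so no external SIGHUP is involved:

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchAndReload watches the TLS material and a Vault token file and invokes
// the agent's reload logic when they change. The paths and the reload
// callback are hypothetical placeholders.
func watchAndReload(reload func(), paths ...string) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()

	for _, p := range paths {
		if err := w.Add(p); err != nil {
			return err
		}
	}

	for {
		select {
		case ev := <-w.Events:
			if ev.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				log.Println("change detected in", ev.Name, "- reloading")
				reload()
			}
		case err := <-w.Errors:
			log.Println("watch error:", err)
		}
	}
}

func main() {
	err := watchAndReload(
		func() { log.Println("agent reload would run here") },
		"/etc/nomad.d/tls/agent.crt",
		"/etc/nomad.d/tls/agent.key",
		"/etc/nomad.d/vault.token",
	)
	if err != nil {
		log.Fatal(err)
	}
}
```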
@tgross Is this a similar issue to what is solved by using systemd notify in consul here?
@ahjohannessen arguably that would help (and we have that open in https://github.com/hashicorp/nomad/issues/4466) inasmuch as it would reduce the risk that users send the signal prematurely. That only helps Linux users on systemd-based distros, which admittedly is a large majority of our production install base. But we'd still need to mask signals up to the point of notification to make it actually safe in totally avoiding this problem.
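For reference, the usual shape of that integration is Type=notify in the unit file plus an sd_notify call once the agent is actually ready to handle signals; a rough sketch using the github.com/coreos/go-systemd package (illustrative only, not Nomad's implementation):

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/go-systemd/v22/daemon"
)

func main() {
	// Stand-in for agent initialization: parse config, start client/server.
	time.Sleep(3 * time.Second)

	// Tell systemd we are ready. With Type=notify in the unit file, systemd
	// does not consider the service started until READY=1 arrives, so units
	// ordered After=nomad.service (and operators watching systemctl) know
	// when it is safe to send a reload.
	sent, err := daemon.SdNotify(false, daemon.SdNotifyReady)
	if err != nil {
		log.Println("sd_notify failed:", err)
	} else if !sent {
		log.Println("sd_notify not sent (NOTIFY_SOCKET unset?)")
	}

	select {} // keep running
}
```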
@tgross I understand. The only reason this has my interest is that I use short-lived TLS certificates from Vault with consul-template, which causes Nomad to stop while starting up because of the systemd reload.
Just bumping to say that this is still an issue when using consul-template or vault-agent to template Nomad credentials and reloading as advised in the official docs, e.g. here: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-pki-nomad
Nomad 1.8.0 will ship support for the systemd notify protocol.
@tgross I found this issue because I have the same case described here. What about the docs? I think it's really confusing that this is not described in the official docs for setting up mTLS via consul-template: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-pki-nomad
It's also not described when you set up your own binary-based Nomad installation: https://developer.hashicorp.com/nomad/tutorials/enterprise/production-deployment-guide-vm-with-consul#configure-systemd
It has taken me and others many hours of debugging, so let's make it so that people don't have to deal with this anymore!