Nomad dies if HUP is sent during agent initialization
Nomad version
Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
Operating system and Environment details
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.2 LTS
Release:        16.04
Issue
We're provisioning Nomad with a Chef recipe that sends a reload on config updates and a restart on binary updates:
hashistack_bring_binary 'nomad' do
  url node['nomad']['download_url']
  unzip true
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end

template '/etc/nomad.d/agent.hcl' do
  source 'nomad.hcl.erb'
  owner 'nomad'
  group 'root'
  mode '0644'
  sensitive true
  notifies :reload_or_try_restart, 'systemd_unit[nomad.service]', :delayed
end

systemd_unit 'nomad.service' do
  content <<-EOU.gsub(/^\s+/, '')
    [Unit]
    Description=Nomad
    Documentation=https://nomadproject.io/docs/
    Requires=network-online.target
    After=network-online.target
    [Service]
    LimitNOFILE=65536
    Restart=on-failure
    ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
    ExecReload=/bin/kill -HUP $MAINPID
    RestartSec=5s
    [Install]
    WantedBy=multi-user.target
  EOU
  action [ :create, :enable ]
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end
Chef triggers a restart and, immediately after that, a reload. On a new machine where the Nomad data directory has not been initialized yet, if Nomad receives the HUP it just exits with 0:
# systemctl status nomad
● nomad.service - Nomad
Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2018-02-19 09:08:02 UTC; 38s ago
Docs: https://nomadproject.io/docs/
Main PID: 23627 (code=killed, signal=HUP)
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Started Nomad.
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]: Loaded configuration from /etc/nomad.d/agent.hcl
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]: ==> Starting Nomad agent...
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloading Nomad.
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloaded Nomad.
On machines where the data directory already exists, Nomad starts and reloads successfully.
Reproduction steps
Send a HUP signal while the Nomad agent is still initializing the node (the window is very short).
Hi, thanks for opening this issue. We've added this to our team's near-term roadmap.
I just ran into this issue as well -> a restart needed to update the Vault token (#4593), combined with a reload triggered when Nomad's certificates were updated, resulted in Nomad unexpectedly stopping.
We have hit this a few times and it's always an expensive one to hunt down :/
For us, it happens when we're refreshing trust anchors (root CAs) that are rendered w/ consul-template. Downstream triggers cause a nomad cert rotation at the same time we restart nomad in order to refresh nomad's trust anchors (so we have a restart + sighup shortly after).
I was recently looking in this area of the code base and it's a bit tricky of a situation. It takes a certain amount of time for the agent to get itself setup to a point where it's safe to handle the signal for SIGHUP, and if we were to accept the signal before we could handle it, we'd end up dropping the reload signal in the situation described here instead (which would be worse).
We probably can set up the signal handler almost immediately but have it emit messages that we pick up in a loop that isn't started until later. This would minimize but not entirely get rid of the race condition in signals on startup -- there's just a certain amount of work that needs to be done before we can even know if we're starting up an agent or not.
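Very roughly, that idea looks like the sketch below (made-up names and a toy main, not the actual agent code): register the signal mask on a buffered channel as the very first thing, then only start draining it once initialization is done.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Register interest in the signals immediately, before any expensive
	// setup. Once Notify is called, Go no longer applies its default
	// "exit on SIGHUP" behavior, and the buffer queues signals that arrive
	// while the agent is still initializing.
	signalCh := make(chan os.Signal, 4)
	signal.Notify(signalCh, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM)

	// Stand-in for the expensive agent setup (config parsing, client/server
	// construction). A SIGHUP arriving now just sits in signalCh.
	time.Sleep(5 * time.Second)

	// Only now does the real handler loop start and act on queued signals.
	for sig := range signalCh {
		switch sig {
		case syscall.SIGHUP:
			fmt.Println("reloading configuration")
		default:
			fmt.Println("shutting down on", sig)
			return
		}
	}
}
```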
Thanks for the feedback @tgross. The idea of sending the signal to a channel, but only starting the reader of the signals once Nomad is ready for them, sounds really good!
Any news? This solution seems good to me. Do you need any help with it?
@tgross I'm really disappointed that this issue has sat here for almost 5 years.
I use the latest Nomad v1.4.3 and am still experiencing a strange issue when the Nomad service receives a HUP signal during init. Now it won't crash, but it stays running without being able to receive jobs from the cluster.
Currently, Nomad is 1.x and, in my opinion, still not of production-ready quality. Features such as integrating properly with the systemd notification mechanism (https://github.com/hashicorp/nomad/issues/4466) or this one here should have been prioritized already. Please, if you don't want to resolve issues, at least clean them up.
Hi @elcomtik. Sorry for the frustrating experience. We do intend to fix this but have not prioritized it because we have not found it to be a widespread problem for users. Prioritizing work appropriately is always a challenge. In the past we have been more aggressive about closing "stale" issues, but we found that was a massive disservice to ourselves and users. A bug's age does not seem to be correlated to its relevance or priority.
Do you need any help with it?
A PR would be very welcome as we have no prior art for integrating with systemd to work from and testing system integrations + signals + race conditions is always a significant effort!
We've also encountered this issue on Ubuntu nodes using systemd, when using consul-template to refresh short-lived TLS certificates/keys on nodes. If a node reboots and the consul-template service starts with, or shortly after, the Nomad service, the SIGHUP terminates Nomad and the service doesn't show as failed in systemd. Consul also exhibits this behaviour.
As a workaround we've added ExecStartPre=/bin/sleep 30 to the consul-template.service systemd unit file so consul-template waits a bit for Nomad and Consul to have a chance to start up properly, since it generates templates on startup and then sends a SIGHUP to ask Nomad and Consul to reload their configs.
It seems weird to me that a SIGHUP would cause a graceful shutdown when Nomad is starting, but not when Nomad is actually running.
I've looked at this a couple of times (as I'm affected by it myself), and believe it is a combination of two factors:
- Go's default SIGHUP handling
- Not setting os/signal handlers early enough in service init
Unlike the standard Linux behavior for an uncaught SIGHUP, which terminates the program, Go simply exits:
A SIGHUP, SIGINT, or SIGTERM signal causes the program to exit.
This, along with where the signal handlers are set, notably after a lot of presumably time-consuming startup, means there is a significant window in which the default Go behavior is in effect.
I've not investigated os/signal, but from my runtime observations it does the equivalent of os.Exit(0), immediately exiting, which can sometimes result in odd behavior on subsequent Nomad runs (though I've never witnessed data corruption).
Ideally, the call to signal.Notify would happen much earlier in program startup, even if the handler loop still waited until the end.
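To make that window visible, here is a tiny standalone program (not Nomad code) you can poke with `kill -HUP`: before signal.Notify runs, the SIGHUP terminates the process; after it runs, the same signal is delivered on the channel instead.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	fmt.Println("pid:", os.Getpid())

	// Phase 1: no handler registered. Sending SIGHUP during this sleep
	// terminates the process (systemd reports code=killed, signal=HUP).
	time.Sleep(10 * time.Second)

	// Phase 2: handler registered. The same SIGHUP is now delivered on the
	// channel instead of killing the process.
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	fmt.Println("handler installed; send SIGHUP again")
	fmt.Println("got:", <-ch)
}
```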
Hi @bazaah!
Unlike the standard linux behavior for uncaught SIGHUP, which terminates the program, Go simply exits:
That's a bit of a red herring here. Go "exiting" is because it doesn't mask the signal so it just does the usual Unix behavior, which is to exit on signal. There's no difference between being "terminated" and "exiting" here.
This, along with where the signal handlers are set, notably after a lot of presumably time consuming startup means there is a significant window in which the default Go behavior is in effect.
Exactly. You're right the startup is a bit time consuming and the gap between the agent process starting and the signal handlers being registered is the root cause of the problem. Those signal handlers unfortunately rely on various bits of the agent to be constructed before they are safe to call, which is why this whole problem exists and we're not just registering the signal handlers first thing. For example, in order to reload the config on SIGHUP, we first need the original config to be parsed and the main server or client goroutines to be running. You also can't just say "oh well make that nil-safe, done" either -- that would create a new bug where signals that arrive before those bits of the agent are constructed will effectively get dropped. (And presumably that's a problem because the whole reason this ticket exists is that folks are trying to send signals very quickly after startup.)
The most reasonable way to do this that I can see is to split the behavior of the signal masks and the "real" signal handlers so that we can register the masks first thing after agent startup, and have them send messages on a channel that the signal handler functions use. That channel would have to be created very early and probably should be buffered, so that signals we're interested in get queued-up for consumption later on.
At least in my environment there are two places this occurred:
- At startup. My Nomad nodes are accompanied by Vault agents, and there's a Before=nomad.service clause so that the relevant ACLs and secrets are templated out before Nomad starts. Unfortunately this doesn't really work in practice because the agent process is (reasonably) considered started as soon as the process comes up. This leads to it reliably triggering a SIGHUP during Nomad's startup, killing the process. This is also where how Go handles uncaught SIGHUPs differs from my expectation: a Nomad agent dying from a SIGHUP is not considered a failure for the purposes of systemd's Restart=on-failure, which is odd and differs from sending a SIGHUP to e.g. grep.
- Multiple signals at the same time. Mostly from having reloads as part of multiple templates' commands. This doesn't happen anymore, and I'm not sure why. Either Nomad got better at handling multiple signals or consul-template is doing some behind-the-scenes dedup of commands to execute.
Both of these are handled by a systemd service / script that ensures Nomad is functional before triggering a reload, and then sleeps for a few seconds so as to debounce rapid reloads.
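The guard is roughly the following shape (a simplified sketch, not the exact script we run; the health endpoint and port are the stock agent defaults, and the timings are arbitrary):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"syscall"
	"time"
)

// Wait until the local Nomad agent answers its health endpoint, then send it
// a SIGHUP, with a short sleep to debounce rapid repeated reloads.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: reload-nomad <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid pid:", os.Args[1])
		os.Exit(1)
	}

	client := &http.Client{Timeout: 2 * time.Second}
	for {
		resp, err := client.Get("http://127.0.0.1:4646/v1/agent/health")
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			break
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}

	time.Sleep(5 * time.Second) // crude debounce
	if err := syscall.Kill(pid, syscall.SIGHUP); err != nil {
		fmt.Fprintln(os.Stderr, "kill:", err)
		os.Exit(1)
	}
}
```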
The most reasonable way to do this that I can see is to split the behavior of the signal masks and the "real" signal handlers so that we can register the masks first thing after agent startup, and have them send messages on a channel that the signal handler functions use. That channel would have to be created very early and probably should be buffered, so that signals we're interested in get queued-up for consumption later on.
Agreed. Admittedly I've got no idea if there are cgo bits that might interfere with this, but the naive approach is to lift signalCh to the start of Run and modify handleSignals() to take the channel as an arg:
func (c *Command) handleSignals(signalCh <-chan os.Signal) int {
WAIT:
	// use signalCh as normal...
}
I'm also not sure if handleSignals() is used elsewhere, but I doubt it, given its infinite-loop properties.
This may also be a symptom of an XY problem. It seems like the use-case for this is to be able to load short lived TLS certificates, and short-lived Vault tokens.
Would it be simpler to detect updates to the TLS cert/key file contents and allow a vault_token_file option where Nomad can load and watch for changes in the Vault token as generated by a Vault agent sink? This would solve the actual problem of "It's hard to use HashiCorp Nomad with secrets issued from HashiCorp Vault" (which, when you say it out loud, seems a little silly given they're both HashiCorp products) without having to solve the general case of "reload any piece of config dynamically super early in Nomad's startup".
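To sketch what that could look like agent-side (purely illustrative; vault_token_file and the reload hook are the proposal, not an existing Nomad option), it is essentially a file watcher that calls the existing reload path directly, so no external SIGHUP is involved:

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchAndReload watches the TLS material and a Vault token file and invokes
// the agent's reload logic when they change. The paths and the reload
// callback are hypothetical placeholders.
func watchAndReload(reload func(), paths ...string) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()

	for _, p := range paths {
		if err := w.Add(p); err != nil {
			return err
		}
	}

	for {
		select {
		case ev := <-w.Events:
			if ev.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				log.Println("change detected in", ev.Name, "- reloading")
				reload()
			}
		case err := <-w.Errors:
			log.Println("watch error:", err)
		}
	}
}

func main() {
	err := watchAndReload(
		func() { log.Println("agent reload would run here") },
		"/etc/nomad.d/tls/agent.crt",
		"/etc/nomad.d/tls/agent.key",
		"/etc/nomad.d/vault.token",
	)
	if err != nil {
		log.Fatal(err)
	}
}
```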
@tgross Is this a similar issue to what is solved by using systemd notify in consul here?
@ahjohannessen arguably that would help (and we have that open in https://github.com/hashicorp/nomad/issues/4466) inasmuch as it would reduce the risk that users send the signal prematurely. That only helps Linux users on systemd-based distros, which admittedly is a large majority of our production install base. But we'd still need to mask signals up to the point of notification to make it actually safe in totally avoiding this problem.
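For reference, the usual shape of that integration is Type=notify in the unit file plus an sd_notify call once the agent is actually ready to handle signals; a rough sketch using the github.com/coreos/go-systemd package (illustrative only, not Nomad's implementation):

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/go-systemd/v22/daemon"
)

func main() {
	// Stand-in for agent initialization: parse config, start client/server.
	time.Sleep(3 * time.Second)

	// Tell systemd we are ready. With Type=notify in the unit file, systemd
	// does not consider the service started until READY=1 arrives, so units
	// ordered After=nomad.service (and operators watching systemctl) know
	// when it is safe to send a reload.
	sent, err := daemon.SdNotify(false, daemon.SdNotifyReady)
	if err != nil {
		log.Println("sd_notify failed:", err)
	} else if !sent {
		log.Println("sd_notify not sent (NOTIFY_SOCKET unset?)")
	}

	select {} // keep running
}
```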
@tgross I understand. The only reason this has my interest is that I use short-lived TLS certificates from Vault with consul-template, which causes Nomad to stop while starting up because of the systemd reload.
Just bumping to say that this is still an issue when using consul-template or vault-agent to template Nomad credentials and reloading as advised in the official docs, e.g. here: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-pki-nomad
Nomad 1.8.0 will ship support for the systemd notify protocol.
@tgross I found this issue because I have the same case described here. What about the docs? I think it's really confusing that this is not described in the official docs for setting up mTLS via consul-template: https://developer.hashicorp.com/nomad/tutorials/integrate-vault/vault-pki-nomad
It's also not described when you set up your own binary-based Nomad installation: https://developer.hashicorp.com/nomad/tutorials/enterprise/production-deployment-guide-vm-with-consul#configure-systemd
It has taken me and others many hours of debugging, so let's make it so that people don't have to deal with this anymore!