nomad icon indicating copy to clipboard operation
nomad copied to clipboard

Inconsystent cluster state if single consul/nomad server was rebootstrapped

Open EugenKon opened this issue 1 year ago • 1 comments

Nomad version

$ nomad --version
Nomad v1.8.2
BuildDate 2024-07-16T08:50:09Z
Revision 7f0822c1e4f25907d9f60e2d595411950dd1bd28

Operating system and Environment details

uname -a
Linux nomad-consul-server-0-c4cf 6.8.0-1016-aws #17-Ubuntu SMP Mon Sep  2 13:48:07 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Issue

We had a cluster with one EC2 instance with nomad and consul server. And two autoscalled EC2 instances with nomad and consul client. Cluster was deployed and works well.

Reproduction steps

Deploy cluster with one node with nomad/consul server and two instances with nomad/consul clients. Terminate host with nomad/consul server and create the new one. Nomad clients are not autojoined to the new nomad server. Though consul client does.


We did some changes to our configuration and run terraform plan/apply which terminates host with nomad/consul server. This is expected. When new host was created and nomad/consul rebootstrapped. Jobs on client nodes are still run and our application works well. Through consul WebUI we can see jobs which were run. 2024-10-15_16-51 nomad restart But nomad WebUI shows nothing: 2024-10-15_16-50 nomad restart Consul server is able to see all nodes: itself and two autoscale nodes.

This is not expected because the old consul client probably has the old keys and should not communicate with the new consul server.

2024-10-15_16-53 nodes are visible

When I checked nomad logs I can see 'permission denied' messages:

Oct 15 20:47:04 nomad-www-autoscale-5b54 nomad[4596]: client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.UpdateStatus server=172.31.202.9:4647
Oct 15 20:47:04 nomad-www-autoscale-5b54 nomad[4596]: client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.UpdateStatus server=172.31.202.9:4647
Oct 15 20:47:04 nomad-www-autoscale-5b54 nomad[4596]: client: error heartbeating. retrying: error="failed to update status: rpc error: Permission denied" period=28.275141872s
Oct 15 20:47:20 nomad-www-autoscale-5b54 nomad[4596]:     2024-10-15T20:47:20.693Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:47:20 nomad-www-autoscale-5b54 nomad[4596]:     2024-10-15T20:47:20.693Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:47:20 nomad-www-autoscale-5b54 nomad[4596]:     2024-10-15T20:47:20.693Z [ERROR] client: error querying node allocations: error="rpc error: Permission denied"
Oct 15 20:47:20 nomad-www-autoscale-5b54 nomad[4596]: client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:47:20 nomad-www-autoscale-5b54 nomad[4596]: client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:47:20 nomad-www-autoscale-5b54 nomad[4596]: client: error querying node allocations: error="rpc error: Permission denied"
Oct 15 20:47:32 nomad-www-autoscale-5b54 nomad[4596]:     2024-10-15T20:47:32.464Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.UpdateStatus server=172.31.202.9:4647
Oct 15 20:47:32 nomad-www-autoscale-5b54 nomad[4596]:     2024-10-15T20:47:32.464Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.UpdateStatus server=172.31.202.9:4647
Oct 15 20:47:32 nomad-www-autoscale-5b54 nomad[4596]:     2024-10-15T20:47:32.464Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: Permission denied" period=28.338532887s
Oct 15 20:47:32 nomad-www-autoscale-5b54 nomad[4596]: client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.UpdateStatus server=172.31.202.9:4647
Oct 15 20:47:32 nomad-www-autoscale-5b54 nomad[4596]: client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.UpdateStatus server=172.31.202.9:4647
Oct 15 20:47:32 nomad-www-autoscale-5b54 nomad[4596]: client: error heartbeating. retrying: error="failed to update status: rpc error: Permission denied" period=28.338532887s

This is expected. The nomad server was rebootstrapped and new token was generated. It should not be able to communicate with that new nomad server. But this is strange. Why the old Consul client was able to communicate with the new consul server? Should Nomad client be able to communicate with the new cluster? eg. new nomad server joins the old nomad nodes on the cluster.

When I restared nomad client service the permission denied messages are gone. And I am able to see this node via Nomad WebUI: 2024-10-15_16-54 after nomad restart

After this action nomad destroyed all jobs on that host.

The same happened with the second autoscale instance: after restarting nomad service the node becomes visible. All jobs were destroyed. Finally nothing is run on the cluster at this moment. We experience downtime.

logs from worker autoscale
░░ The unit nomad.service has entered the 'failed' state with result 'exit-code'.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Unit process 2690 (nomad) remains running after unit stopped.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Unit process 2696 (nomad) remains running after unit stopped.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Unit process 2930 (nomad) remains running after unit stopped.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Unit process 3049 (nomad) remains running after unit stopped.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: Stopped nomad.service - Nomad.
░░ Subject: A stop job for unit nomad.service has finished
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A stop job for unit nomad.service has finished.
░░
░░ The job identifier is 118576 and the job result is done.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Consumed 28min 18.818s CPU time, 79.4M memory peak, 0B memory swap peak.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nomad.service completed and consumed the indicated resources.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Found left-over process 2690 (nomad) in control group while starting unit. Ignoring.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Found left-over process 2696 (nomad) in control group while starting unit. Ignoring.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Found left-over process 2930 (nomad) in control group while starting unit. Ignoring.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: Found left-over process 3049 (nomad) in control group while starting unit. Ignoring.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: Starting nomad.service - Nomad...
░░ Subject: A start job for unit nomad.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nomad.service has begun execution.
░░
░░ The job identifier is 118576.
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: ==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: ==> Config enable_syslog is `true` with log_level=WARN
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: ==> Loaded configuration from /etc/nomad.d/nomad.hcl
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: ==> Starting Nomad agent...
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/data/nomad/client/plugins
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.Register server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.Register server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: client: error registering: error="rpc error: Permission denied"
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:08.973Z [TRACE] plugin.stdio: waiting for stdio data
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:08.982Z [TRACE] plugin.stdio: waiting for stdio data
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  http: UI is disabled
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  http: enable_debug is set to true. This is insecure and should not be enabled in production
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: client: error querying node allocations: error="rpc error: Permission denied"
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  http: UI is disabled
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:  http: enable_debug is set to true. This is insecure and should not be enabled in production
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: ==> Nomad agent configuration:
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:        Advertise Addrs: HTTP: 172.31.89.64:4646
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:             Bind Addrs: HTTP: [0.0.0.0:4646]
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:                 Client: true
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:              Log Level: WARN
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:                 Region: planitar (DC: plntr_dc)
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:                 Server: false
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:                Version: 1.8.2
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]: ==> Nomad agent started! Log data will stream in below:
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.885Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/data/nomad/client/plugins
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.914Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.916Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.919Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.929Z [WARN]  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.965Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.Register server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.965Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.Register server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.967Z [ERROR] client: error registering: error="rpc error: Permission denied"
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.986Z [WARN]  http: UI is disabled
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.986Z [WARN]  http: enable_debug is set to true. This is insecure and should not be enabled in production
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.987Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.987Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=172.31.202.9:4647
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.987Z [ERROR] client: error querying node allocations: error="rpc error: Permission denied"
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.988Z [WARN]  http: UI is disabled
Oct 15 20:55:08 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:08.988Z [WARN]  http: enable_debug is set to true. This is insecure and should not be enabled in production
Oct 15 20:55:08 nomad-worker-autoscale-9d09 systemd[1]: Started nomad.service - Nomad.
░░ Subject: A start job for unit nomad.service has finished successfully
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nomad.service has finished successfully.
░░
░░ The job identifier is 118576.



Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:32.679Z [DEBUG] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:32.734Z [DEBUG] plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:32.795Z [WARN]  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:b059e7589f7bbf6cf753e6677fa129ba74282b87897c76da0750194fad53f112
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]:  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:b059e7589f7bbf6cf753e6677fa129ba74282b87897c76da0750194fad53f112
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]:     2024-10-15T20:55:32.907Z [WARN]  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:5b6ae44fd3635be58cc29553fb6b7bd41a9d7a74e5150db6211446a2aadc545a
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]:  client.driver_mgr.docker: RemoveImage on non-referenced counted image id: driver=docker image_id=sha256:5b6ae44fd3635be58cc29553fb6b7bd41a9d7a74e5150db6211446a2aadc545a
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:32.973Z [DEBUG] plugin: reattached plugin process exited
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:32.973Z [DEBUG] plugin: plugin exited
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:32.983Z [DEBUG] plugin: reattached plugin process exited
Oct 15 20:55:32 nomad-worker-autoscale-9d09 nomad[229886]: 2024-10-15T20:55:32.983Z [DEBUG] plugin: plugin exited
Oct 15 20:55:34 nomad-worker-autoscale-9d09 nomad[229886]: ==> Newer Nomad version available: 1.9.0 (currently running: 1.8.2)


2024-10-15_16-57 after nomad restart on worker

Expected Result

I do not know how nomad was designed. Here is expectation from my side. Because Consul server is able to join consul client and list all current allocations/tasks without problem and Nomad server is able to communicate with nomad client after nomad client service was restarted, the nomad client should be able to detect the new nomad server and list all current its tasks (Or the new nomad server should be able to send some command to the old nomad client and reinitialize its connection without killing current allocations).

Actual Result

If nomad/consul server is rebootstrapped OR instances with nomad/consule client are run couple of minutes before nomad/consul server instance is bootstrapped, then those nomad clients are not able to communicate with nomad/consul server.

EugenKon avatar Oct 15 '24 22:10 EugenKon

@EugenKon what's likely happening here is that Consul clients are using autodiscovery to find Consul servers and then Nomad is trying to use Consul to discover Nomad servers. But "rebootstrapped" isn't clear enough to determine what's happening after that. Are you destroying all the mTLS certs as well as server state? Are you restoring from snapshot? Just starting from nothing and running ACL bootstrap again?

This isn't a suitable production deployment because you have only a single node and you're sharing the same host for Nomad and Consul servers. It's not really clear to me why you would destroy the server and expect anything to work after that. Consul agents have a lot of local state so all that state will be inconsistent with a brand new Consul server. And all the Nomad workloads will be destroyed because the Nomad clients will have allocations that don't exist in the new Nomad server.

If you're completely destroying the server rather than recovering it, you should expect you'll need to wipe all state on the clients as well. In any case, there's no way for us to diagnose further without the Nomad and Consul ACL policies and configuration files. But I'd recommend just not doing what you're doing here.

tgross avatar Oct 16 '24 12:10 tgross

Didn't hear back from you and your cluster topology isn't well-supported, so I'm going to close this out.

tgross avatar Nov 08 '24 16:11 tgross

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Mar 09 '25 02:03 github-actions[bot]