empirical model of Nomad client failover
The behavior of Nomad client heartbeats under various failure scenarios has been a subject of internal discussion, resulting in new documentation for Client Heartbeats. This issue is a report of experiments done around a particular scenario: a client that's connected to a non-leader server that is no longer able to accept heartbeats.
tl;dr
- Nomad client failover behaves as expected in the face of server shutdowns.
- Nomad client failover will frequently miss a heartbeat in the face of network splits.
- Adding a tunable for the yamux keep-alive timeout (currently left at the default of 30s) could give Nomad administrators the ability to have clients fail over more quickly in the face of network-level outages, but at the risk of increased flapping during server-level outages such as upgrades. It's not at all clear this is a good tradeoff, especially when they can simply increase the heartbeat_grace window.
- The surprisingly large yamux stream connect timeout of 75s appears to have no impact on Nomad's behavior.
- You can find some cool tracing tools in https://github.com/tgross/nomad-yamux-model and try them out in your own cluster.
- You can find a simplified model of client failover in https://github.com/tgross/nomad-yamux-model to help you understand the expected behavior better.
Background
The goal here was to verify our own understanding of how the clients behave, and to produce a simplified model application that can exercise the same behaviors. If the simplified model behaves under test the same way that the real servers do, then we can say the model is validated as a model (at least with respect to the specific behavior under test). This can be useful for both expert Nomad users and Nomad developers who are trying to reason through what they're seeing, because they can play with the simplified model to demonstrate the behaviors without having to run it on a real cluster. This should be considered a supplement to actually reading the code. :grinning:
I examined 4 different scenarios, each with 3 different test cases. My expectation was that when the client can no longer reach its server, it would promptly detect this and switch to another server. The following table summarizes the results:
| Scenario # | Description | Server exited gracefully | Server SIGKILL | Netsplit |
|---|---|---|---|---|
| 1 | Model servers localhost | immediate failover | immediate failover | hangs until yamux keepalive |
| 2 | Model servers separate VMs | immediate failover | immediate failover | hangs until yamux keepalive |
| 3 | Real Nomad separate VMs | immediate failover | immediate failover | hangs until yamux keepalive |
| 4 | Real Nomad separate EC2 hosts with mTLS | immediate failover | immediate failover | hangs until yamux keepalive |
Scenarios 1 & 2: Model Servers
Scenarios 1 and 2 used a model application designed to simulate the Nomad server's yamux accept loop and handlers, as well as the expected Nomad client behavior. The source code can be found at https://github.com/tgross/nomad-yamux-model and is intended to be small enough in scope to be readable by a non-expert.
Each server binds to a particular address and port. The client is provided with a list of all the servers. The client picks the first server in the list, and then:
- Opens a TCP connection to it
- Opens a yamux session to it
- Then, once per second:
- Opens a yamux stream
- Sends a "ping" request
- Reads the response
If the heartbeat fails for any reason, the client rotates its list of servers and starts over with a new TCP connection.
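For readers who don't want to dig into the repo, the sketch below shows the shape of that loop. It is not the actual nomad-yamux-model source; it assumes the github.com/hashicorp/yamux library and a made-up newline-delimited ping/pong wire format, but the structure is the same: one TCP connection and yamux session per server, one stream per heartbeat, and rotation of the server list on any error.

```go
// Minimal sketch of the model client's heartbeat loop (not the actual
// nomad-yamux-model source); the wire format here is hypothetical.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	// addresses are placeholders; pass your own server list
	run([]string{"127.0.0.1:7078", "127.0.0.1:7079"})
}

// run heartbeats against the first server in the list and rotates the list
// whenever a heartbeat fails for any reason.
func run(servers []string) {
	for {
		if err := heartbeat(servers[0]); err != nil {
			log.Printf("send error: %v, reconnecting to next server", err)
			servers = append(servers[1:], servers[0])
		}
	}
}

// heartbeat opens a TCP connection and a yamux session, then sends one ping
// per second, each on its own stream. It returns on the first error.
func heartbeat(addr string) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	// DefaultConfig enables keep-alives on a 30s interval; that interval is
	// what bounds failover time during a netsplit.
	session, err := yamux.Client(conn, yamux.DefaultConfig())
	if err != nil {
		return err
	}
	defer session.Close()

	for range time.Tick(time.Second) {
		stream, err := session.Open()
		if err != nil {
			return fmt.Errorf("failed to open session: %w", err)
		}
		log.Println("wrote ping...")
		if _, err := fmt.Fprintln(stream, "ping"); err != nil {
			stream.Close()
			return err
		}
		if _, err := bufio.NewReader(stream).ReadString('\n'); err != nil {
			stream.Close()
			return fmt.Errorf("read error: %w", err)
		}
		log.Println("got pong!")
		stream.Close()
	}
	return nil
}
```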
Scenario 1 used 2 servers and 1 client running on the same host. Scenario 2 used 3 servers and 1 client running on separate Linux VMs (Ubuntu 20.04 LTS, "stock" networking tunables). The behaviors were identical in both cases.
Test Case 1
In the first test case, the server is shut down gracefully. The server receives 1 ping before its shutdown is complete and it closes the stream. The client detects this as a failure to send, because the session has been gracefully shut down, and immediately transfers to the next server.
logs: scenario 1 and 2, test case 1
First server shutdown:
$ ./bin/yamux server 127.0.0.1:7078
2022/10/10 15:04:21 listening for incoming connections on 127.0.0.1:7078
2022/10/10 15:04:53 pong
...
2022/10/10 15:04:58 pong
^C2022/10/10 15:04:59 shutting down in 1sec...
2022/10/10 15:04:59 pong
2022/10/10 15:05:00 bye!
The second server receives a ping as scheduled:
$ ./bin/yamux server 127.0.0.1:7079
2022/10/10 15:04:34 listening for incoming connections on 127.0.0.1:7079
2022/10/10 15:05:01 pong
2022/10/10 15:05:02 pong
Client fails over as expected:
$ ./bin/yamux client 127.0.0.1:7078 127.0.0.1:7079
2022/10/10 15:04:53 wrote ping...
2022/10/10 15:04:53 got pong!
...
2022/10/10 15:04:58 wrote ping...
2022/10/10 15:04:58 got pong!
2022/10/10 15:04:59 wrote ping...
2022/10/10 15:04:59 got pong!
2022/10/10 15:05:00 send error: failed to open session: session shutdown, reconnecting to next server
2022/10/10 15:05:01 wrote ping...
2022/10/10 15:05:01 got pong!
Test Case 2
In test case 2, the server does not exit gracefully and instead is sent SIGKILL. The behavior of the client is identical to test case 1, because the OS closes the TCP connection even if the application is no longer around to do so. Yamux handles this case gracefully, as one would expect.
Test Case 3
In test case 3, the server is netsplit via iptables. In both scenarios 1 and 2, incoming packets are dropped before they reach the server. Remember that in yamux, the client can open a stream and immediately send the RPC without waiting for a packet to come back. In this test case, we see that the client opens the stream, sends the RPC, and then hangs waiting for the reply. This state lasts until the yamux keepalive timeout expires (30s by default, which the model application also uses). At this point, the client closes the stream and immediately transfers to the next server.
logs: scenario 1 and 2, test case 3
Client fails over after keepalive expires:
2022/10/10 19:30:20 got pong!
2022/10/10 19:30:21 wrote ping...
2022/10/10 19:30:52 [ERR] yamux: keepalive failed: i/o deadline reached
2022/10/10 19:30:52 send error: read error: EOF
, reconnecting to next server
2022/10/10 19:30:53 wrote ping...
2022/10/10 19:30:53 got pong!
2022/10/10 19:30:54 wrote ping...
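For reference, both timeouts called out in the tl;dr come from yamux's client-side configuration rather than anything Nomad-specific. The snippet below is a sketch that prints the relevant fields of hashicorp/yamux's DefaultConfig; the values noted in the comments are the upstream defaults at the time of writing and are worth double-checking against the yamux version vendored into your Nomad build.

```go
// Sketch: print the yamux defaults that govern the behavior seen above.
// Values in the comments are the upstream defaults at the time of writing;
// verify against the yamux version vendored by your Nomad build.
package main

import (
	"fmt"

	"github.com/hashicorp/yamux"
)

func main() {
	cfg := yamux.DefaultConfig()
	fmt.Println("EnableKeepAlive:", cfg.EnableKeepAlive)               // true
	fmt.Println("KeepAliveInterval:", cfg.KeepAliveInterval)           // 30s: bounds failover during a netsplit
	fmt.Println("ConnectionWriteTimeout:", cfg.ConnectionWriteTimeout) // 10s: write deadline for in-flight frames
	fmt.Println("StreamOpenTimeout:", cfg.StreamOpenTimeout)           // 75s: the stream connect timeout noted in the tl;dr
}
```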
Real Nomad Servers
In Scenarios 3 and 4, the same test cases were run against real Nomad clusters. Scenario 3 used 3 Nomad servers and 4 Nomad clients, all running on separate Vagrant VMs. Scenario 4 used 3 Nomad servers and 1 Nomad client, running on AWS EC2 hosts. Both scenarios used Ubuntu 20.04 LTS with "stock" networking tunables.
To monitor the test targets without interfering with timing, the tests used bpftrace scripts that record timestamps and arguments. The source for these scripts can be found at https://github.com/tgross/nomad-yamux-model
Installing bpftrace
For a very quick-and-dirty semi-portable install of bpftrace you can do the following. For the full suite of bpftrace tools, you'll need Linux kernel headers and you'll want the example tools from their repo.
sudo docker run -v $(pwd):/output \
quay.io/iovisor/bpftrace:master-vanilla_llvm_clang_glibc2.23 \
/bin/bash -c "cp /usr/bin/bpftrace /output"
sudo mv ./bpftrace /usr/bin/bpftrace
On the leader, the test traced the node heartbeater's resetHeartbeatTimer method (ref nomad/heartbeat.go#L90), which is called every time a node heartbeats. The trace program is as follows:
#!/usr/bin/env bpftrace
uprobe:/usr/local/bin/nomad:"github.com/hashicorp/nomad/nomad.(*nodeHeartbeater).resetHeartbeatTimer"
{
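// resetHeartbeatTimer takes the node ID as its only string argument; with
// Go's register-based calling convention the string's pointer and length
// land in bx and cx (see the note on addresses below).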
printf("[%s] got heatbeat for %s\n",
strftime("%H:%M:%S", nsecs),
str(reg("bx"), reg("cx"))
);
}
Note on addresses
Note that the register addresses here are obtained by disassembling the method. It's possible that different versions of Nomad may need different register addresses.
objdump --disassemble="github.com/hashicorp/nomad/nomad.(*nodeHeartbeater).resetHeartbeatTimer" $(which nomad) -S -Mintel
This script produces output like:
$ sudo bpftrace ./server.bt
Attaching 1 probe...
[13:55:07] got heatbeat for 461190cc-3265-ab10-9174-0ba0cef19c59
[13:55:08] got heatbeat for b171b848-45ca-a7b5-f8b9-85648658cb58
[13:55:16] got heatbeat for 2a6dffad-280f-3ab6-38f6-d67a15bdbc00
[13:55:17] got heatbeat for 07758832-9a07-2572-f19b-49a79adbd40c
[13:55:22] got heatbeat for b171b848-45ca-a7b5-f8b9-85648658cb58
[13:55:25] got heatbeat for 461190cc-3265-ab10-9174-0ba0cef19c59
[13:55:28] got heatbeat for 07758832-9a07-2572-f19b-49a79adbd40c
[13:55:30] got heatbeat for 2a6dffad-280f-3ab6-38f6-d67a15bdbc00
On the client, we're interested in client.(*Client).updateNodeStatus, which triggers the heartbeat, and helper/pool.(*ConnPool).RPC, which sends all RPCs. The client trace program is as follows:
#!/usr/bin/env bpftrace
#include <linux/socket.h>
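// These structs approximate the memory layout of Go's net.TCPAddr: an IP
// byte-slice header (pointer, length, capacity) followed by the port, so the
// probe below can decode the server address the RPC is being sent to.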
struct ipv4 {
uint8_t bytes[16];
};
struct TcpAddr {
struct ipv4* addr;
long len;
long cap;
int port;
};
uprobe:/usr/local/bin/nomad:"github.com/hashicorp/nomad/client.(*Client).updateNodeStatus"
{
printf("[%s] updating node status\n",
strftime("%H:%M:%S", nsecs)
);
}
uprobe:/usr/local/bin/nomad:"github.com/hashicorp/nomad/helper/pool.(*ConnPool).RPC"
{
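// With the register ABI, the RPC arguments land as: region string pointer
// and length in bx/cx, the net.Addr interface's data pointer in si, and the
// RPC method name's pointer and length in r8/r9.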
$addr = (struct TcpAddr *) reg("si");
$bytes = $addr->addr->bytes;
printf("[%s] sending %s RPC to addr=%d.%d.%d.%d:%d in region=%s\n",
strftime("%H:%M:%S", nsecs),
str(reg("r8"), reg("r9")),
$bytes[12], $bytes[13], $bytes[14], $bytes[15], $addr->port,
str(reg("bx"), reg("cx"))
);
}
When running on a client, this script produces output like the following:
$ sudo bpftrace ./client.bt
Attaching 2 probes...
[17:07:46] updating node status
[17:07:46] sending Node.UpdateStatus RPC to addr=192.168.56.30:4647 in region=global
[17:08:02] updating node status
[17:08:02] sending Node.UpdateStatus RPC to addr=192.168.56.30:4647 in region=global
Test Cases 1 and 2
In both these test cases, the client is connected to a non-leader server. That server is either shut down gracefully or sent a SIGKILL. In both cases, the client transfers to the next server for the next heartbeat, because the yamux client receives a close.
[18:01:01] sending Node.UpdateStatus RPC to addr=192.168.56.10:4647 in region=global
[18:01:12] sending Node.GetClientAllocs RPC to addr=192.168.56.20:4647 in region=global
[18:01:20] updating node status
[18:01:20] sending Node.UpdateStatus RPC to addr=192.168.56.20:4647 in region=global
Test Case 3
In this test case, the non-leader server is netsplit via iptables. As with the model application, one heartbeat RPC hangs waiting for a response until the yamux keep-alive expires, at which point the client fails over to the next server. In the trace output below, traffic was blocked at 18:42:08 (but it looks like one request slipped in before the filter was set):
[18:41:58] updating node status
[18:41:58] sending Node.UpdateStatus RPC to addr=172.31.84.142:4647 in region=global
[18:42:09] updating node status
[18:42:09] sending Node.UpdateStatus RPC to addr=172.31.84.142:4647 in region=global
[18:42:35] sending Status.Peers RPC to addr=172.31.80.81:4647 in region=global
[18:42:35] sending Node.GetClientAllocs RPC to addr=172.31.80.81:4647 in region=global
[18:42:37] updating node status
[18:42:37] sending Node.UpdateStatus RPC to addr=172.31.80.81:4647 in region=global
The Nomad client logs for this period show that the yamux keepalive failed, just as was seen in the model application:
2022-10-10T18:42:35.349Z [ERROR] client: yamux: keepalive failed: i/o deadline reached
2022-10-10T18:42:35.349Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=172.31.84.142:4647
2022-10-10T18:42:35.349Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=172.31.84.142:4647
2022-10-10T18:42:35.350Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.880397295s
2022-10-10T18:42:35.349Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=172.31.84.142:4647
2022-10-10T18:42:35.351Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["nomad-e2e-shared-hcp-consul"]
2022-10-10T18:42:35.367Z [INFO] client.consul: discovered following servers: servers=[172.31.86.9:4647, 172.31.80.81:4647, 172.31.84.142:4647]
2022-10-10T18:42:37.236Z [DEBUG] client: state updated: node_status=ready
2022-10-10T18:42:37.236Z [WARN] client: missed heartbeat: req_latency=5.546527ms heartbeat_ttl=10.634447865s since_last_heartbeat=38.292386425s
The Nomad server logs show that the TTL expired:
2022-10-10T18:42:36.642Z [WARN] nomad.heartbeat: node TTL expired: node_id=461190cc-3265-ab10-9174-0ba0cef19c59
I've got this as an issue mostly so we can point external folks to it and because we might have a small amount of associated docs that can come from here. But I don't know that there's anything actionable outside of docs.
Hey @tgross, thanks a lot for this detailed scenario testing. We seem to have run into scenario 3 (at least that's what the logs indicate) when hard resetting a Nomad server. Is it possible that SIGKILL-ing a Nomad server in scenario 3 leads to a proper failover because the OS immediately shuts down the TCP connection, but in the case of a hard server poweroff that doesn't happen, and the outcome would be similar to a netsplit because the TCP connection turns stale?
Hi @ygersie! That seems likely, and is pretty much the difference between test case 2 and test case 3 -- the OS never gets a chance to send a close packet (at either the TCP or yamux layer).
Just by way of follow-up on this issue: a couple of us have been working on-and-off on this, including a node simulator we can use to simulate large clusters, and that may lead to some improvements to heartbeating if we can get some time to focus on it after the new year (other priorities allowing, of course).
@tgross That'd be great, as it's definitely not ideal to have a timeout at the yamux layer that can lead to clients being marked as down. Wouldn't it make sense to ensure the heartbeat timeout is always lower than the min_heartbeat_ttl, though?