
Register Consul Service and Checks atomically

Open gahebe opened this issue 7 years ago • 16 comments

Context

We use Nomad with Consul and Docker to deploy Grpc services to a cluster. We use Linkerd as the loadbalancer, it uses Consul to get the availability information of the services.

Scenario

We observe transient errors at the consumer when a new instance of the service is getting started during the service update with Nomad. (Tcp health check is used to avoid routing request to it when the Docker container has started and the service inside is not ready yet to receive requests. Linkerd is configured to only consider services with passing health check status)

Expected behavior

Being able to update the services seamlessly with the above setup.

Versions

Nomad v0.6.3 Consul v1.0.3 Linkerd v1.3.5

Analysis

Linkerd uses blocking queries to get availability info from Consul, for example: /v1/health/service/testservice?index=1366&dc=dc1&passing=true. I captured the network traffic, and this type of query first returns the new service instance without the TCP health check defined, then shortly afterwards returns it with the TCP health check. So it seems Consul considers a service "passing" when it has no health checks defined, and Nomad first registers the new service instance and then registers its health checks in a separate call. The captured network packets (Nomad -> Consul) confirm that this happens in two separate requests:

PUT /v1/agent/service/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 165
Accept-Encoding: gzip
{"ID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Name":"testservice","Port":27123,"Address":"10.0.75.1","Check":null,"Checks":null}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8
PUT /v1/agent/check/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 233
Accept-Encoding: gzip
{"ID":"942d6c5b20bea9520f70ced61336a2987bf9c530","Name":"service: \"testservice\" check",
"ServiceID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Interval":"10s","Timeout":"2s","TCP":"10.0.75.1:27123"}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

I believe Nomad should register the service and its health checks in a single call to Consul; otherwise the new service instance is considered healthy before Nomad has registered its health check.
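
For illustration, here is roughly what a single-call registration could look like against the Consul agent API using the consul/api Go client. This is only a sketch of the behaviour I am asking for, not something Nomad does today:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register the service and its TCP check in a single request, so Consul
	// never sees the instance without a (critical) check attached to it.
	reg := &api.AgentServiceRegistration{
		ID:      "_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
		Name:    "testservice",
		Port:    27123,
		Address: "10.0.75.1",
		Checks: api.AgentServiceChecks{
			{
				Name:     `service: "testservice" check`,
				TCP:      "10.0.75.1:27123",
				Interval: "10s",
				Timeout:  "2s",
				Status:   api.HealthCritical, // stay critical until the first check passes
			},
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}

With the check embedded like this, the passing=true blocking query above would not return the new instance until its TCP check has actually passed.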

gahebe avatar Mar 05 '18 11:03 gahebe

we use linkerd with nomad too, it works fine

  1. https://www.nomadproject.io/docs/job-specification/service.html#initial_status set initial_status to critical, so the service isn't considered healthy until your app says so
  2. put something like 5-10s in your task shutdown_delay https://www.nomadproject.io/docs/job-specification/task.html#shutdown_delay - so linkerd forgets about the alloc before the kill signal is even sent to the process
  3. make sure your task kill_timeout ( https://www.nomadproject.io/docs/job-specification/task.html#kill_timeout ) is long enough for your task to drain in-flight requests
  4. make sure your client config has a high enough max_kill_timeout https://www.nomadproject.io/docs/agent/configuration/client.html#max_kill_timeout
  5. success

jippi avatar Mar 05 '18 11:03 jippi

@jippi Thanks for your help. In our case the shutdown part is not the problem, we confirmed it works fine (we use shutdown_delay, ...). The problem is the new instance is considered healthy before the health check gets registered. We can almost always reproduce that under heavy load. I believe that setting initial_status to critical will not change this for two reasons:

  1. This is a property of the health check itself, and our problem is that Nomad registers the health check separately, a bit later.
  2. If you check the second PUT request (in the first comment) that Nomad sends to Consul, you can see that we do not specify initial_status in the job specification; in that case it already defaults to critical on the Consul side

gahebe avatar Mar 05 '18 12:03 gahebe

Thanks for reporting this @gahebe. We are aware of this and are tracking a fix for a future release.

preetapan avatar Mar 05 '18 20:03 preetapan

Did you ever find a work-around for this issue @gahebe? We have the same problem, and can't afford the downtime on every deployment.

madsholden avatar Apr 29 '20 06:04 madsholden

Has this issue been roadmapped @tgross?

I haven't found a workaround for this issue. I'm really wondering how others are dealing with this. We're using Traefik as a dynamic load balancer, and it often manages to route a request or two to the newly started services before they are marked unhealthy by Consul. We've tried using Traefik health checks in addition to the ones in Consul, but unfortunately new services are marked as healthy before the first health check is done, leaving us with the same problem.

Since we're trying to not lose a single request during deployments, we're unfortunately considering migrating away from Nomad.

madsholden avatar Mar 15 '21 13:03 madsholden

Has this issue been roadmapped @tgross?

It's been moved from the community issues "needs roadmapping" board to our roadmap, which is unfortunately not yet public. It has not been targeted for the upcoming Nomad 1.1, and Nomad 1.2 hasn't been scoped yet. We'd certainly welcome a pull request if this is a show-stopper for you, although it's admittedly a squirrelly area of the code base.

I haven't found a workaround for this issue. I'm really wondering how others are dealing with this. We're using Traefik as a dynamic load balancer, and it often manages to route a request or two to the newly started services before they are marked unhealthy by Consul.

I'm not an expert in traefik, but doesn't it have a polling configuration value for Consul? The race here is fairly short; you might be able to reduce the risk of it (but not eliminate it entirely) by increasing that polling window. I recognize this is a workaround rather than a fix. In any case, I'd expect a production load balancer to be able to handle a one-off dropped request if only because of CAP; there might be some configuration values in traefik you need to tweak here.

tgross avatar Mar 18 '21 12:03 tgross

How hard/breaking would it be to use consul/api.Agent.ServiceRegisterOpts for the first call, so that it already contains the checks to be created, and then perform the necessary cleanup? I have the impression that this call is fairly recent (Consul v1.7) and would thus introduce some breaking changes in Nomad itself.
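
As a rough sketch against the consul/api package (the ReplaceExistingChecks option is the part that needs Consul >= 1.7):

// assumes: import "github.com/hashicorp/consul/api"
//
// Sketch only: register the service together with its checks in one call,
// replacing whatever checks were previously registered under this service ID.
func registerWithChecks(client *api.Client) error {
	reg := &api.AgentServiceRegistration{
		ID:      "_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
		Name:    "testservice",
		Port:    27123,
		Address: "10.0.75.1",
		Checks: api.AgentServiceChecks{
			{TCP: "10.0.75.1:27123", Interval: "10s", Timeout: "2s"},
		},
	}
	return client.Agent().ServiceRegisterOpts(reg, api.ServiceRegisterOpts{
		ReplaceExistingChecks: true, // requires a Consul agent >= 1.7
	})
}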

Cheers,

greut avatar Mar 25 '21 19:03 greut

Call me stupid, but I tried this locally:

diff --git a/command/agent/consul/service_client.go b/command/agent/consul/service_client.go
index 66c00225c..b8340b062 100644
--- a/command/agent/consul/service_client.go
+++ b/command/agent/consul/service_client.go
@@ -941,6 +941,7 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, w
                Meta:              meta,
                Connect:           connect, // will be nil if no Connect stanza
                Proxy:             gateway, // will be nil if no Connect Gateway stanza
+               Checks:            make([]*api.AgentServiceCheck, 0, len(service.Checks)),
        }
        ops.regServices = append(ops.regServices, serviceReg)
 
@@ -952,6 +953,23 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, w
        for _, registration := range checkRegs {
                sreg.checkIDs[registration.ID] = struct{}{}
                ops.regChecks = append(ops.regChecks, registration)
+               serviceReg.Checks = append(serviceReg.Checks, &api.AgentServiceCheck{
+                       CheckID:                registration.ID,
+                       Name:                   registration.Name,
+                       Status:                 registration.Status,
+                       Timeout:                registration.Timeout,
+                       Interval:               registration.Interval,
+                       SuccessBeforePassing:   registration.SuccessBeforePassing,
+                       FailuresBeforeCritical: registration.FailuresBeforeCritical,
+                       TLSSkipVerify:          registration.TLSSkipVerify,
+                       HTTP:                   registration.HTTP,
+                       Method:                 registration.Method,
+                       Header:                 registration.Header,
+                       TCP:                    registration.TCP,
+                       TTL:                    registration.TTL,
+                       GRPC:                   registration.GRPC,
+                       GRPCUseTLS:             registration.GRPCUseTLS,
+               })
        }
 
        return sreg, nil

and it immediately registered the service together with its checks. This seems to work (at least as verified by wireshark and the nomad source) because ServiceClient.sync handles services first, then fetches the checks and only then creates/removes checks. Because the registration IDs match, the check is only created once. @greut do you think you could try that? There is also no need to remove any other code, because the existing code is still needed if a check changes etc…
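
For anyone who wants to double-check this without wireshark, here is a rough sketch of a watcher (my own, using the consul/api client) that flags any instance showing up in passing results without a service-level check:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var waitIndex uint64
	for {
		// Blocking query, the same shape of request that linkerd/traefik issue.
		entries, meta, err := client.Health().Service("testservice", "", true,
			&api.QueryOptions{WaitIndex: waitIndex})
		if err != nil {
			log.Fatal(err)
		}
		waitIndex = meta.LastIndex

		for _, e := range entries {
			hasServiceCheck := false
			for _, c := range e.Checks {
				if c.ServiceID != "" { // node-level checks have an empty ServiceID
					hasServiceCheck = true
				}
			}
			if !hasServiceCheck {
				log.Printf("instance %s is passing without any service check", e.Service.ID)
			}
		}
	}
}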

apollo13 avatar Mar 26 '21 18:03 apollo13

How hard/breaking would it be to go with consul/api.Agent.ServiceRegisterOpts in the first call, meaning it would already contain the checks to be created, then perform the necessary cleanup.

I'm pretty sure Consul supports putting the check in the service registration back to at least 1.0, because I have a project from an old job that does that and just checked the API version there. So I don't think it's a matter of the API not supporting it.

I had an internal discussion with some folks and it looks like a lot of the reasoning for how this was designed was about optimizing writes to Consul's raft DB, but it turns out that Consul already takes care of all of that performance worry under the hood. So now it's just a matter of getting it done. I think the approach @apollo13 has here is promising, but I'd want to make sure that it still works on updates, not just the initial registration.

tgross avatar Mar 26 '21 19:03 tgross

I did change the service check and it updated properly. I did not test any other changes to the service, but looking through the code I am confident that this approach should work. It might (?) be worth looking into transactions to make this an all-or-nothing sync, but then again I am not sure whether that makes sense.

apollo13 avatar Mar 26 '21 19:03 apollo13

See also https://github.com/hashicorp/nomad/issues/10482

tgross avatar Apr 30 '21 18:04 tgross

We seem to have this issue as well.

robloxrob avatar May 25 '21 23:05 robloxrob

I'm not an expert in traefik, but doesn't it have a polling configuration value for Consul? The race here is fairly short; you might be able to reduce the risk of it (but not eliminate it entirely) by increasing that polling window. I recognize this is a workaround rather than a fix. In any case, I'd expect a production load balancer to be able to handle a one-off dropped request if only because of CAP; there might be some configuration values in traefik you need to tweak here.

Unfortunately neither Traefik nor HAProxy (which we also tried) is able to work around this issue directly. They can both be configured with their own health checks, but newly discovered services are healthy by default, which makes it difficult.

The workaround we're using now is a custom HTTP proxy between Traefik and Consul. Using the http provider in Traefik, and having the proxy only forward services that have a registered health check, works, but it is far from ideal.
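
For anyone curious, the idea is roughly the following. This is a simplified sketch rather than our actual proxy; it assumes Traefik v2's http provider format and hard-codes a single service name:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	"github.com/hashicorp/consul/api"
)

func main() {
	consul, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Serve a Traefik dynamic configuration that only contains instances
	// which already have a service-level check registered in Consul.
	http.HandleFunc("/traefik", func(w http.ResponseWriter, r *http.Request) {
		entries, _, err := consul.Health().Service("testservice", "", true, nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}

		var servers []map[string]string
		for _, e := range entries {
			for _, c := range e.Checks {
				if c.ServiceID != "" { // skip instances that only have node-level checks
					servers = append(servers, map[string]string{
						"url": fmt.Sprintf("http://%s:%d", e.Service.Address, e.Service.Port),
					})
					break
				}
			}
		}

		cfg := map[string]interface{}{
			"http": map[string]interface{}{
				"services": map[string]interface{}{
					"testservice": map[string]interface{}{
						"loadBalancer": map[string]interface{}{"servers": servers},
					},
				},
			},
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(cfg)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}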

madsholden avatar Jun 09 '21 09:06 madsholden

@madsholden Could you test my patch? The more people test it, the sooner it will probably get merged.

apollo13 avatar Jun 09 '21 10:06 apollo13

We are running into a similar issue.

Our setup: Waypoint, Nomad, Consul, Traefik

We are doing canary deployments, and no matter what we set shutdown_delay to, it seems that when the old task is being deregistered we see blips where we get 502s from the application. Originally we thought shutdown_delay would resolve the issue, but it almost seems like there is a split second (or a few seconds) where there is no config to route traffic to the proper app.

rlandingham avatar Jul 11 '22 15:07 rlandingham

@rlandingham could you test my patch (see a few comments above)? If it fixes it for you a PR would certainly be the best step forward.

apollo13 avatar Aug 02 '22 12:08 apollo13

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Feb 17 '23 02:02 github-actions[bot]