
cloud.Service update takes a long time (~8 minutes)

Open · lukehoban opened this issue on Dec 16 '17 · 14 comments

For the cloud.Service in the PPC program, we see the following events on updates:

2017-12-15 17:26:13 -0800 service has reached a steady state.
2017-12-15 17:25:51 -0800 service has stopped 1 running tasks: task 7891c7d6-8bb3-43d4-a33b-1066e577a65e.
2017-12-15 17:22:44 -0800 service has begun draining connections on 1 tasks.
2017-12-15 17:22:44 -0800 service deregistered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:18:05 -0800 service registered 1 targets in target-group pulumi-te-ni0cce5b1af
2017-12-15 17:17:34 -0800 service has started 1 tasks: task c1f0173a-5b79-4116-965f-dee9a7ee00d9.
2017-12-15 17:00:07 -0800 service has reached a steady state.

This takes 519 seconds from update start to update steady state.

There is a 279 second delay between registering the new tasks and deregistering the old tasks. Then a 187 second delay between deregistering the old task and stopping it. The latter is due to deregistrationDelay being set to 180 seconds. The former though is a very long wait, during which both services are potentially reachable.

Contrast that with the Pulumi API Service, which is built on ECS but does not use the cloud.Service abstraction - where we see this:

2017-12-15 16:43:42 -0800 service has reached a steady state.
2017-12-15 16:43:17 -0800 service has stopped 2 running tasks: task ce1d3329-fc1e-4e73-8532-be4488d9fc6c task a5b4a734-0903-4881-97a8-e62538f10856.
2017-12-15 16:42:41 -0800 service has begun draining connections on 2 tasks.
2017-12-15 16:42:41 -0800 service deregistered 2 targets in target-group apiTG66f890d7
2017-12-15 16:42:17 -0800 service registered 2 targets in target-group apiTG66f890d7
2017-12-15 16:42:06 -0800 service has started 2 tasks: task 948a0462-184f-40c2-8b76-ce0cd085578d task 40ff51f0-e153-4b70-9c10-fd406a338964.
2017-12-15 12:28:44 -0800 service has reached a steady state.

This takes only 96 seconds from update start to update steady state.

Notably, there is always exactly 24 seconds between registering the new targets and deregistering the old targets - which is great. Then there is just over 30 seconds between deregistering the old task and stopping it. The latter is due to deregistrationDelay being set to 30 in the Pulumi API Service scripts, which explains the difference there. But for the former, there does not appear to be any configuration which controls this, and it's unclear why there is such a big difference between these two cases.
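
For reference, the knob behind that difference is the target group's deregistration delay. Here is a minimal sketch of where it (and the health check settings) live when the target group is defined directly with @pulumi/aws rather than through cloud.Service; the names are hypothetical and the argument shapes reflect the provider of this era, so they may have changed since:

```typescript
import * as aws from "@pulumi/aws";

// Hypothetical target group, defined outside the cloud.Service abstraction.
const apiTargetGroup = new aws.elasticloadbalancingv2.TargetGroup("api-tg", {
    port: 80,
    protocol: "HTTP",
    vpcId: "vpc-12345678", // placeholder VPC id
    // How long the load balancer drains a deregistered target before it can
    // be stopped. cloud.Service appears to use 180s (the ~3 minute
    // deregister-to-stop gap above); the Pulumi API Service uses 30s.
    deregistrationDelay: 30,
    // Health check cadence also lives here; see the discussion below about
    // whether it explains the register-to-deregister gap.
    healthCheck: {
        protocol: "HTTP",
        path: "/",
        interval: 10,
        healthyThreshold: 3,
        unhealthyThreshold: 3,
    },
});
```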

My guess is that this is somehow related to waiting for health checks to pass on the new service before deregistering the old service. And perhaps the health checks are taking longer on the PPCD process?

Also, the PPCD process is using an NLB, while the API service is using an ALB, which may be a source of difference here.

lukehoban · Dec 16 '17

Now that we wait for service stable by default, this ~8 minute delay is felt on every update to a cloud.Service. Waiting is more correct, but it hurts inner-loop development quite significantly.

It doesn't feel like this should be necessary, and the pulumi-service example above suggests it isn't, at least for a similar (but slightly different) ECS/[A|N]LB configuration.

I think we need to figure out what's going on here and whether there is any way to tighten this update cycle.
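
For reference, the "wait for service stable" step mentioned above amounts to polling ECS until it reports a steady state; a minimal sketch with the aws-sdk ECS waiter, using hypothetical cluster and service names:

```typescript
import * as AWS from "aws-sdk";

async function waitForServiceStable(): Promise<void> {
    const ecs = new AWS.ECS({ region: "us-west-2" });
    // Polls DescribeServices until the deployment reaches a steady state.
    // With the register/drain delays described above, this is where the
    // ~8 minutes of a cloud.Service update are spent.
    await ecs
        .waitFor("servicesStable", {
            cluster: "pulumi-test-cluster",    // hypothetical
            services: ["pulumi-test-service"], // hypothetical
        })
        .promise();
}
```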

lukehoban · Mar 10 '18

I wondered if this might relate to health checks and healthy intervals.

Here's update events for a cloud.Service-backed ECS service:

[screenshot: ECS service update events]

here's the accompanying graph of "Healthy Hosts" from the target group:

[screenshot: "Healthy Hosts" graph from the target group]

The target group sees four healthy tasks as of 19:19 -0700. Health checks probably aren't the answer.

mmdriley · Mar 20 '18

The two new tasks are created at 19:18:29. By 19:18:36, both tasks have logged Serving PPC API on [::]:8080.

mmdriley · Mar 20 '18

I happened to catch a service deployment while investigating this issue.

Here are the deployment events:

[screenshot: ECS deployment events]

and here's the load balancer target listing as of 13:03:28 PDT:

[screenshot: load balancer target listing]

Meanwhile the "Healthy Hosts" graph in CloudWatch goes to 4 as of 13:01 PDT!

So the gap here is explained by tasks sitting in the "initial" state on the load balancer.
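
For anyone reproducing this, the per-target state is visible via the ELBv2 DescribeTargetHealth API; here is a small sketch with the aws-sdk that prints each target's state (during the long gap, new targets sit in "initial"):

```typescript
import * as AWS from "aws-sdk";

async function dumpTargetStates(targetGroupArn: string): Promise<void> {
    const elbv2 = new AWS.ELBv2({ region: "us-west-2" });
    const resp = await elbv2
        .describeTargetHealth({ TargetGroupArn: targetGroupArn })
        .promise();
    for (const desc of resp.TargetHealthDescriptions || []) {
        // Expect "initial" for newly registered targets, "healthy" once the
        // load balancer starts routing to them, "draining" on the way out.
        console.log(
            desc.Target && desc.Target.Id,
            desc.TargetHealth && desc.TargetHealth.State,
            desc.TargetHealth && desc.TargetHealth.Reason,
        );
    }
}
```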

mmdriley · Mar 20 '18

Unanswered question where someone else sees similar behavior: https://stackoverflow.com/questions/47256085/aws-network-elb-take-4-minutes-to-recognise-target-as-healthy

mmdriley · Mar 20 '18

Filed a support request to ask AWS what causes a target to move from initial to healthy.

mmdriley · Mar 20 '18

Reply from AWS support, emphasis mine:

From the case notes, I understand that the targets behind the NLB are taking approximately 275 seconds to move to healthy state from initial state and you wish to know when the instances transition from initial state to healthy state.

This is a known issue, where a newly registered instance remains in initial state for longer period of time and our internal team is already working on the fix for this issue. Unfortunately, at this point in time we do not have an ETA for the fix.

The instance should move to healthy state once it passes the 3 health checks probes (since you have confirmed healthy threshold as 3). However, when I replicated the issue, I can see that there is a delay in first health check request on the backend.

apologize for the inconvenience the above issue might have caused you and hope that the above information was helpful. If you have any further questions/concerns, feel free to update the case and I will get back to you.
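
For context on those numbers: with a healthy threshold of 3 and a 30-second interval (an assumption; I don't have the exact health check settings in front of me), a target should go healthy roughly 3 × 30 = 90 seconds after probing starts. Most of the ~275 seconds therefore appears to be spent before the first health check even fires, which matches the delayed-first-probe behavior AWS describes.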

mmdriley · Mar 21 '18

Hmm - good to know it's a bug and not by design - but unfortunate that they haven't identified any approach to working around this.

Is the assumption that this is just for NLB-fronted services? I assume if we were using ELB or ALB we wouldn't see this?

lukehoban · Mar 21 '18

I assume if we were using ELB or ALB we wouldn't see this?

That matches my understanding too.

mmdriley · Mar 21 '18

@mmdriley Any action you think we can/should take here?

lukehoban · Apr 17 '18

Now that most pulumi-service deployments skip updating the PPC, we're feeling a lot less pain from this. There are also no obvious fixes. We may want to spend effort here if circumstances change -- if we get feedback from customers or start relying on cloud.Service more day-to-day -- but for now I don't think we want to spend energy here.

mmdriley · Apr 17 '18

Reopening - we still see this constantly, and it leads to very long Service update times, seemingly much longer than strictly necessary.

lukehoban · May 23 '18

Is there anything actionable we plan to do here before ship?

joeduffy · Jun 04 '18

I would love to spend some time looking at what options we have to improve the performance here. This is one of the most significant pain points of working with containers in Pulumi today, due to the very long delay on any updates to the container by default. I believe this is not actually the expected/intended behaviour of the underlying services we are depending on, and that the combination and parameterization we are using just leads to a particularly long delay. If we could cut this in half, it would have a big impact on interactive development of container-based Pulumi programs.

That said, we have no concrete known action here yet, and it may be risky to make any change in the near term. Ideally we would make some progress on understanding options and see if there is anything low-hanging and low-risk we could do.
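
As one example of the kind of low-hanging, low-risk change that could be evaluated: the deregistration delay can be tightened on the existing target group without touching anything else. A sketch with the aws-sdk, using a placeholder ARN and value (this only shrinks the ~3 minute draining portion of the update, not the long "initial" wait on the NLB):

```typescript
import * as AWS from "aws-sdk";

async function tightenDeregistrationDelay(targetGroupArn: string): Promise<void> {
    const elbv2 = new AWS.ELBv2({ region: "us-west-2" });
    // Drop connection draining from ~180s to 30s, matching what the
    // Pulumi API Service already uses.
    await elbv2
        .modifyTargetGroupAttributes({
            TargetGroupArn: targetGroupArn,
            Attributes: [
                { Key: "deregistration_delay.timeout_seconds", Value: "30" },
            ],
        })
        .promise();
}
```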

lukehoban · Jun 04 '18