
ECS Service Discovery not respecting TTL when updating service

Open matthewduren opened this issue 7 years ago • 43 comments

Summary

When updating a service or otherwise scaling out ECS tasks for a service that uses Service Discovery, tasks are being stopped before reaching the TTL of the service discovery record(s).

Description

When updating a service or otherwise scaling out ECS tasks for a service that uses Service Discovery, tasks are being stopped before reaching the TTL of the service discovery record(s).

To reproduce: create an ECS service from a simple "hello world" task definition that runs forever and does nothing. Set minimum healthy percent to 100, maximum percent to 200, and desired count to 1. Set up service discovery and create a DNS record with a long TTL, say 300s. Update the service to use a new revision of the task definition (no changes to the task definition are needed), and note that the old tasks are stopped before the TTL has elapsed.
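For reference, a minimal boto3 sketch of that setup (names, IDs, and the namespace are placeholders, and network configuration is omitted for brevity):

```python
# Sketch of the reproduction setup described above; all names, IDs, and ARNs
# are placeholders. A records require awsvpc network mode, so a real service
# would also need a networkConfiguration.
import boto3

sd = boto3.client("servicediscovery")
ecs = boto3.client("ecs")

# Cloud Map service with a deliberately long DNS TTL (300s).
sd_service = sd.create_service(
    Name="hello-world",
    NamespaceId="ns-XXXXXXXXXXXX",  # private DNS namespace
    DnsConfig={
        "RoutingPolicy": "MULTIVALUE",
        "DnsRecords": [{"Type": "A", "TTL": 300}],
    },
)

# ECS service: desired count 1, min healthy 100%, max 200%.
ecs.create_service(
    cluster="my-cluster",
    serviceName="hello-world",
    taskDefinition="hello-world:1",
    desiredCount=1,
    deploymentConfiguration={"maximumPercent": 200, "minimumHealthyPercent": 100},
    serviceRegistries=[{"registryArn": sd_service["Service"]["Arn"]}],
)

# Register a new task definition revision (or force a new deployment) and
# watch the old task stop well before the 300s TTL has elapsed.
ecs.update_service(cluster="my-cluster", service="hello-world", forceNewDeployment=True)
```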

Expected Behavior

The ECS agent should remove the Route 53 record(s), then wait for the TTL duration to elapse before stopping the tasks.
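In code form, the desired drain sequence looks roughly like the sketch below (a boto3 approximation; the service ID, cluster name, and TTL are placeholders):

```python
# Rough approximation of the expected drain sequence: pull the Cloud Map
# record first, wait out the DNS TTL, then stop the task. IDs and names
# are placeholders, not values from the original report.
import time
import boto3

sd = boto3.client("servicediscovery")
ecs = boto3.client("ecs")

SD_SERVICE_ID = "srv-XXXXXXXXXXXX"
TTL_SECONDS = 300

def drain_and_stop(task_arn: str, instance_id: str, cluster: str = "my-cluster") -> None:
    # 1. Remove the DNS record so no new clients resolve this task.
    sd.deregister_instance(ServiceId=SD_SERVICE_ID, InstanceId=instance_id)
    # 2. Wait for cached answers to expire before taking the task away.
    time.sleep(TTL_SECONDS)
    # 3. Stop the task; ECS then sends SIGTERM, and SIGKILL after stopTimeout.
    ecs.stop_task(cluster=cluster, task=task_arn, reason="drained after DNS TTL")
```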

Observed Behavior

ECS Agent does not wait any additional time when stopping tasks for services that use service discovery.

Environment Details

Supporting Log Snippets

matthewduren avatar Oct 04 '18 18:10 matthewduren

The current behavior is outrageous; it is hard to believe such a flaw exists in an AWS service. It forces us to use an ELB/ALB, adding unnecessary cost on top of the performance impact.

himberjack avatar Nov 03 '18 18:11 himberjack

There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, with more than just the TTL as grace time, since requests can still arrive right at the end of the TTL and leave no time to process them.

AndrewLugg avatar Nov 19 '18 21:11 AndrewLugg

Right, when a task starts draining, the Route 53 record should be removed immediately. After the TTL has elapsed, the normal SIGTERM signal should be sent to the container, followed by SIGKILL 30 seconds later if the task is still up, just like tasks that don't use service discovery behave.

matthewduren avatar Nov 19 '18 22:11 matthewduren

Was hoping to avoid using an LB and this came to mind; sad to see it's an unresolved issue :(

melbourne2991 avatar Mar 21 '19 19:03 melbourne2991

I'm hitting the same issue.

I am building infrastructure for gRPC services using ECS Fargate and its service discovery feature, without an ELB. Communication between services goes through Envoy proxies, and each Envoy listener is resolved via ECS service discovery.

I get some gRPC "unavailable" errors while updating the service. The Envoy forwarding a request to another Envoy can lose all of its upstream connections, because for a moment it only knows the old containers' IP addresses, which are already dead from SIGTERM. As a workaround I configured the DNS TTL to a very short value such as 3s, but I still get errors for a short period of time (about 10 seconds).
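Until this is fixed on the ECS side, one way to absorb the remaining window on the client side is a gRPC retry policy for UNAVAILABLE; a minimal sketch with grpcio (the endpoint and retry values are placeholders):

```python
# Client-side mitigation sketch: retry UNAVAILABLE responses for a few
# attempts so a briefly stale DNS answer doesn't surface as an error.
# Endpoint and retry values are placeholders.
import json
import grpc

service_config = {
    "methodConfig": [{
        "name": [{}],  # an empty name entry applies this policy to all methods
        "retryPolicy": {
            "maxAttempts": 5,
            "initialBackoff": "0.2s",
            "maxBackoff": "2s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
}

channel = grpc.insecure_channel(
    "my-service.my-namespace.local:50051",
    options=[
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(service_config)),
    ],
)
```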

I hope that the issue will be resolved.

nikushi avatar Mar 29 '19 11:03 nikushi

I hope that the issue will be resolved too :(

kuongknight avatar Apr 15 '19 15:04 kuongknight

+1.

afawaz2 avatar Apr 17 '19 19:04 afawaz2

FYI: if you set minimum healthy to 0% and maximum to 100%, i.e. stop everything before starting new instances, your service is unreachable for several minutes due to negative DNS caching. I've been experimenting with a service that I really only ever want one instance of running; this is what a restart looks like:

Event               Time in seconds since stop
task stop           0
task start          22
dns gone            26
service listening   35
ecs ready           82
dns back            270

For roughly 4 minutes the service is ready to accept connections, but DNS returns NXDOMAIN. So don't try to use Service Discovery for this purpose. Also note that the VPC DNS resolver does not adhere to the 24h TTL set in the SOA record for the service discovery DNS zone, but you cannot change that TTL anyway, so I guess we should be happy about that and that the service is not unreachable for 24h.
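For anyone who wants to reproduce these timings, a simple resolution loop is enough; a sketch using only the Python standard library (the hostname is a placeholder):

```python
# Timestamp when the service discovery name stops and starts resolving
# during a restart. The hostname is a placeholder.
import socket
import time

NAME = "my-service.my-namespace.local"
start = time.time()
last_state = None

while True:
    try:
        socket.getaddrinfo(NAME, 80)
        state = "resolves"
    except socket.gaierror:
        state = "no answer (NXDOMAIN)"
    if state != last_state:
        print(f"{time.time() - start:6.0f}s  {state}")
        last_state = state
    time.sleep(1)
```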

Thought I'd mention this caveat here since this is where I ended up while researching SD TTLs.

holstvoogd avatar Jul 26 '19 13:07 holstvoogd

We are facing the same issue; any updates on this?

thanks 😄

victor-paddle avatar Apr 28 '20 14:04 victor-paddle

We are facing the same issue; any updates on this?

thanks 😄

joeke80215 avatar May 29 '20 04:05 joeke80215

The issue still exists.

2mositalebi avatar Sep 08 '20 09:09 2mositalebi

There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, with more than just the TTL as grace time, since requests can still arrive right at the end of the TTL and leave no time to process them.

In addition to this, the instance's health status should change to "unhealthy" so that API-based calls do not see the instance as healthy, similar to the "deregistration delay" in target groups (see the sketch after the list below). Also discussed here: https://github.com/aws/containers-roadmap/issues/473

Related:

  • https://github.com/aws/containers-roadmap/issues/1039
  • Allow configuring envoy connection draining: https://github.com/aws/aws-app-mesh-roadmap/issues/252
  • Current vs desired behaviour: https://github.com/spinnaker/spinnaker/issues/5542#issuecomment-595732111
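A hedged sketch of that idea using the Cloud Map API: mark the instance UNHEALTHY before draining so API-based discovery stops returning it (this only applies to services created with a custom health check; the IDs are placeholders):

```python
# Flip the Cloud Map custom health status to UNHEALTHY before draining so
# discover-instances stops returning the task. Only valid for services
# created with HealthCheckCustomConfig; IDs are placeholders.
import boto3

sd = boto3.client("servicediscovery")

sd.update_instance_custom_health_status(
    ServiceId="srv-XXXXXXXXXXXX",
    InstanceId="task-or-instance-id",
    Status="UNHEALTHY",
)
```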

awsiv avatar Sep 30 '20 15:09 awsiv

Having the same issue with our gRPC server running as an ECS service with service discovery. We are also using Spot instances for the service. The gRPC clients cannot call the gRPC server when there is a Spot instance interruption, even though ECS has spawned a new task before the current task stopped.

Hope this issue will be fixed soon.

rilutham avatar Oct 08 '20 16:10 rilutham

We are facing the same issue :(

kuongknight avatar Oct 12 '20 03:10 kuongknight

We are facing the same issue :)

hgsgtk avatar Nov 16 '20 02:11 hgsgtk

Reading through the linked issue, that bug is related to not respecting TTLs. The bug we fixed in ECS was an ordering issue where some tasks may be stopped before new tasks are actively visible in DNS.

https://github.com/aws/aws-app-mesh-roadmap/issues/151

hgsgtk avatar Nov 16 '20 02:11 hgsgtk

We are facing the same issue. I created a tool that lets us graph the behavior. Basically, I've seen the HTTP 503 errors show up AFTER ECS is done deploying new tasks and after the old tasks are shut down. The Y-axis in the graph below is the HTTP status code. Ignore the fact that my service was returning a 403; I wasn't providing a token, but that is unrelated to this point.

[graph: HTTP status codes observed during and after the deployment]

CraigHead avatar Nov 16 '20 21:11 CraigHead

I noticed that during an ECS Fargate deployment, service discovery will return an empty array for a short period of time, e.g. aws servicediscovery discover-instances --namespace-name my-namespace --service-name MyService

{
    "Instances": []
}
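A small sketch of how a caller could ride out that empty window instead of treating it as "no backends" (boto3; namespace and service names are placeholders):

```python
# Retry discover-instances briefly when it returns an empty list during a
# deployment. Namespace and service names are placeholders.
import time
import boto3

sd = boto3.client("servicediscovery")

def discover_with_retry(retries: int = 10, delay: float = 2.0):
    for _ in range(retries):
        resp = sd.discover_instances(
            NamespaceName="my-namespace", ServiceName="MyService"
        )
        if resp["Instances"]:
            return resp["Instances"]
        time.sleep(delay)  # deployment likely in progress; try again shortly
    return []
```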

CraigHead avatar Nov 16 '20 21:11 CraigHead

@CraigHead can you provide details on your service's configuration, specifically:

  1. desiredCount
  2. maximumPercent
  3. minimumHealthyPercent

I can confirm that when desiredCount=1, maximumPercent=100, minimumHealthyPercent=0, there will be 5xx errors, as expected. Note that this is not a recommended configuration when working with ECS service discovery.

kiranmeduri avatar Sep 07 '21 21:09 kiranmeduri

@kiranmeduri I spoke with my AWS SA and at his request I opened a support case in May. Without going into exhaustive detail, a flaw was found in the integration between CloudMap and API Gateway for ECS service resolution during deployments. That fix was deployed last week and I confirmed HTTP 5xx errors are no longer happening for about 40 seconds AFTER a deployment occurs and ECS stabilizes.

CraigHead avatar Sep 07 '21 22:09 CraigHead

This is still a problem (on the EC2 launch type at least; I haven't tested Fargate). Is there any progress? As it stands, it makes this feature unviable for production, which is a real shame 😢

marc-costello avatar Sep 23 '21 08:09 marc-costello

The same problem exists on Fargate as well.

pgeler avatar Nov 01 '21 16:11 pgeler

Any update on this? I have been dealing with this issue on Fargate for years.

Why doesn't ECS service discovery kill the DNS records for draining instances?

false-vacuum avatar Feb 17 '22 16:02 false-vacuum

Any update on this? I have been dealing with this issue on Fargate for years.

Why doesn't ECS service discovery kill the DNS records for draining instances?

I can only reply to this question with the very first comment on this issue.

The current behavior is outrageous; it is hard to believe such a flaw exists in an AWS service. It forces us to use an ELB/ALB, adding unnecessary cost on top of the performance impact.

I've been dealing with this issue on Fargate for a few months and just now discovered that it is an old problem 😠

pablodiegoss avatar Feb 21 '22 19:02 pablodiegoss

Is there any chance this can be fixed? It's really the only thing stopping us from using service discovery with ECS.

I assume EKS does its own service discovery, which is why this is so low priority.

chrisburrell avatar Mar 12 '22 21:03 chrisburrell

Still facing this issue. We have to move away from service discovery as we can't cycle our instances without errors.

donaltuohy avatar May 18 '22 08:05 donaltuohy

not fixed yet :(

kocou-yTko avatar Aug 14 '22 22:08 kocou-yTko

This is a blocker for us to continue using service discovery on ECS.

will3942 avatar Oct 04 '22 10:10 will3942

Is this still an issue? I was planning on using this feature - guess I'll need to go down the ALB route.

chaudharydeepak avatar Oct 04 '22 21:10 chaudharydeepak

The issue was opened in 2018 😢 hopefully a resolution will follow soon... Because of this we also opted for the ECS + private ALB combo.

KlemenKozelj avatar Jan 06 '23 20:01 KlemenKozelj