aws-otel-collector

Fargate/ECS healthcheck

Open pauldoherty-optifly opened this issue 2 years ago • 9 comments

Describe the question
Hi all, I have an issue getting the healthcheck to function with Fargate. I followed the instructions and installed the sidecar, but I cannot get the sidecar healthcheck to report healthy. This means that my service keeps getting killed because ECS thinks the aws-otel-collector sidecar is unhealthy.

Steps to reproduce if your question is related to an action
The service is provisioned with CDK. The sidecar health check is specified as follows:

healthCheck: {
  command: ["CMD-SHELL", "curl -f http://127.0.0.1:13133/ || exit 1"],
  timeout: Duration.seconds(10),
  startPeriod: Duration.seconds(10),
},

What did you expect to see?
The sidecar would be found to be healthy.

Additional context
Looking at the Dockerfile here, it looks like aws-otel-collector is built from scratch and so will not have curl, or even a shell for that matter. How are health checks expected to be configured?

Thanks

pauldoherty-optifly avatar Apr 05 '22 11:04 pauldoherty-optifly

Could you please provide your Collector Config that you used when setting up the ADOT Collector?

bryan-aguilar avatar Apr 05 '22 14:04 bryan-aguilar

Hi,

Thanks for getting back to me. I just used the standard insights config. E.g.

taskDefinition.addContainer("otelContainer", {
  image: ContainerImage.fromRegistry("public.ecr.aws/aws-observability/aws-otel-collector:latest"),
  command: ["--config=/etc/ecs/container-insights/otel-task-metrics-config.yaml"],
  essential: false,
  portMappings: [...],
  healthCheck: {
    command: ["CMD-SHELL", "curl -f http://127.0.0.1:13133/ || exit 1"],
    timeout: Duration.seconds(10),
    startPeriod: Duration.seconds(10),
  },
});

pauldoherty-optifly avatar Apr 05 '22 14:04 pauldoherty-optifly

Currently, I don't have any Collector CDK documentation to point you toward, so this may require some experimenting.

I can set up a similar environment and see what I can discover on my side. Is there any other CDK environment information that would be useful when I build out my own CDK deployment?

bryan-aguilar avatar Apr 05 '22 16:04 bryan-aguilar

What version of CDK are you using?

bryan-aguilar avatar Apr 05 '22 16:04 bryan-aguilar

The latest, v2.17.

I really don't think CDK has anything to do with it, though. Fundamentally, I am unsure how you are supposed to run the healthcheck on the Fargate/ECS sidecar. Given that the healthcheck runs inside the sidecar container and the otel image doesn't have a shell, curl, etc., how can ECS ever consider it healthy?

The only option I believe I have for the healthcheck definition is to use the shell e.g. command: ["CMD-SHELL", ...

Here's a fairly minimal example which should illustrate it: https://github.com/pauldoherty-optifly/fargateOtelExample

pauldoherty-optifly avatar Apr 05 '22 17:04 pauldoherty-optifly
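The problem described in the comment above can be made concrete with a small sketch. ECS health check commands come in two forms: `CMD-SHELL` passes a string to the container's default shell, while `CMD` executes the listed program directly. The helper below is purely illustrative: `requiredExecutables` is a hypothetical name, `/bin/sh` is assumed to be the default shell, and the parsing only looks at the first word of the shell string. It lists what has to exist inside the image before the check can even start, which is exactly what a from-scratch image lacks.

```typescript
// Shape of the `command` array in an ECS container health check.
type HealthCheckCommand = string[];

// Hypothetical helper: returns the executables that must exist inside the
// container image for the given health check command to be runnable.
function requiredExecutables(command: HealthCheckCommand): string[] {
  const [mode, ...args] = command;
  if (mode === "CMD-SHELL") {
    // CMD-SHELL hands the string to the container's default shell
    // (assumed here to be /bin/sh), so the image needs that shell plus
    // the first program named in the string.
    const firstProgram = args.join(" ").trim().split(/\s+/)[0];
    return firstProgram ? ["/bin/sh", firstProgram] : ["/bin/sh"];
  }
  if (mode === "CMD") {
    // CMD executes the first argument directly, with no shell involved.
    return args.slice(0, 1);
  }
  return [];
}

// The health check from this issue needs both a shell and curl.
const shellForm = requiredExecutables([
  "CMD-SHELL",
  "curl -f http://127.0.0.1:13133/ || exit 1",
]);
console.log(shellForm);
```

With the command from this issue, the helper reports a shell and `curl`, neither of which is present in an image built from scratch, so the check fails regardless of whether the collector itself is healthy.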

I could obviously take the aws-otel-collector container image, add to it, and publish it myself, but the documentation makes no reference to having to do that.

pauldoherty-optifly avatar Apr 05 '22 17:04 pauldoherty-optifly

Hi @pauldoherty-optifly,

I am going to bring this to the team and see if we can provide an official recommendation. I will reach back out here when I have more information.

bryan-aguilar avatar Apr 05 '22 21:04 bryan-aguilar

Thanks 👍

pauldoherty-optifly avatar Apr 06 '22 07:04 pauldoherty-optifly

Hi @pauldoherty-optifly ,

We do see the issue here. We are currently working on a solution and have added it to the backlog milestone. I will leave this issue open and ensure it is mentioned when a PR is created with a fix.

bryan-aguilar avatar Apr 08 '22 18:04 bryan-aguilar

We have now added the healthcheck component in the new ADOT Collector release, v0.23.0.

PaurushGarg avatar Nov 02 '22 22:11 PaurushGarg
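For readers landing here later, a minimal sketch of what a shell-free health check definition might look like with v0.23.0 or newer. It assumes the healthcheck component ships as a binary at `/healthcheck` inside the image; that path is not stated in this thread, so verify it against the ADOT documentation before relying on it. The `ContainerHealthCheck` interface and `collectorHealthCheck` function are hypothetical names that mirror the ECS container-definition fields, with durations in seconds. The timeout and start period are copied from the original report.

```typescript
// Mirrors the healthCheck object of an ECS container definition.
// All durations are in seconds.
interface ContainerHealthCheck {
  command: string[];
  interval?: number;
  timeout?: number;
  retries?: number;
  startPeriod?: number;
}

// ASSUMPTION: location of the healthcheck binary in the collector image
// (v0.23.0+). Not confirmed in this thread; check the ADOT docs.
const HEALTHCHECK_BINARY = "/healthcheck";

// Builds an exec-form health check, which needs no shell or curl in the image.
function collectorHealthCheck(
  binaryPath: string = HEALTHCHECK_BINARY,
): ContainerHealthCheck {
  return {
    // "CMD" runs the binary directly instead of going through a shell.
    command: ["CMD", binaryPath],
    timeout: 10,
    startPeriod: 10,
  };
}

console.log(JSON.stringify(collectorHealthCheck(), null, 2));
```

In a CDK stack, the same values would go into the `healthCheck` property of `addContainer`, with `Duration.seconds(10)` in place of the plain numbers. The collector configuration presumably still needs its health check endpoint enabled (port 13133 in the original report) for the check to pass.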

Closing this issue, as the PR for it has been merged and is part of collector v0.23.0.

PaurushGarg avatar Nov 03 '22 05:11 PaurushGarg