
[FARGATE][ECS] [request]: Stop Truncating the output for Task Failures

Open dsalamancaMS opened this issue 3 years ago • 47 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request What do you want us to build?

Which service(s) is this request for? This could be Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Currently, tasks that fail with long error reasons get their output truncated, which limits debugging of failures.

Example:

CannotPullContainerError: containerd: pull command failed: time="2020-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: context ca...

The output is truncated after "context ca", which we assume is "context canceled".

There are more examples, but they are not at hand right now.

Are you currently working around this issue? It is not possible to work around.

Additional context no context

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

dsalamancaMS avatar Oct 30 '20 22:10 dsalamancaMS

Adding a note to say that in some contexts this can truncate literally the most important parts of an error. For example:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 1 time(s): failed to fetch secret arn:aws:secretsmanager:us-east-1:xxxxxxxxxxx...

maxgoldberg avatar Mar 16 '21 18:03 maxgoldberg

Something similar happened to me recently:

CannotPullContainerError: containerd: pull command failed: time="2021-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: write /var...

These truncated messages are very annoying since the most useful part of the reason is not there. AWS support is also not able to figure out the reason. I am still looking for the reason why my tasks keep getting STOPPED.

kunalsawhney avatar Mar 22 '21 10:03 kunalsawhney

Any update on this ticket? We are also running into this. It makes it very hard to debug operational issues in Fargate.

cmwilhelm avatar Apr 09 '21 23:04 cmwilhelm

This is very annoying; I have like 20 secrets and I don't know which one failed to be pulled.

Sytten avatar Apr 15 '21 15:04 Sytten

+1. Running across this very same issue, and it's frustrating not being able to tell what just caused my task to fail.

aitorres avatar Apr 20 '21 17:04 aitorres

Any update on this ticket? Running into this same issue where the stopped reason is truncated, which makes it very difficult to investigate the exact cause of the error that made the task stop...

singsonn avatar May 11 '21 09:05 singsonn

Running into this same issue... It was temporarily solved after I deleted all the images in ECR and recreated one yesterday, but it is happening again today. This is so frustrating...

yasunaga-shuto avatar May 12 '21 01:05 yasunaga-shuto

In case this helps anyone else, in my case I knew my container had outbound internet access (so the issue wasn't subnet/ IGW related) which meant it could only be something wrong with the IAM policy I'd used for my task.

I double-checked my secret ARN and realised that AWS had helpfully appended a '-aBcD' string onto the end of my secret name (I'd assumed that the ARN would just end with the secret name I'd specified...), so I updated my policy and it's working fine.
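
In case it helps anyone who hits the same thing: Secrets Manager appends a short random suffix to the secret name in the ARN, so a policy that ends with the bare secret name won't match. A minimal boto3 sketch (the secret name here is a hypothetical placeholder) to print the full ARN you actually need to grant:

import boto3

# Hypothetical secret name; replace with your own.
SECRET_NAME = "my-app/db-password"

sm = boto3.client("secretsmanager")

# describe_secret returns the full ARN, including the random suffix
# (e.g. "-aBcDeF") that Secrets Manager appends to the secret name.
resp = sm.describe_secret(SecretId=SECRET_NAME)
print(resp["ARN"])  # grant this exact ARN (or add a trailing wildcard) in the execution role policy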

a-foster avatar May 25 '21 09:05 a-foster

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): InvalidParameterException: Invalid parameter at 'registryIds' fail...

What does it mean? Can somebody please advise? I have been facing this issue for the last 3 days. Before this, it was working fine in all 13 clusters with the same configuration. Then I tried creating one more cluster with Fargate 1.4.0 and this issue came up; now 8 clusters are showing this error. I have tried everything on the internet but the issue remains. Any leads, please...

Prateek-Tyagi avatar May 28 '21 05:05 Prateek-Tyagi

Something similar happened to me recently:

CannotPullContainerError: containerd: pull command failed: time="2021-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: write /var...

These truncated messages are very annoying since the most useful part of the reason is not there. AWS support is also not able to figure out the reason. I am still looking for the reason why my tasks keep getting STOPPED.

Hi Kunal, were you able to resolve this error? I am also facing the same issue and am clueless about how to resolve it.

carthic1 avatar Jul 16 '21 13:07 carthic1

Hi @carthic1 @kunalsawhney, I faced the same issue:

Something similar happened to me recently:

CannotPullContainerError: containerd: pull command failed: time="2021-XX-XXTXX:XX:XXZ" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:XXXXXXXX: write /var...

These truncated messages are very annoying since the most useful part of the reason is not there. AWS support is also not able to figure out the reason. I am still looking for the reason why my tasks keep getting STOPPED.

In my case, the main reason was the size of the Docker image that my Fargate task was trying to pull.

My image is really big (21+ GB) and the default storage limit for a Fargate task is 20 GB. Looking through the AWS docs I found the EphemeralStorage parameter of an ECS Task Definition; adding a considerably larger size solved the issue:

EphemeralStorage:
        SizeInGiB: 30

I hope this helps you

References: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-ephemeralstorage.html https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-task-storage.html

marcoaal avatar Jul 22 '21 11:07 marcoaal

@carthic1 the issue is due to a lack of storage in the ECS task. AWS has launched a new feature that lets you attach ephemeral storage of up to 200 GB to your tasks. You can use this capability to increase your task storage.

kunalsawhney avatar Jul 22 '21 12:07 kunalsawhney

@carthic1 it was a storage problem for me too, and I solved it as @marcoaal and @kunalsawhney mentioned. If you are using AWS Copilot, just add these two lines to the manifest.yml file for the task:

storage:
  ephemeral: 35

The 35 can be increased up to 200 GiB of ephemeral storage.

mohamedFaris47 avatar Aug 01 '21 14:08 mohamedFaris47

Is there anywhere in all of AWS that doesn't truncate this message? Can I run a CLI command, for instance, that will pull the full message? It's really an absurdly short character limit for the purpose.

dezren39 avatar Aug 11 '21 19:08 dezren39

@dezren39 no, it's truncated everywhere. it's pretty absurd
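
For reference, this is the API call that surfaces the reason (a minimal boto3 sketch with hypothetical cluster and task IDs); the stoppedReason and per-container reason fields it returns are the same truncated strings the console shows:

import boto3

ecs = boto3.client("ecs")

# Hypothetical cluster name and task ID.
resp = ecs.describe_tasks(cluster="my-cluster", tasks=["0ba9a209db2848ejafhh17567haj16"])

for task in resp["tasks"]:
    print(task.get("stoppedReason"))        # task-level reason (truncated)
    for container in task.get("containers", []):
        print(container.get("reason"))      # container-level reason (also truncated)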

gshpychka avatar Aug 11 '21 19:08 gshpychka

Hi,

I am also getting this error: CannotPullContainerError: containerd: pull command failed: time="2021-08-20T13:58:49Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:e246d4b4c5f108af0f72da900f45ae9a37e1d184d8d605ab4117293b6990b7b8: write /var... Network

I am trying to add ephemeral storage using console in task definition by clicking on "Configure via json", and adding the below lines: "ephemeralStorage": { "sizeInGiB": "25" }

but now getting the error: Should only contain "family", "containerDefinitions", "volumes", "taskRoleArn", "networkMode", "requiresCompatibilities", "cpu", "memory", "inferenceAccelerators", "executionRoleArn", "pidMode", "ipcMode", "proxyConfiguration", "tags", "placementConstraints"

I can't add extra storage now. Can anyone help me here?

adesgautam avatar Aug 20 '21 14:08 adesgautam

That's out of scope for this issue.

gshpychka avatar Aug 20 '21 15:08 gshpychka

Hi @marcoaal @kunalsawhney @mohamedFaris47, how were you able to resolve the issue? Can you please help me resolve the issue below?

Hi,

I am also getting this error: CannotPullContainerError: containerd: pull command failed: time="2021-08-20T13:58:49Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:e246d4b4c5f108af0f72da900f45ae9a37e1d184d8d605ab4117293b6990b7b8: write /var... Network

I am trying to add ephemeral storage using console in task definition by clicking on "Configure via json", and adding the below lines: "ephemeralStorage": { "sizeInGiB": "25" }

but now getting the error: Should only contain "family", "containerDefinitions", "volumes", "taskRoleArn", "networkMode", "requiresCompatibilities", "cpu", "memory", "inferenceAccelerators", "executionRoleArn", "pidMode", "ipcMode", "proxyConfiguration", "tags", "placementConstraints"

I can't add extra storage now. Can anyone help me here?

adesgautam avatar Aug 23 '21 05:08 adesgautam

Hi @adesgautam,

The extra ephemeral storage for Fargate tasks cannot be configured through the console.

You can refer to this doc for details on which options are supported: https://aws.amazon.com/about-aws/whats-new/2021/04/amazon-ecs-aws-fargate-configure-size-ephemeral-storage-tasks/

It clearly states that you can use any of "AWS Copilot CLI, CloudFormation, AWS SDK, and AWS CLI".
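
For the SDK route, a minimal boto3 sketch that registers a Fargate task definition with extra ephemeral storage (the family, image, role ARN, and sizes below are all placeholders):

import boto3

ecs = boto3.client("ecs")

# All names and ARNs below are placeholders; adjust to your own task definition.
ecs.register_task_definition(
    family="my-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
            "essential": True,
        }
    ],
    # This is the field the console's "Configure via JSON" editor rejects;
    # it is supported here and via CloudFormation, Copilot, and the AWS CLI.
    ephemeralStorage={"sizeInGiB": 25},
)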

kunalsawhney avatar Aug 23 '21 05:08 kunalsawhney

@adesgautam please don't spam this issue with off-topic questions.

gshpychka avatar Aug 26 '21 12:08 gshpychka

This gave me an "amazing" user experience today, what a shame :)

slavafomin avatar Sep 21 '21 18:09 slavafomin

Any update on this, guys?

If the error message were shown in full, we could save a lot of time building more things with AWS instead of going round and round hunting for the exact error.

techministrator avatar Oct 06 '21 04:10 techministrator

This issue was registered on my 60th birthday. It is now 1 and I am 61.

chadnash avatar Nov 11 '21 00:11 chadnash

This is a very frustrating issue.

sashoalm on StackOverflow suggested that the full error message might be found in CloudTrail, but that didn't work for me: https://stackoverflow.com/questions/66919512/stoppedreason-in-ecs-fargate-is-truncated

A fix or a workaround would be great

RichardBradley avatar Nov 16 '21 11:11 RichardBradley

I've also come across this issue several times. The most recent time (today) resulted in significant troubleshooting of IAM roles and SSM secrets, because the first half of the error was about retrieving a 'secretsmanager' ARN. However, after a few hours of troubleshooting and eventually going to AWS Support, the actual issue turned out to be a networking issue because the IGW was offline.

Once the last portion of the error message was found by the engineer, I saw the context timeout error message and knew exactly what it was. Please fix this; it is very frustrating.

Pettles avatar Dec 07 '21 03:12 Pettles

This issue still exists. It would be nice to get some traction on it; it is a small thing to fix, but it would really help with troubleshooting.

rmontgomery2018 avatar Dec 15 '21 16:12 rmontgomery2018

This is a very frustrating issue.

sashoalm on StackOverflow suggested that the full error message might be found in CloudTrail, but that didn't work for me: https://stackoverflow.com/questions/66919512/stoppedreason-in-ecs-fargate-is-truncated

A fix or a workaround would be great

Thank you. This is working for me :)

  • Open AWS CloudTrail -> Event history
  • Select "user name" and search for the ID of the failed task (something like "0ba9a209db2848ejafhh17567haj16")

Then I found the API call made by ECS to SSM.

"errorMessage": "User: arn:aws:sts::<accountId>:assumed-role/ecsTaskExecutionRole/0ba9a209db2848ejafhh17567haj16 is not authorized to perform: ssm:GetParameters on resource: arn:aws:ssm:eu-central-1:<accountId>:parameter//sorry-cypress/minio_pw because no identity-based policy allows the ssm:GetParameters action",

My problem was that I used /${aws_ssm_parameter.sorry_cypress_mongo.name} in Terraform; because aws_ssm_parameter.sorry_cypress_mongo.name already starts with "/", I ended up with "//" :)
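
If you'd rather script that CloudTrail lookup, a minimal boto3 sketch (the task ID is a placeholder; the execution-role session that pulls secrets uses the task ID as its session name, which is why searching it as the user name works):

import boto3

cloudtrail = boto3.client("cloudtrail")

# Placeholder: the ID of the stopped task, as shown in the ECS console.
TASK_ID = "0ba9a209db2848ejafhh17567haj16"

resp = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "Username", "AttributeValue": TASK_ID}]
)

for event in resp["Events"]:
    # CloudTrailEvent is a JSON string; the un-truncated failure usually
    # lives in its "errorMessage" field.
    print(event["EventName"], event["CloudTrailEvent"])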

pitthecat avatar Jan 05 '22 14:01 pitthecat

In my case, this was due to a couple of missing permissions for the ECR pull-through cache. I ended up with a policy like this on my ECS task's execution role:

{
    "Statement": [
        {
            "Action": [
                "ecr:CreateRepository",
                "ecr:BatchImportUpstreamImage"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:ecr:us-east-1:xxx:repository/ecr-public/xray/*",
                "arn:aws:ecr:us-east-1:xxx:repository/ecr-public/cloudwatch-agent/*"
            ],
            "Sid": ""
        }
    ],
    "Version": "2012-10-17"
}

acdha avatar Feb 11 '22 19:02 acdha

@acdha this is off-topic for the issue. The error can literally be dozens of different things at least, no need to post them here. The issue is about ECS truncating the error message regardless of the error or its reason.

Keep in mind that every time you post here, you're sending an email to 25 people.

gshpychka avatar Feb 11 '22 20:02 gshpychka

@acdha this is off-topic for the issue. The error can literally be dozens of different things at least, no need to post them here. The issue is about ECS truncating the error message regardless of the error or its reason.

Keep in mind that every time you post here, you're sending an email to 25 people.

If you look at the history, note that other people have been sharing non-obvious causes. The fix will be when the error messages aren’t truncated, which is why I also raised it with our TAM, but in the meantime people often benefit from suggestions for additional points to review after they’ve exhausted the most obvious options.

acdha avatar Feb 11 '22 21:02 acdha