
Apprunner hangs on long running requests with error message "upstream connect error or disconnect/reset before headers. reset reason: connection termination"

Open avivio opened this issue 2 years ago • 33 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
What do you want us to build? When running an app with a simple frontend but a long-running backend, the web client will suddenly hang with the error message "upstream connect error or disconnect/reset before headers. reset reason: connection termination". This doesn't seem to affect the app itself, since the CloudWatch logs keep behaving as if the request is being processed. This is probably a simple timeout configuration in the load balancer or the API gateway (if there is one). I was wondering if you could add the option to configure this timeout, or at least provide visibility into what the configuration is.

Describe alternatives you've considered
Use ECS, where you can configure these parameters directly.
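For comparison, with ECS behind an Application Load Balancer the relevant knob is the ALB idle timeout. A minimal boto3 sketch, with the load balancer ARN and the timeout value as placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Raise the ALB idle timeout (default is 60 seconds) so long-running
# requests aren't reset before the backend responds.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:REGION:ACCOUNT:loadbalancer/app/NAME/ID",  # placeholder
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "300"}],
)
```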

Additional context
Anything else we should know?

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

avivio avatar Nov 29 '21 16:11 avivio

I was able to fix this issue by pausing and resuming the service.
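In case it helps anyone, a rough sketch of scripting that workaround with boto3 (the service ARN is a placeholder):

```python
import time
import boto3

apprunner = boto3.client("apprunner")
service_arn = "arn:aws:apprunner:REGION:ACCOUNT:service/NAME/ID"  # placeholder

# Pause the service, wait until it reaches PAUSED, then resume it.
apprunner.pause_service(ServiceArn=service_arn)
while apprunner.describe_service(ServiceArn=service_arn)["Service"]["Status"] != "PAUSED":
    time.sleep(15)
apprunner.resume_service(ServiceArn=service_arn)
```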

jparksecurity avatar Jan 12 '22 18:01 jparksecurity

I'm already having this issue again today. Does anyone know who we can tag here to get some attention from AWS?

jparksecurity avatar Jan 21 '22 15:01 jparksecurity

Tag #104

jparksecurity avatar Jan 26 '22 18:01 jparksecurity

Happening to me right now. After pausing and resuming, still no fix. :/

leocorelli avatar Mar 02 '22 01:03 leocorelli

Same issue. No clear indication of what is going on anywhere. Is this a resource issue? Is there a problem with App Runner? None of this happens anywhere else we're running these containers, so what's going on?

Mugane avatar Apr 10 '22 08:04 Mugane

I am experiencing the same issue! Even if I reduce the processing time, the 503 gets hit.

eeshan-dx avatar May 24 '22 11:05 eeshan-dx

I have contacted AWS support about this, but so far the issue has been "Work in progress" for 8 days. I hope they get back to me about this soon.

francoisvdv avatar May 24 '22 12:05 francoisvdv

I have contacted AWS support about this, but so far the issue has been "Work in progress" for 8 days. I hope they get back to me about this soon.

@francoisvdv May I know if you've got any replies back from the AWS Support regarding this issue?

eeshan-dx avatar May 30 '22 11:05 eeshan-dx

Sadly only a 'we are working on it and we have escalated it', but no solution or anything...

francoisvdv avatar May 30 '22 11:05 francoisvdv

@francoisvdv still nothing?

fracampit avatar Aug 08 '22 10:08 fracampit

After various back-and-forths the conclusion of AWS support was that it was a problem in the application. We did not agree with that conclusion and instead migrated away from App Runner to ECS. So sadly no solution other than not using App Runner.

francoisvdv avatar Aug 08 '22 15:08 francoisvdv

Had this same issue about a year ago when testing out AWS AppRunner, and now I'm experiencing it again, not quite as often as a year ago, but still 😞 ... Will have to move back to ECS Fargate again.

Below is the reply I got from AWS Support on July 3, 2021.

Hello,

After further investigation from the service team:

App Runner uses Fargate tasks in the backend to spin up the application instances. When the application is not receiving any requests, Fargate automatically reduces the CPU allocated to the task (Idle State). Once there are new active requests, task CPU allocation increases to be able to respond to incoming requests (Active State).

The issue is related to the Fargate task not getting CPU allocated even after receiving new requests.

Backend Fargate tasks are put to sleep since they are not receiving any active requests for an extended period of time. New incoming requests might face network timeout issues leading to 503s, since the Fargate task is not able to re-allocate CPU for serving new incoming requests.

Unfortunately, there is no way to mitigate this issue at that point.

The internal service team are working to find a solution. However, it may take some time.

You can track any new release information at either of the following locations[1][2].

References: [1] https://aws.amazon.com/new/ [2] https://github.com/aws/apprunner-roadmap/issues

The active auto scaling policy for the AppRunner service (screenshot from the AWS UI): [screenshot, 2022-08-15]

I think this scaling policy ☝️ with a Minimum size configured, together with the AWS Support response

Backend Fargate tasks are put to sleep since they are not receiving any active requests

is contradictory/misleading (if true), as I would expect to have 5 instances ready/active at all times (with CPU) to handle incoming requests.
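For reference, this is roughly how such a policy is expressed through the App Runner API; a minimal boto3 sketch (the configuration name and sizes are examples, the service ARN is a placeholder):

```python
import boto3

apprunner = boto3.client("apprunner")

# Auto scaling configuration with a non-zero minimum size,
# i.e. instances that should stay provisioned at all times.
config = apprunner.create_auto_scaling_configuration(
    AutoScalingConfigurationName="keep-warm",  # example name
    MinSize=5,
    MaxSize=25,
    MaxConcurrency=100,
)

# Attach the configuration to the service.
apprunner.update_service(
    ServiceArn="arn:aws:apprunner:REGION:ACCOUNT:service/NAME/ID",  # placeholder
    AutoScalingConfigurationArn=config["AutoScalingConfiguration"]["AutoScalingConfigurationArn"],
)
```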

mikaelcabot avatar Aug 15 '22 09:08 mikaelcabot

Still getting this. When is a solution expected?? This is rendering AppRunner completely unusable. The whole point is to abstract scaling, but then the thing is incapable of scaling altogether?! What is the point? Why did you even launch this service?

Mugane avatar Aug 29 '22 18:08 Mugane

Use an nginx proxy and it will solve the issue; mine was solved by using nginx.

SJANAKIVENKATA avatar Nov 18 '22 09:11 SJANAKIVENKATA

Sorry for no response here for a long time on this issue. I would like to help on this. @Mugane @mikaelcabot @francoisvdv Is it possible that you can share service ARN for an affected service?

amitgupta85 avatar Nov 18 '22 10:11 amitgupta85

Sorry for no response here for a long time on this issue. I would like to help on this. @Mugane @mikaelcabot @francoisvdv Is it possible that you can share service ARN for an affected service?

We moved away from AppRunner because of this issue so we no longer have ARNs available.

francoisvdv avatar Nov 18 '22 10:11 francoisvdv

Yes, but I'm not at my workstation, I'll update tomorrow

Mugane avatar Nov 19 '22 03:11 Mugane

I am experiencing the same issue when running a NextJS app on AppRunner. The issue happens when I run Google PageSpeed Insights against the app. It is a shame, really, because it turns out that AppRunner is not scaling well enough and there is nothing I can do as a user. No matter what scaling policy I use or what vCPU/RAM configuration, the app crashes from a simple Google PageSpeed test...

mstoyanovv avatar May 16 '23 22:05 mstoyanovv

Hi @mstoyanovv, try using nginx as a proxy; it will resolve the issue.

SJANAKIVENKATA avatar May 17 '23 04:05 SJANAKIVENKATA

hi @SJANAKIVENKATA, how did you use nginx with AppRunner?

mstoyanovv avatar May 17 '23 07:05 mstoyanovv

Hi @mstoyanovv, just use it as a proxy and for static files only; no need to configure a certificate, because App Runner will provide HTTPS.

SJANAKIVENKATA avatar May 17 '23 07:05 SJANAKIVENKATA

Hi @mstoyanovv, try using nginx as a proxy; it will resolve the issue.

How would this possibly make any difference? Wouldn't it just offload the error from the initial request to the internal proxy request? That doesn't solve apprunner hanging.

Mugane avatar May 18 '23 20:05 Mugane

@Mugane I created another AppRunner instance that hosts nginx configured as a proxy and cache for static files. It solved the issue I had with Google PageSpeed Insights. Also, when stress testing the app with Ddosify, it handles the traffic better.
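In case anyone wants to try the same thing, here is a rough sketch of that kind of nginx setup; the upstream App Runner URL, port, and cache settings are placeholders, not my exact config:

```nginx
# Inside the http {} block (e.g. a conf.d include):
# disk cache for static assets served through the proxy.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=static_cache:10m max_size=1g inactive=60m;

server {
    listen 8080;  # App Runner terminates TLS, so the container only needs plain HTTP

    # Cache Next.js static assets at the proxy.
    location /_next/static/ {
        proxy_cache static_cache;
        proxy_cache_valid 200 60m;
        proxy_pass https://backend-service.awsapprunner.com;  # placeholder upstream
        proxy_ssl_server_name on;
    }

    # Everything else goes straight through to the backend service.
    location / {
        proxy_pass https://backend-service.awsapprunner.com;  # placeholder upstream
        proxy_ssl_server_name on;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```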

mstoyanovv avatar May 18 '23 22:05 mstoyanovv

Hello @mstoyanovv, could you provide the service arn so that we can take a look?

smeera381 avatar May 23 '23 22:05 smeera381

Sorry for no response here for a long time on this issue. I would like to help on this. @Mugane @mikaelcabot @francoisvdv Is it possible that you can share service ARN for an affected service?

We have also moved away from AppRunner because of this issue.

But going back to the response I got from AWS Support:

The issue is related to the Fargate task not getting CPU allocated even after receiving new requests.

Unfortunately, there is no way to mitigate this issue at that point. The internal service team are working to find a solution. However, it may take some time.

... So has a fix been applied targeting this issue? (Asking to know if it's worth spending time on testing this again).

mikaelcabot avatar May 24 '23 09:05 mikaelcabot

Hello @mstoyanovv, could you provide the service arn so that we can take a look?

where can I contact you @smeera381 ?

mstoyanovv avatar May 24 '23 09:05 mstoyanovv

Hello @mstoyanovv If you could share the service arn here, I can take a look.

msumithr avatar May 24 '23 22:05 msumithr

We are having the same issue on a prod app with NextJS.

With API Gateway it works, and with App Runner it does not.

Not testing with Google but with our own NextJS website. Sometimes it hangs and we can't do anything about it.

Request latency went up at 18:30 UTC: [latency graph screenshot]

and we started getting the same errors: [error screenshots]

atrope avatar May 30 '23 20:05 atrope

Thank you @mstoyanovv. Taking a look. Hello @atrope, please feel free to share your service arn details here and we will take a look.

msumithr avatar Jun 01 '23 16:06 msumithr

arn:aws:apprunner:us-east-1:384537834093:service/genuine-project-ffub8-app/e0d044541e0c43b38894b06e88c3b36c

atrope avatar Jun 01 '23 16:06 atrope