AspNetCoreModule

App Warmup failed under stress if backend process failed to start

Open pan-wang opened this issue 7 years ago • 60 comments

Reported from App Service that App WarmUp sometimes fails for a Core app. So far we cannot repro it on a local server. One possibility is how ANCM handles a dotnet.exe start failure, i.e., the child request used for retry. Will change the retry mechanism to avoid the child request.

pan-wang avatar Nov 23 '17 04:11 pan-wang

I am seeing issues very consistent with this behavior. ASP.NET Core 2.0 app hosted on Azure App Service. Occasionally, any condition that results in a restart or recycle of the app causes it to stop responding and return 502.3 errors consistent with Kestrel timeouts. The app does not recover unless we manually restart the App Service, change the instance count, or perform a slot swap.

nforysinski avatar Dec 22 '17 22:12 nforysinski

We get 500s after a platform event has occurred, such as a restart or a new worker being assigned, and definitely not when the app is under load. The only way to resolve it is to restart the App Service.

olitomlinson avatar Dec 22 '17 23:12 olitomlinson

@nforysinski @dotdestroyer if you guys have any logs (application or ETW), feel free to share with me at my personal email. We are investigating this issue.

jkotalik avatar Dec 22 '17 23:12 jkotalik

@jkotalik I sent you some info, but wanted to ask whether you thought rolling back to a previous version of .NET Core could help. We are running on 2.0.1 and don't mind rolling back to 1.x for the time being until there is a fix. This problem is super disruptive to us (and made worse by being random and unexpected), so we wanted to see if there was any kind of workaround to get us through in the meantime.

Also wanted to share that all of the error logging in Azure is not helpful, because it appears nothing is making it to our app pipeline. There are no event logs for the 502 that occurs, and there is nothing in App Insights for it. It's as if the requests never make it to our app at all, and the w3wp process keeps restarting over and over. This is often accompanied by extremely high thread counts.

nforysinski avatar Dec 23 '17 02:12 nforysinski

It is not related to the ASP.NET Core version. The issue is that the AppWarmup module sends a fake request to warm up the application and waits synchronously on its I/O completion. For some reason, the I/O callback never happens (it could be a leaked request in the ASP.NET Core Module, which is a global singleton not controlled by the app, or a bug in the I/O layer), and that causes the IIS worker process to hang. So far we cannot repro the issue on a local IIS server. The simple workaround is to remove the warmup module from your site (if your app does not use the AppWarmup feature) by using a site extension. We are working on a change to the ASP.NET Core Module to simplify the process-start logic. It will help reduce the chance of process start failure.

pan-wang avatar Dec 23 '17 02:12 pan-wang

@pan-wang we are hosting on Azure App Service, so we don't have that level of control over IIS, unless I'm missing something.

nforysinski avatar Dec 23 '17 03:12 nforysinski

App Service does allow users to tune IIS settings via a site extension. One of our CCS did this for another customer. Let me ping him to get the whole solution and then update you. Please expect some delay due to the holidays.

pan-wang avatar Dec 23 '17 03:12 pan-wang

How.to.disable.preload.docx

@nforysinski could you please try the steps in the attached doc to disable app warmup?

pan-wang avatar Dec 23 '17 04:12 pan-wang

@pan-wang we set this up a few days ago and everything seems to be working much better. Several app restarts have come through fine, and the application hasn't gone into the 502 state we were seeing previously.

nforysinski avatar Dec 27 '17 22:12 nforysinski

Could you explain what preload is and why disabling it will likely fix this issue? What are the cons of disabling preload?

Also, what is the warmup module? How would I know if I am using it?

rollsch avatar Jan 02 '18 02:01 rollsch

@pan-wang : When I try to create the file applicationHost.xdt I get the following error:

409 Conflict: Could not write to local resource 'D:\local\Config\applicationhost.xdt' due to error 'Access to the path 'D:\local\Config\applicationhost.xdt' is denied.'.

Am I creating the file in the wrong folder? I assumed the file had to live next to the original (applicationhost.config).

Also, what is the easiest way to test whether preloading is disabled, so I can see if the workaround has been implemented successfully?

nphmuller avatar Jan 02 '18 10:01 nphmuller

@nphmuller - The .xdt file needs to go at the site root - d:\home\site. If you look at the applicationhost.config after a restart, you should see the change reflected in that file.

veatcha avatar Jan 02 '18 14:01 veatcha

@nphmuller See also https://github.com/projectkudu/kudu/wiki/Azure-Site-Extensions#debugging-private-extensions for more general info on using/debugging this xdt file.

@pan-wang can you update your Word doc to indicate where the xdt should be created?

davidebbo avatar Jan 02 '18 17:01 davidebbo

@davidebbo I have updated the doc as you suggested.

@rolandh The AppWarmup module (aka Application Initialization) is a native IIS module. It is useful when your managed application needs a long time to warm up (e.g., loading assemblies, setting up SQL connections, etc.) and you want it to serve a custom page during warm-up instead of leaving clients waiting/spinning. If you don't know about this module, most likely you are not using it.

pan-wang avatar Jan 02 '18 18:01 pan-wang

I'm facing a similar problem. I noticed that this happens after restarting the server, not the application; it is specifically a server restart. If you look at the "System up time" field in Kudu (SCM), it correlates perfectly with the period the site is unreachable. To reproduce this I tried to restart the server, but I did not find a way to do it. Is there some way to do this, or maybe to simulate it?

It may be linked with #226.

sergei66666 avatar Jan 04 '18 08:01 sergei66666

I see this too, but as a ~30 second delay for some calls. All method calls are definitely affected; I had to write a script to call all my APIs once after a deployment so my users do not notice the slowness.

"We get 500s after a platform event has occurred such as a restart or a new worker Is assigned, and definitely not when the App is under load."

In my case, under Service Fabric, the issue goes away after the first call.

Azure ticket: 117121117302812

Note that for me, this problem was not there with the 1.1 or 2.0 releases. It started happening out of the blue around the time the ticket was opened. I may have updated all my NuGet packages to the latest versions around that time.

vdevappa avatar Jan 04 '18 15:01 vdevappa

@davidebbo and @pan-wang Is there an easy way to script this applicationhost.xdt file addition that appears to resolve this issue? We have numerous ASP.NET Core APIs that can benefit from this, and it seems like this could be generalized and scripted.

prejeanb avatar Jan 05 '18 21:01 prejeanb

@prejeanb you should be able to upload the file via the Kudu vfs API.

davidebbo avatar Jan 06 '18 02:01 davidebbo
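For anyone scripting this without PowerShell, here is a minimal sketch of that upload as a single PUT to the Kudu VFS API using HttpClient. The site name and publishing credentials are placeholders (assumptions, not values from this thread), and it assumes the transform file is named applicationhost.xdt and should land at d:\home\site as described above; treat it as an illustration, not the exact script attached later in the thread.

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class UploadXdt
{
    static async Task Main()
    {
        // Hypothetical placeholders -- substitute your app name and publishing (deployment) credentials.
        var site = "<your-app>";
        var user = "<publishing-user>";
        var pass = "<publishing-password>";

        using (var client = new HttpClient())
        {
            var token = Convert.ToBase64String(Encoding.ASCII.GetBytes($"{user}:{pass}"));
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);
            // If-Match: * lets the PUT overwrite a file that already exists.
            client.DefaultRequestHeaders.IfMatch.Add(EntityTagHeaderValue.Any);

            // VFS paths are relative to d:\home, so this targets d:\home\site\applicationhost.xdt.
            var url = $"https://{site}.scm.azurewebsites.net/api/vfs/site/applicationhost.xdt";
            using (var content = new StreamContent(File.OpenRead("applicationhost.xdt")))
            {
                var response = await client.PutAsync(url, content);
                Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
            }
        }
    }
}

After the upload, restart the web app so the transform is applied.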

Thanks @davidebbo. I created a PowerShell script that automates this using the Kudu vfs API. Note: it does require Azure PowerShell to automate the restart of the web app.
aspnetcorefix.zip

prejeanb avatar Jan 09 '18 18:01 prejeanb

This wasn't mentioned in the post above, but you can create a generic transform that can be applied to any site like so:

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <system.applicationHost>
    <sites>
      <site name="%XDT_SITENAME%" xdt:Locator="Match(name)">
        <application preloadEnabled="false" xdt:Transform="SetAttributes(preloadEnabled)">
        </application>
      </site>
    </sites>
  </system.applicationHost>
</configuration>

The %XDT_SITENAME% value will be pulled from Kudu. There are a few other variables you can use as well; they are listed in the Kudu wiki.

xt0rted avatar Jan 12 '18 04:01 xt0rted

@xt0rted - Thanks for sharing! That simplifies the PowerShell script some. Here is the simplified script... aspdotnetcorefixsimplified (2).zip

prejeanb avatar Jan 12 '18 20:01 prejeanb

I just wanted to add my experience in case it helps in this. I'm working with a Microsoft Support employee who's helping me with the following exception that occurs periodically during slot swapping and during auto-scaling operations of my Azure App Service:

Message: Failed to bind to address http://127.0.0.1:6629: address already in use. Exception: Microsoft.AspNetCore.Server.Kestrel.Transport.Libuv.Internal.Networking.UvException

He suggested I try this solution. I haven't yet since I'm trying to understand how it applies, but maybe it'll help your investigation.

joshmouch avatar Jan 18 '18 16:01 joshmouch

@pan-wang I don't understand this issue, particularly in the address-in-use case. Here's why:

  1. Any time Kestrel logs "Failed to bind to address http://: address already in use.", it should always throw an exception from KestrelServer.Start and, in turn, WebHost.Run.
  2. This exception should bubble up through Program.Main and crash the dotnet process.
  3. ANCM should observe this process crash and attempt to restart the dotnet process with a new random port, which should have a very good chance of succeeding in most scenarios.
  4. If it doesn't succeed a second time, ANCM should retry up to 10 times before ultimately failing.

Do we have any idea at which of the above steps things are falling apart? Based on other reports I've seen, the logs don't support the idea that there are 10 process start failures in a row. Do we even know if the dotnet process is crashing?

I'm starting to wonder if a bunch of applications out there are swallowing exceptions from WebHost.Run/Start, and that's what is leading to this strange behavior.

@JosephTremoulet Do you have any insight at which stage in the process crash/recovery cycle things are stalling?

halter73 avatar Jan 25 '18 01:01 halter73
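For context on the exception-swallowing theory above, here is a minimal sketch of the standard ASP.NET Core 2.0 entry point; Startup is assumed to be the app's usual startup class. The key point is that nothing catches the exception from Run(), so a bind failure crashes dotnet.exe and ANCM can observe the crash and retry on a new port.

using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public class Program
{
    public static void Main(string[] args)
    {
        // If Kestrel fails to bind (e.g. "address already in use"), the exception thrown
        // by Run() escapes Main and crashes the dotnet process -- which is what lets ANCM
        // detect the failure and retry with a new random port. Wrapping this in a broad
        // try/catch that swallows the exception would hide the start failure and leave
        // the process alive but not listening.
        BuildWebHost(args).Run();
    }

    public static IWebHost BuildWebHost(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            .UseStartup<Startup>()
            .Build();
}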

I just wanted to add my experience in case it helps in this. I'm working with a Microsoft Support employee who's helping me with the following exception that occurs periodically during slot swapping and during auto-scaling operations of my Azure App Service:

Message: Failed to bind to address http://127.0.0.1:6629: address already in use. Exception: Microsoft.AspNetCore.Server.Kestrel.Transport.Libuv.Internal.Networking.UvException

He suggested I try this solution. I haven't yet since I'm trying to understand how it applies, but maybe it'll help your investigation.

I was also sent here by a Microsoft support employee. My problem occurs during instance changes of my Web App hardware, which result in random "AccessDenied" errors when accessing all of my DLLs. This locks the app, as none of the DLLs can be accessed until the entire Web App is restarted (not just the process; the actual Web App environment must be stopped, then started again).

I've since done this and have not seen the error again (it was sporadic and would happen twice a week, once a month, etc.), but I can't be sure it is fixed.

Can anyone comment on how this issue would relate to all of my .NET DLLs returning access denied when they are loaded? The case number is 117120117249053 if any other employees want to have a look.

rollsch avatar Jan 25 '18 01:01 rollsch

The root cause is that the AppWarmup module sends a fake request to start the IIS pipeline and waits for the I/O completion of that fake request. ANCM receives the fake request and tries to start the backend process. If starting fails the first time, ANCM creates a child request to try again. Somehow, the I/O completion for the parent/child request gets lost, and the AppWarmup module holds the IIS pipeline forever. This error happens with the parent/child request pattern. I will simplify the start logic to abandon the child request in the new code.

pan-wang avatar Jan 25 '18 01:01 pan-wang

@halter73 It sounds like you are saying this exception can be ignored, although I would wonder why a random port has a 99% chance of already being in use.

joshmouch avatar Jan 25 '18 01:01 joshmouch

@rolandh if you hit access denied when loading your assemblies, you definitely have a different problem. Do you have any logs showing this access denied error?

@joshmouch ANCM will retry starting the application, as @halter73 mentioned. In a multitenant scenario, the chance of a port conflict gets higher as the number of hosted applications increases, especially since some applications may consume more ports, e.g., when using SQL connections.

pan-wang avatar Jan 25 '18 04:01 pan-wang

@pan-wang yes I have a massive amount of logs, memory dumps and diagnostic information attached to my support case. Are you an MS employee who can access it? REQ:117120117249053

rollsch avatar Jan 25 '18 04:01 rollsch

@rolandh I don't have permission to access your case. Could you please share some logs on OneDrive, or shoot me an email at panwang"AT"microsoft"dot"com? I can take a look at them.

pan-wang avatar Jan 25 '18 04:01 pan-wang

I now understand the root cause of this warm-up issue. It seems ANCM did not handle the async return of the child request (which is created for retry purposes in case starting the Kestrel process fails). Will fix it soon.

pan-wang avatar Jan 25 '18 23:01 pan-wang