azure-functions-host

Investigate how to handle the situation where Disabled Functions can continue to run during a Slot Swap until the swap is completed.

Open FinVamp1 opened this issue 3 years ago • 8 comments

What problem would the feature you're requesting solve? Please describe.

The issue is that during a slot swap, one of the phases warms up the slot being swapped in using the Production slot's App Settings. This can result in a Function within the app switching to enabled, because the App Setting that disables it is no longer set. This won’t affect HTTP triggers, as the bindings have not been updated yet, but it will affect non-HTTP activations.

Describe the solution you'd like

This could be a very temporary issue or could lead to additional processing of Events depending on how long the Warmup of the slot takes before it is swapped into production.

We need either some way to respect the disabled setting during the swap, or another configuration flag to detect customer intent, e.g. a feature flag or a configuration setting along the lines of "do not enable disabled Functions". The issue here is that when the production app settings are applied to the slot, we no longer consider the Function to be disabled.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Slots Swap Process

In this case the slot swap status went like this.

  1. The App Settings and Connection Strings that are marked as “Slot” are read from the Production slot and applied to the site in the Staging slot. This removed the FunctionApp.Disabled App Setting and resulted in the TimerTrigger being enabled again. Once the App Settings are applied to the Staging slot, the Staging slot restarts.
  2. The site is now in Warmup mode. In this mode we send HTTP requests to the app to make sure it’s healthy. This causes the Functions Host to start, along with any Functions that are not disabled.
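
To make step 1 concrete, here is a minimal sketch of how the staging site's effective settings could change during warmup. It is a simplified model, not the App Service implementation: the class and method names are hypothetical, it assumes a single set of slot-specific ("Slot") setting names shared by both slots, and it assumes a function counts as disabled when its AzureWebJobs.<FunctionName>.Disabled setting is "1" or "true".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the warmup phase described above; not an Azure API.
public static class SwapWarmupModel
{
    // Settings the staging site would see during warmup: its own
    // non-slot-specific settings plus production's slot-specific ones.
    public static Dictionary<string, string> SettingsDuringWarmup(
        IReadOnlyDictionary<string, string> stagingSettings,
        IReadOnlyDictionary<string, string> productionSettings,
        ISet<string> slotSpecificKeys)
    {
        var effective = new Dictionary<string, string>();

        // Staging's slot-specific values are dropped for the warmup.
        foreach (var pair in stagingSettings)
        {
            if (!slotSpecificKeys.Contains(pair.Key))
            {
                effective[pair.Key] = pair.Value;
            }
        }

        // Production's slot-specific values are applied in their place.
        foreach (var pair in productionSettings)
        {
            if (slotSpecificKeys.Contains(pair.Key))
            {
                effective[pair.Key] = pair.Value;
            }
        }

        return effective;
    }

    // Assumption: "1" or "true" marks the function as disabled.
    public static bool IsDisabled(
        IReadOnlyDictionary<string, string> settings, string functionName)
    {
        string key = $"AzureWebJobs.{functionName}.Disabled";
        return settings.TryGetValue(key, out string value)
            && (value == "1" || string.Equals(value, "true", StringComparison.OrdinalIgnoreCase));
    }
}
```

In this model, if staging sets a slot-specific disabled flag for the timer and production has no such setting, the flag is absent from the warmup settings, so IsDisabled returns false and the timer is treated as enabled.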

The disable Functions document shows how this can be mitigated. https://learn.microsoft.com/en-us/azure/azure-functions/disable-function?tabs=portal#functions-in-a-slot

FinVamp1, Dec 22 '22 19:12

Putting a note here because I always forget how this works. The piece I always forget:

During a swap, one step is that App Service will start the stage slot with the production app settings and ping it to ensure it is not completely busted before swapping.

Timer triggers do not like this because they acquire their lock based on the site name (specifically the site host name) -- so you can end up with a timer firing from both a prod slot and a stage slot.

brettsam, Jun 22 '23 16:06

I've just created a repro with a timer that does this (writing to App Insights):

[FunctionName("Timer")]
public void Run([TimerTrigger("0 */5 * * * *", RunOnStartup = true)] TimerInfo myTimer, ILogger log)
{
    string slotName = Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME");
    string onlyTrueOnProd = Environment.GetEnvironmentVariable("OnlyTrueOnProd");

    log.LogInformation($"Slot:  {slotName}; OnlyTrueOnProd: {onlyTrueOnProd}");
}

I've created a slot and:

  • enabled the function on production and disabled on staging
  • set a production-only app setting of OnlyTrueOnProd: true
  • done a few swaps

In App Insights I can clearly see log entries with OnlyTrueOnProd: true coming from the background slot during a swap. This is very brief, as the slot is only starting up, being pinged for health, then shutting down. I've explicitly set RunOnStartup so that my timer is guaranteed to run (a lot of the time it otherwise won't).

For starters, it feels wrong that this timer would ever fire from a staging slot -- but seeing the log confirms that this is the "run staging slot with prod settings" warmup phase.

image

brettsam, Jun 22 '23 18:06

Given all of this above -- I think that a suitable workaround to ensure that a disabled timer only runs in production would be to add something like this to the top of the Timer function. That way if/when it does fire in a different slot, it will be a no-op.

// null here would indicate running locally
string slotName = Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME");
if (slotName != null && !slotName.Equals("production", StringComparison.OrdinalIgnoreCase))
{
    return;
}
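
As a hedged variation on the snippet above, the same check can be pulled into a small helper so it can be unit tested without setting environment variables. The SlotGuard class and its method names are hypothetical, not part of the Functions runtime. The helper only restates the condition above, so it relies on WEBSITE_SLOT_NAME reflecting the slot the code is actually running in.

```csharp
using System;

// Hypothetical helper; only WEBSITE_SLOT_NAME comes from the snippet above.
public static class SlotGuard
{
    // True when the function body should run: locally (slot name not set)
    // or in the production slot. Any other slot name means "no-op".
    public static bool ShouldRun(string slotName)
    {
        return slotName == null
            || slotName.Equals("production", StringComparison.OrdinalIgnoreCase);
    }

    // Convenience overload that reads the current environment.
    public static bool ShouldRunInCurrentSlot()
    {
        return ShouldRun(Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME"));
    }
}
```

At the top of the timer function this would read: if (!SlotGuard.ShouldRunInCurrentSlot()) return;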

brettsam, Jun 22 '23 19:06

Not a fan of this suggestion. Code shouldn't need to be aware of environments or slots and take special action in these cases. We inject environment-specific configuration for this exact reason. The issue here is that the runtime is executing something we told it should be disabled.

wjdavis5, Jun 23 '23 16:06

@wjdavis5 -- 100% agree. This is a workaround.

We're discussing internally what we can do here but it may involve some infrastructure updates that can take a while to roll out. I wanted people to know that there is an option if they're running into this problem.

Once there's a true fix, it'll be captured here.

brettsam, Jul 12 '23 21:07

The WEBSITE_SLOT_NAME check suggested above isn't a workaround, because WEBSITE_SLOT_NAME isn't updated immediately after a swap. Even worse, on Linux, depending on how many instances there are, it won't be updated until the instances are recycled.

image

As you can see in the screenshot, this is the behavior I'm seeing on a Linux function that was swapped. It correctly runs ONLY the new code after the swap, but WEBSITE_SLOT_NAME is not updated on all instances.

On a Windows instance we see the function trigger only once with the incorrect WEBSITE_SLOT_NAME, probably because it restarts immediately after the swap.

bcrispcvna, Jul 17 '23 16:07

As a side note, for our workaround we're using the managed service identity, since it seems the most reliable option at this point.

bcrispcvna, Jul 17 '23 16:07

I'm not sure what the status of this is, but it is a serious issue for us that impacts our ability to use slots with TimerTrigger. The workaround proposed here doesn't seem to work: on Windows-hosted apps, when performing a swap between a "production" and a "staging" slot, the WEBSITE_SLOT_NAME environment variable on the staging site is updated to "production" while the swap is being performed. This then allows the staging site to execute the timer trigger.

staging site vars before swap:

WEBSITE_SLOT_NAME = staging
AzureWebJobs.MyFunction.Disabled = 1

staging site vars during swap:

WEBSITE_SLOT_NAME = Production
AzureWebJobs.MyFunction.Disabled = 0

staging site vars after swap:

WEBSITE_SLOT_NAME = staging
AzureWebJobs.MyFunction.Disabled = 1
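
To spell out what these values mean for the proposed check, here is a small sketch that evaluates the three observed states. The class and method names are hypothetical, and treating "1" or "true" as "disabled" is an assumption of the sketch. The slot-name condition is the one from the workaround earlier in the thread.

```csharp
using System;

// Hypothetical sketch; evaluates the observed values, not live settings.
public static class SwapObservation
{
    // Same condition as the workaround: run locally or in "production".
    public static bool GuardAllowsRun(string slotName)
    {
        return slotName == null
            || slotName.Equals("production", StringComparison.OrdinalIgnoreCase);
    }

    // Assumption: "1" or "true" means the function is disabled.
    public static bool IsDisabled(string disabledValue)
    {
        return disabledValue == "1"
            || string.Equals(disabledValue, "true", StringComparison.OrdinalIgnoreCase);
    }

    // The timer body executes only if the function is enabled
    // and the slot-name guard lets it through.
    public static bool TimerWouldExecute(string slotName, string disabledValue)
    {
        return !IsDisabled(disabledValue) && GuardAllowsRun(slotName);
    }
}
```

With the values above, TimerWouldExecute("staging", "1") is false before and after the swap, while TimerWouldExecute("Production", "0") is true during the swap, which matches the staging site executing the timer trigger.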

jimSampica, May 02 '24 14:05