Cannot swap a V4 and V3 function deployed in separate staging slots

Open jackbatzner opened this issue 4 years ago • 21 comments

We are looking to upgrade from V3 to V4 functions and ran into an issue when testing a deployment out.

We currently have a V3 function deployed with staging slots. We deployed the V4 function to the staging slot, which previously contained a V3 function, and that worked. When we swapped the slots, we received a 400 and could not complete the operation. As soon as we set the runtime version of the production slot to ~4, the swap completed just fine.

I then created sample V3 and V4 functions to test this further. I deployed the V3 function to the staging and production slots, and both worked fine. I then deployed the V4 function to the staging slot, replacing the V3 function there. When we swapped the slots, the operation succeeded but resulted in a broken function: FUNCTIONS_EXTENSION_VERSION was not swapped. As soon as I updated the version from ~3 to ~4, the app started working.

I then attempted to implement the "fix" recommended in #925 by setting WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS to 1. This did not work either. I also tried marking it as a deployment slot setting, which did not work.

What issues do I have?

  • Cannot swap a V3 and V4 function
  • Swapping a V3 and V4 function doesn't respect the FUNCTIONS_EXTENSION_VERSION even after setting WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS to 1
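
To see what each slot actually holds before and after a swap, something like the following may help. This is a minimal sketch, assuming the Azure CLI is installed and logged in; the function name `show_extension_version` and the resource group, app, and slot names are hypothetical placeholders.

```shell
# Hypothetical helper: print the value of FUNCTIONS_EXTENSION_VERSION and
# whether it is marked as a slot setting, for one slot of a function app.
# Arguments: resource group, app name, optional slot name.
# Leaving the slot out targets the production slot.
show_extension_version() (
  group=$1
  app=$2
  slot=${3:-}
  # ${slot:+...} adds "--slot <name>" only when a slot name was given.
  az functionapp config appsettings list \
    -g "$group" -n "$app" ${slot:+--slot "$slot"} \
    --query "[?name=='FUNCTIONS_EXTENSION_VERSION'].[value, slotSetting]" \
    -o tsv
)

# Example usage (placeholder names):
#   show_extension_version my-rg my-func-app            # production
#   show_extension_version my-rg my-func-app staging    # staging slot
```

Running it against both slots before and after the swap shows whether the version moved with the code or stayed behind.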

jackbatzner avatar Dec 03 '21 21:12 jackbatzner

Confirmed the issue via an ARM based deployment (which is ultimately what we want to work). With WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=1 set in both the production and staging slots the swap failed:

##[error]Error: Failed to swap App Service 'REDACTED' slots - 'staging' and 'production'. Error: BadRequest - Cannot swap slots for site 'REDACTED' because the application initialization in 'staging' slot either took too long or failed. Please check AppInit module configuration or try using swap with preview if application initialization time is very long. (CODE: 400)
Finishing: Swap Slots: REDACTED

kendaleiv avatar Jan 19 '22 14:01 kendaleiv

@fabiocav, @v-bbalaiagar - Is this a valid workflow for upgrading an app from V3 to V4 function runtimes?

jackbatzner avatar Jan 19 '22 17:01 jackbatzner

Hi @suwatch / @ruslany , Could you please take a look at this issue.

v-bbalaiagar avatar Feb 07 '22 16:02 v-bbalaiagar

Are there any updates or help I can provide @v-bbalaiagar / @suwatch / @ruslany ?

jackbatzner avatar Feb 18 '22 14:02 jackbatzner

Also curious about the timing/ETA of this as it's currently a blocker for my team to move to .NET 6 - any updates would be much appreciated. Thanks in advance!

mkowalskigps avatar Mar 01 '22 16:03 mkowalskigps

I second this - is there any way we can make the FUNCTIONS_EXTENSION_VERSION not slot-sticky?

dxynnez avatar Mar 01 '22 16:03 dxynnez

As a .NET community we are being encouraged to move to .NET 6 / Runtime V4.

A fix for this issue is needed to avoid downtime when deploying and transitioning between versions - any updates would be much appreciated. Thanks!

mkowalskigps avatar Mar 22 '22 01:03 mkowalskigps

~~How is this 4yo problem still ongoing?~~

~~If not deployment slots, what is the proposed zero downtime upgrade solution?~~

~~Without having to deploy whole new functions and urls and migrate, preferably?~~

Edit: ok. So, read the documentation and set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 in both slots and it works as advertised, nothing to see here move along.

Process:

  1. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on deployment slot
  2. Swap slots
  3. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on deployment slot
  4. Swap slots
  5. Set FUNCTIONS_EXTENSION_VERSION=~4 on deployment slot
  6. Swap slots
  7. Done

Froosh avatar Apr 04 '22 09:04 Froosh

You rock man! I tested it and it worked!

dxynnez avatar Apr 06 '22 16:04 dxynnez

Hi all,

I'm glad folks were able to work out a solution. I just wanted to note that I've created an item internally for getting that information added to https://docs.microsoft.com/azure/azure-functions/functions-versions and https://docs.microsoft.com/azure/azure-functions/functions-app-settings, and if folks have other suggestions, please let me know.

Let's leave this item open until that content is live.

mattchenderson avatar Apr 07 '22 15:04 mattchenderson

> Edit: ok. So, read the documentation and set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 in both slots and it works as advertised, nothing to see here move along.

We ended up testing this out and had success as well. Thanks for figuring out the steps @Froosh !

I do want to call out an issue with this workflow though. I ended up finding a race condition in the Functions runtime that could cause a couple seconds of downtime for your application if you use suggested workflow above.

If you set the value in the deployment slot and then swap it to the production slot, the Functions runtime throws HostInitializationException errors. It appears that the Functions runtime removes FUNCTIONS_EXTENSION_VERSION if the source and destination slots do not have the setting configured.

I was able to test this and confirm the theory. I would like to revise the suggested steps as follows:

  1. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on production slot
  2. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on deployment slot
  3. Set FUNCTIONS_EXTENSION_VERSION=~4 on deployment slot
  4. Swap slots
  5. Done
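
The revised steps above can be sketched with the Azure CLI roughly as follows. This is only an illustration under stated assumptions: the function name `upgrade_with_production_setting` and the group, app, and slot names are hypothetical, and the commands are the same `az functionapp` subcommands used elsewhere in this thread.

```shell
# Hypothetical wrapper for the four revised steps.
# Arguments: resource group, app name, deployment slot name.
upgrade_with_production_setting() (
  group=$1
  app=$2
  slot=$3

  # Step 1: set the override on production (no --slot means production).
  az functionapp config appsettings set -g "$group" -n "$app" \
    --settings WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 || exit 1

  # Steps 2 and 3: set the override and the new runtime version on the
  # deployment slot in a single call.
  az functionapp config appsettings set -g "$group" -n "$app" --slot "$slot" \
    --settings WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 \
               "FUNCTIONS_EXTENSION_VERSION=~4" || exit 1

  # Step 4: swap the deployment slot into production.
  az functionapp deployment slot swap -g "$group" -n "$app" \
    --slot "$slot" --target-slot production
)

# Example usage (placeholder names):
#   upgrade_with_production_setting my-rg my-func-app staging
```

Each step stops the sequence if it fails, so a failed settings update does not lead to a swap.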

@mattchenderson - Can we add this as a note to the updated documentation? I want to ensure that people are aware of this race condition. If the right steps aren't followed people will experience downtime during upgrade which is not desirable.

jackbatzner avatar Apr 11 '22 15:04 jackbatzner

@jackbatzner For step one, I think that kind of defeats the purpose of a zero-downtime upgrade - if you change any setting on the production slot, it's going to restart.

The steps that @Froosh suggested were to apply WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 first to the staging slot, and then swap to production. That is to avoid applying any setting directly on the production slot. Was it not working for you?

dxynnez avatar Apr 11 '22 16:04 dxynnez

@dxynnez - the suggested approach worked flawlessly for us. The concern here is the behavior of the Azure Functions runtime where it removes the FUNCTIONS_EXTENSION_VERSION setting while swapping slots when the WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS is set to 0. I was noticing exceptions being thrown in the production slot as a result, which means there could be some noticeable downtime in an application.

@mattchenderson / @v-bbalaiagar - Could we have someone from the team assist us with this issue? This behavior seems off to me. I wouldn't expect exceptions to be thrown in the production slot if we are swapping two working slots at different versions.

jackbatzner avatar Apr 14 '22 20:04 jackbatzner

> I was noticing exceptions being thrown in the production slot as a result, which means there could be some noticeable downtime in an application.

I was able to confirm that there is a brief moment of downtime when doing this swap for the first time (Step 1 in sections listed above).

jackbatzner avatar Apr 19 '22 17:04 jackbatzner

Interesting. Presuming you have/had FUNCTIONS_EXTENSION_VERSION=~3 on the initial 'deployment' slot and 'production'?

I'll have to specifically re-test when I get a chance.

Froosh avatar Apr 27 '22 11:04 Froosh

Correct @Froosh . On the "initial" deploy both slots (staging and production) had FUNCTIONS_EXTENSION_VERSION set to ~3.

@mattchenderson , @v-bbalaiagar , @anthonychu - can someone investigate this issue further? This is going to cause issues for customers if they want to migrate Functions from V3 to V4 with zero downtime.

jackbatzner avatar Apr 27 '22 13:04 jackbatzner

Apologies for missing the updates. Yes, we can get more captured on these fronts (I will be working on doc updates today) and see if there are other improvements that can be made to smooth over the process in general. CC @fabiocav

mattchenderson avatar Apr 27 '22 15:04 mattchenderson

Out of curiosity, was anyone here hitting issues with enabling .NET 6 for the app as part of this process? I didn't see that part mentioned, but the /config/web.netFrameworkVersion should be set to "v6.0" as well if running on Windows, I believe. That theoretically should just go with the change to FUNCTIONS_EXTENSION_VERSION, but I wanted to check if that's aligned with what folks are seeing here.
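
For anyone who wants to check this on their own app, a quick way to read the current value might look like the sketch below. It assumes the Azure CLI is available; the function name `show_net_framework_version` and the resource names are hypothetical placeholders.

```shell
# Hypothetical helper: print the netFrameworkVersion site config property.
# Arguments: resource group, app name, optional slot name.
# Leaving the slot out targets the production slot.
show_net_framework_version() (
  group=$1
  app=$2
  slot=${3:-}
  # ${slot:+...} adds "--slot <name>" only when a slot name was given.
  az functionapp config show \
    -g "$group" -n "$app" ${slot:+--slot "$slot"} \
    --query netFrameworkVersion -o tsv
)

# Example usage (placeholder names):
#   show_net_framework_version my-rg my-func-app staging
```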

mattchenderson avatar Apr 27 '22 20:04 mattchenderson

Ok, so I poked around at this a bit further...

If both slots do not have WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0, the swap operation does indeed just delete FUNCTIONS_EXTENSION_VERSION from the source slot. I didn't ever see a host error with this. If the details of any errors could be provided, that would be extremely helpful. I was admittedly using simple or in some cases empty payloads.

Now, the originally production host should only have entered any error state after it stopped actually being production, as that's when we would have removed the setting in question during its restart as the new staging. The once-staging-now-production instance should be up and running with no issues (presumably) - having both the extension version and WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 - before we touch the once-production-now-staging slot. That's the theory at least.

Regardless, that behavior of deleting FUNCTIONS_EXTENSION_VERSION still feels like a bug (though I can kind of understand it), but it does point to an issue with the steps as written in this comment since production would end up in a state where it didn't have a version specified. I think the trick here is to augment the third step with an explicit runtime version assignment. I'm also adding in the netFrameworkVersion for completeness:

  1. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on deployment slot
  2. Swap slots
  3. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 and FUNCTIONS_EXTENSION_VERSION=~3 on deployment slot
  4. Swap slots
  5. Set FUNCTIONS_EXTENSION_VERSION=~4 and (if Windows) config property netFrameworkVersion=v6.0 on deployment slot
  6. Swap slots
  7. Done

The reason I prefer this to setting WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on the production slot is because we are trying to eliminate downtime, and technically the production config change would be that, too. If this addresses the host issues, I think it is preferable. Now, I'd also opt to maybe collapse steps 3-5 since we can just set everything on the deployment slot all at once, with the caveat that you would need to align the deployment payloads themselves since it's an even number of swaps. So that would give us:

  1. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on deployment slot
  2. Swap slots
  3. Set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 and FUNCTIONS_EXTENSION_VERSION=~4 and (if Windows) config property netFrameworkVersion=v6.0 on deployment slot
  4. Swap slots

Here's the script I was playing with that seems like it's doing the right things overall:

# Get production configured with WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS via a swap
az functionapp config appsettings set --settings WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0  -g $groupName -n $appname --slot $slotName
az functionapp deployment slot swap -g $groupName -n $appname --slot $slotName --target-slot production

# Get staging configured with WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS and the new version
az functionapp config appsettings set --settings WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 -g $groupName -n $appname --slot $slotName
az functionapp config appsettings set --settings FUNCTIONS_EXTENSION_VERSION=~4 -g $groupName -n $appname --slot $slotName
# For Windows function apps only, also enable .NET 6.0 that is needed by the runtime
az functionapp config set --net-framework-version v6.0 -g $groupName -n $appname --slot $slotName

# Swap to migrate
az functionapp deployment slot swap -g $groupName -n $appname --slot $slotName --target-slot production

Does this sequence seem reasonable? I'd want to confirm that any errors that had been seen no longer appear.

I've drafted docs changes that align to the approach mentioned in this comment (which has you change an app setting directly on the production slot), but I'd like to change it over to use this zero-downtime variation if it indeed works.
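
Since the swap can drop FUNCTIONS_EXTENSION_VERSION in some cases, it may be worth checking the value on a slot after each swap. The following is a minimal sketch of such a check, not part of the script above; `check_runtime_version` and the resource names are hypothetical, and it assumes the Azure CLI is installed and logged in.

```shell
# Hypothetical check: compare FUNCTIONS_EXTENSION_VERSION on a slot with
# an expected value. Exits non-zero on a mismatch or a missing setting.
# Arguments: expected value, resource group, app name, optional slot name.
check_runtime_version() (
  expected=$1
  group=$2
  app=$3
  slot=${4:-}
  actual=$(az functionapp config appsettings list \
    -g "$group" -n "$app" ${slot:+--slot "$slot"} \
    --query "[?name=='FUNCTIONS_EXTENSION_VERSION'].value | [0]" -o tsv)
  if [ "$actual" = "$expected" ]; then
    echo "ok: $actual"
  else
    echo "mismatch: expected $expected, got '$actual'" >&2
    exit 1
  fi
)

# Example usage (placeholder names), after the final swap:
#   check_runtime_version '~4' my-rg my-func-app
```

This only confirms the setting itself; host errors during the swap would still need to be looked for in the app's logs.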

mattchenderson avatar Apr 27 '22 23:04 mattchenderson

You do not have to do this 3/2 step process. You can add an additional appsettings step to your template that just adds the version to your production slot. Add this part to your ARM template after your App Service / Function App resource ("type": "Microsoft.Web/sites"):

{
  "name": "[concat(parameters('functionAppName'), '/', 'appsettings')]",
  "type": "Microsoft.Web/sites/config",
  "apiVersion": "2021-01-15",
  "dependsOn": [
      "[resourceId('Microsoft.Web/sites',parameters('functionAppName'))]"
  ],
  "properties": {
      "FUNCTIONS_EXTENSION_VERSION": "~4",
      "FUNCTIONS_WORKER_RUNTIME": "dotnet"
  }
}

snehaguptagithub avatar Sep 07 '22 22:09 snehaguptagithub

As far as I know this will overwrite all your app settings in the production slot each time you deploy. And even if it doesn't, it'll trigger some downtime since you're changing configuration.
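
If the goal is only to change the runtime version without redeploying the whole appsettings resource, one option might be the CLI. To my understanding, `az functionapp config appsettings set` updates the named keys and leaves other existing settings in place, but that is an assumption worth verifying on a test app, and the downtime concern still applies to whichever slot it targets. The function name and resource names below are hypothetical.

```shell
# Hypothetical helper: set only FUNCTIONS_EXTENSION_VERSION on one slot.
# Arguments: resource group, app name, optional slot name.
# Leaving the slot out targets the production slot.
set_runtime_version_only() (
  group=$1
  app=$2
  slot=${3:-}
  # ${slot:+...} adds "--slot <name>" only when a slot name was given.
  az functionapp config appsettings set \
    -g "$group" -n "$app" ${slot:+--slot "$slot"} \
    --settings "FUNCTIONS_EXTENSION_VERSION=~4"
)

# Example usage (placeholder names):
#   set_runtime_version_only my-rg my-func-app staging
```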

maartenkools avatar Sep 08 '22 03:09 maartenkools