Function App is unable to retrieve the latest AzureWebJobsStorage and keeps crashing

Open ness001 opened this issue 2 years ago • 9 comments

Check for a solution in the Azure portal

After the connection string behind the Key Vault reference of the AzureWebJobsStorage configuration item is updated, the function app is unable to proactively retrieve the latest secret value.

Investigative information

I will provide these things tomorrow. Please provide the following:

  • Timestamp:
  • Function App version: ~4
  • Function App name: can't paste here
  • Function name(s) (as appropriate): can't paste here
  • Invocation ID: Not invoked
  • Region: West US 2

Repro steps

Provide the steps required to reproduce the problem:

  1. Rotate the account key of the storage account that AzureWebJobsStorage is linked to
  2. Update the key vault secret with the latest connection string
  3. Function app may begin crashing

Expected behavior

Provide a description of the expected behavior.

Actual behavior

The function app keeps emitting traces that say JobHostStopped. I am unable to restart or stop the Function App.

Known workarounds

Delete the AzureWebJobsStorage in Configuration and add AzureWebJobsStorage immediately with the same key vault reference.

Related information

Provide any related information

  • Programming language used
  • Links to source
  • Bindings used

ness001 avatar Aug 24 '22 10:08 ness001

Hi @ness001, could you share the app name using https://github.com/Azure/azure-functions-host/wiki/Sharing-Your-Function-App-name-privately so we can check the cause? Please also provide the timestamp and region. Thanks

Ved2806 avatar Aug 25 '22 13:08 Ved2806

Thanks for the info.

I'm a user from Microsoft. Could we chat on Teams? The problem has haunted our org for some time. We want to discuss with the product team directly. @Ved2806

ness001 avatar Aug 26 '22 06:08 ness001

Hi @ness001, in that case @mattchenderson will be a good choice for discussing this.

Ved2806 avatar Aug 26 '22 08:08 Ved2806

Thanks, Ved.

ness001 avatar Aug 30 '22 01:08 ness001

I see Matthew is a PM of this product. Actually, I want to find a person from the development team of the product. Sorry for the unclear wording. Is there anyone from Microsoft who can help? @Ved2806

ness001 avatar Aug 30 '22 01:08 ness001

The steps provided are missing a change to prompt the app to fetch the latest value from Key Vault - if you are not performing a configuration-mutating restart (changing an app setting, for example), the existing value could still be used for up to 24 hours. I am assuming an app setting update to be done in my repro attempt, so the system is pulling whatever was last set in Key Vault. With that modification, I am unable to reproduce. Failures occur temporarily if the key in use is rotated (as expected), but the system recovers when a reference to a current value is provided. Using the 2-key rotation pattern should also help avoid this in general, as the existing value would still be valid past when the app switches to the newly rotated other key.
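The difference between rotating the key in use and the 2-key rotation pattern can be sketched with a toy model. This is only an illustration: `StorageAccount`, `App`, and the dictionary standing in for Key Vault are hypothetical classes, not Azure APIs, and `refresh()` merely stands in for whatever causes the app to re-fetch the secret (a configuration-mutating restart or the periodic refresh).

```python
class StorageAccount:
    """Toy stand-in for a storage account with two access keys (not an Azure API)."""

    def __init__(self):
        self._version = {"key1": 1, "key2": 1}

    def key(self, name):
        return f"{name}-v{self._version[name]}"

    def regenerate(self, name):
        # Regenerating a key invalidates its previous value immediately.
        self._version[name] += 1
        return self.key(name)

    def accepts(self, value):
        return value in (self.key("key1"), self.key("key2"))


class App:
    """Toy app that keeps using the secret it last fetched until refreshed."""

    def __init__(self, vault):
        self._vault = vault
        self.cached = vault["AzureWebJobsStorage"]

    def refresh(self):
        # Stands in for a configuration-mutating restart or the periodic re-fetch.
        self.cached = self._vault["AzureWebJobsStorage"]


def rotate_key_in_use():
    """Rotate the key the app is using; report whether the cached value still works."""
    account = StorageAccount()
    vault = {"AzureWebJobsStorage": account.key("key1")}
    app = App(vault)
    vault["AzureWebJobsStorage"] = account.regenerate("key1")
    return account.accepts(app.cached)


def rotate_two_keys():
    """Rotate the unused key first; report validity of the cached value per step."""
    account = StorageAccount()
    vault = {"AzureWebJobsStorage": account.key("key1")}
    app = App(vault)
    valid = []
    # Step 1: regenerate the key that is NOT in use and publish it to the vault.
    vault["AzureWebJobsStorage"] = account.regenerate("key2")
    valid.append(account.accepts(app.cached))
    # Step 2: the app eventually re-fetches and switches to key2.
    app.refresh()
    valid.append(account.accepts(app.cached))
    # Step 3: only now retire the old key1.
    account.regenerate("key1")
    valid.append(account.accepts(app.cached))
    return valid
```

In this model `rotate_key_in_use()` leaves the app holding a rejected key until it refreshes, while `rotate_two_keys()` keeps the cached value valid at every step, which is the property the 2-key pattern is meant to provide.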

Assuming the above does not help, please check the environment variables present using Kudu (Advanced Tools) from within the portal. What we should determine is whether the secret is properly getting resolved and updated into the environment. If you don't see the expected secret, this is a question for the Key Vault reference feature. But if you do see the secret, the question becomes why your host is failing when restarted with the new environment configuration.
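A small script can make that environment check less error-prone than reading the raw value by eye. The sketch below is an assumption-laden helper, not part of the Functions host: `describe_setting` is a hypothetical name, and it assumes that an unresolved reference shows up in the environment as the literal `@Microsoft.KeyVault(...)` string. It prints a short hash of the account key so the value can be compared against the expected key without exposing it.

```python
import hashlib
import os

# Assumption: an unresolved Key Vault reference appears as the raw reference string.
KEY_VAULT_PREFIX = "@Microsoft.KeyVault("


def describe_setting(value):
    """Classify an app setting value without revealing the secret itself."""
    if value is None:
        return {"state": "missing"}
    if value.strip().startswith(KEY_VAULT_PREFIX):
        return {"state": "unresolved-reference"}
    # Storage connection strings are ';'-separated name=value pairs.
    parts = dict(
        segment.split("=", 1) for segment in value.split(";") if "=" in segment
    )
    key = parts.get("AccountKey")
    if key is None:
        return {"state": "unrecognized"}
    # A truncated SHA-256 digest is enough to tell two keys apart.
    fingerprint = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return {
        "state": "resolved",
        "account": parts.get("AccountName"),
        "key_fingerprint": fingerprint,
    }


if __name__ == "__main__":
    print(describe_setting(os.environ.get("AzureWebJobsStorage")))
```

Running the same function against the connection string currently stored in Key Vault and comparing the two fingerprints shows whether the environment holds the current secret or a stale one.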

Could you also please update the initial issue description to complete the Known workarounds section? Right now, it just says "Delete the AzureWebJobsStorage in Configuration and" which does not help us understand the behavior fully. Removing AzureWebJobsStorage would also put the app into a bad state.

mattchenderson avatar Sep 02 '22 22:09 mattchenderson

I didn't perform a configuration-mutating restart after the key vault secret was rotated. In my case, the system didn't heal itself when it found that the connection string was not working, and I was not able to manually restart or stop the function app at that time. The problem was fixed when I deleted AzureWebJobsStorage and added it back with the same key vault reference. My question is why the function app couldn't recover by itself, even though it could detect the configuration change and retrieve the new value. I would like to understand, at the code level, where this behaviour comes from. @mattchenderson

ness001 avatar Sep 07 '22 06:09 ness001

The removal and re-adding of AzureWebJobsStorage is effectively the same as the configuration-mutating restart. Because you did that, the system was able to perform the necessary re-fetch. But I think you could have added another app setting to get the same result.
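One way to script such a harmless app setting change is to set a throwaway marker setting through the Azure CLI. The sketch below only builds the `az functionapp config appsettings set` command and does not run it; the helper name `build_touch_command` and the setting name `KV_REFRESH_MARKER` are hypothetical, and the app and resource group names are placeholders.

```python
import shlex
from datetime import datetime, timezone


def build_touch_command(app_name, resource_group,
                        marker_setting="KV_REFRESH_MARKER", now=None):
    """Build an Azure CLI command that changes one throwaway app setting.

    The marker setting name is a made-up example; any setting the app does
    not read would do. The timestamp makes each value different.
    """
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return [
        "az", "functionapp", "config", "appsettings", "set",
        "--name", app_name,
        "--resource-group", resource_group,
        "--settings", f"{marker_setting}={stamp}",
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(build_touch_command("<app-name>", "<resource-group>")))
```

Printing rather than executing keeps the sketch side-effect free; the command can be reviewed and run manually once the placeholders are filled in.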

If I'm understanding correctly, your assertion is that in the presence of an error, the system should try to refresh the secret to see if there is a new value that works, correct? That would be a feature request, and it would only really be possible if scoped to a handful of system-defined connections. Even there, I have some reservations, though it's something we can consider. Regardless, that kind of processing does not exist today. The Key Vault references feature treats the secrets opaquely and does not reason about them. It provides a set of environment variables to the app and asserts nothing about their validity. The app consumes them the same as it would a directly-configured connection string.

That you were unable to use the restart/stop actions is not expected. The app should still respond to management operations like that. I would suggest opening a support ticket and sharing the operation IDs and/or time ranges from those attempts if possible.

mattchenderson avatar Sep 13 '22 00:09 mattchenderson

Got it. Thanks for your explanation @mattchenderson

ness001 avatar Sep 22 '22 08:09 ness001