azure-functions-host
Improve file change detection to be more reliable
It appears that file change notifications are highly unreliable, especially with Azure Files. The UI has a lot of issues when the runtime misses file change notifications for whatever reason. Maybe periodic polling of the file system for changes, or an API to force a host restart, could help users force recompiling or reloading their functions.
@ahmelsayed - Do you have any specific scenarios where this happens frequently? We've investigated these issues in the past and it can be hard to reproduce reliably to test. We have an improvement to try to centralize and improve these notifications.
No specific scenarios. We just know that it happens and, as you said, it is very hard to reproduce or track down. It usually involves changes to run.csx that require a recompilation.
clearing milestone to discuss again in triage
Should split this up into the improvements we decide to go with.
Moving this to Triaged. Just had a CRI where almost all instances for a site went down for a couple of reasons:
- They missed the event that app_offline.htm was deleted, so they stayed offline forever.
- They started up and weren't able to find (at least) one assembly. So the host threw exceptions like "Cannot find function 'Run'" -- this disabled all functions and left it in an effectively offline state.
This behavior also causes the data role to actively refuse scale-out requests as it looks like no functions are running so it doesn't want to overprovision.
I agree with @ahmelsayed's original suggestion -- our FileSystemWatcher code should be more resilient. Maybe maintain an internal list of files and periodically check by itself and fire events, rather than relying on the OS events?
Another option is to get more strict about our "Run from package" suggestion -- this case would have been prevented by this.
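To make the "maintain an internal list of files and periodically check" suggestion concrete, here is a minimal sketch in Python rather than the host's own code. The names `take_snapshot`, `diff_snapshots`, and `PollingWatcher` are hypothetical and not part of the Functions host. It assumes a change in modification time or size is a good enough signal that a file changed.

```python
import os
from typing import Callable, Dict, List, Tuple

# path -> (mtime in nanoseconds, size in bytes)
Snapshot = Dict[str, Tuple[int, int]]
Event = Tuple[str, str]  # (kind, path)


def take_snapshot(root: str) -> Snapshot:
    """Walk the tree under root and record mtime and size for every file."""
    snap: Snapshot = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between the walk and the stat
            snap[path] = (st.st_mtime_ns, st.st_size)
    return snap


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Event]:
    """Compare two snapshots and derive created/deleted/changed events."""
    events: List[Event] = []
    events += [("created", p) for p in new.keys() - old.keys()]
    events += [("deleted", p) for p in old.keys() - new.keys()]
    events += [("changed", p) for p in old.keys() & new.keys() if old[p] != new[p]]
    return sorted(events)


class PollingWatcher:
    """Hypothetical watcher that fires events itself instead of trusting OS notifications."""

    def __init__(self, root: str, on_event: Callable[[str, str], None]) -> None:
        self.root = root
        self.on_event = on_event
        self._snapshot = take_snapshot(root)

    def poll_once(self) -> List[Event]:
        """Take a fresh snapshot, fire an event per difference, and remember the new state."""
        current = take_snapshot(self.root)
        events = diff_snapshots(self._snapshot, current)
        self._snapshot = current
        for kind, path in events:
            self.on_event(kind, path)
        return events
```

A caller would invoke `poll_once()` on a timer. The trade-off is extra I/O per poll (which may matter on a network share) in exchange for not depending on notifications that can be dropped.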
This may get pushed to another sprint, as I have some higher-priority items for this sprint. If anyone has concerns, please state them here and I can reprioritize accordingly.
Dotnet file watchers don't work in a Linux container without DOTNET_USE_POLLING_FILE_WATCHER=true. I'd close this issue.
@ahmelsayed, maybe this issue got a little sidetracked.
But file watchers have been a bit unreliable even on Windows, as @brettsam said above, especially in the case of app_offline.htm, and we are seeing a decent number of such cases.
Brett had suggested offline that maybe we should move FileWatcher from JobHost to WebHost level such that even if JobHost restarts due to a file change, the FileWatcher will not restart, and hopefully not miss any file events.
Current scenario:
- File watcher is running and deployment starts
- app_offline.htm is generated
- Files are updated
- Host restarts, file watcher shuts down
- Host looks at app_offline.htm, declares that it's offline
- app_offline.htm is removed
- File watcher starts
- Host stays offline
With the fix, it should be:
- File watcher is running and deployment starts
- app_offline.htm is generated
- Files are updated
- Host restarts, but file watcher continues to detect changes
- Host looks at app_offline.htm, declares that it's offline
- app_offline.htm is removed
- WebHost catches it and restarts
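The difference between the two timelines can be replayed with a toy model. This is only an illustration of the race described above, not the host's actual logic. The function `run_deployment` and its flag are hypothetical. It assumes the delete of app_offline.htm happens while the host is still restarting.

```python
def run_deployment(watcher_survives_restart: bool) -> str:
    """Replay the deployment timeline and return the final host state."""
    watcher_running = True

    # Deployment starts: app_offline.htm is generated and files are updated.
    app_offline_exists = True

    # Host restarts because files changed. A JobHost-scoped watcher shuts
    # down with it; a WebHost-scoped watcher keeps running.
    if not watcher_survives_restart:
        watcher_running = False

    # Host looks at app_offline.htm and declares itself offline.
    host_state = "offline" if app_offline_exists else "running"

    # Deployment finishes: app_offline.htm is removed.
    app_offline_exists = False
    if watcher_running:
        # The watcher sees the delete event and triggers a restart,
        # which brings the host back online.
        host_state = "running"

    # In the JobHost-scoped case the watcher only starts now, after the
    # delete event has already happened, so nothing wakes the host up.
    watcher_running = True
    return host_state


print(run_deployment(watcher_survives_restart=False))
print(run_deployment(watcher_survives_restart=True))
```

With the watcher tied to the JobHost the model ends in "offline"; with the watcher kept alive across the restart it ends in "running".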
Moving this to Triaged as I'm not sure we're tracking this work for sprint 60. Please adjust if needed.
This issue still exists today. We can miss app_offline.htm delete events and leave a site offline until a restart, causing availability issues. Deploying via Run from Package fixes this, but not every site has moved there.
We need to revisit this. Just an initial thought (well, maybe not initial):
- While offline, don't trust the file watchers. Poll for app_offline.htm.
- If we're offline and this file doesn't exist, just restart the process to get into a good state?
- Maybe also log a warning that customers may see, which says something like "you should consider Run from Package".
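A minimal sketch of the "poll while offline" idea, written in Python for illustration only. `wait_while_offline`, its parameters, and the `restart` callback are hypothetical names, and the one-second default interval is an assumption, not a recommendation from this thread.

```python
import os
import time
from typing import Callable, Optional


def wait_while_offline(
    script_root: str,
    restart: Callable[[], None],
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> bool:
    """Poll for app_offline.htm instead of waiting for a delete notification.

    Calls restart() and returns True once the marker file is gone.
    Returns False if max_polls is reached while the marker still exists.
    """
    marker = os.path.join(script_root, "app_offline.htm")
    polls = 0
    while os.path.exists(marker):
        if max_polls is not None and polls >= max_polls:
            return False  # still offline, give up for now
        sleep(poll_interval)
        polls += 1
    # Marker is gone: restart to get back into a known good state.
    restart()
    return True
```

The `sleep` parameter is injectable so the loop can be exercised without real delays. A real implementation would also need to decide what "restart" means (host only, or the whole process).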
We already have a detector for this that suggests moving the deployment to Run from Package, but you almost need to get bitten by downtime to see it.