azure-webjobs-sdk
BlobTrigger listener fails to start on empty container in Kubernetes scenario
#### Description of the issue
This behavior is seen when the BlobTrigger user code (running inside a container) is deployed in a K8s + K4Apps (with KEDA) environment.
The user code is triggered if at least one blob is already present in the storage container when the application's first pod comes up. If the first blob is added a few minutes after that first pod terminates (i.e. after the cool-down period), the new pod that comes up in response to the added blob does not see the blobs and is never triggered.
Application pods do scale up/down correctly as blobs are added to and removed from the storage container.
#### Steps to reproduce
- Set up the K4App environment using the on-boarding guide: https://github.com/microsoft/k4apps/wiki/On-boarding-guide. Note: I am using an AKS cluster with a private registry for pulling the K4Apps and BlobTrigger container images.
- Create a storage blob container and do not upload any blobs to it.
- Deploy the function app using the shared YAML files after making the necessary changes. (I also shared the user code for the BlobTrigger, which will help in creating an image; a minimal sketch of such a function follows this list. Note: you may have to copy the regcred from the K8se-system namespace to K8se-apps for pulling the image.)
- Wait for some time so that the pod which comes up with the app deployment terminates once the cool-down period is complete.
- Add a blob to the storage container.
- A new pod comes up for the app, but its log shows no sign of the trigger executing.
- Other observations: a) The internal queue that channels new blob messages to the user code/function is not created in the storage account. This queue is created in the working scenario (i.e. when we start with at least one blob in the storage container). b) There are exceptions saying the blob listener has already started. Why is the listener started multiple times if it has already been created? This exception appears regardless of whether the BlobTrigger executes or not.
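For reference, a minimal blob-triggered function of the kind being tested looks roughly like the sketch below. The container name `samples-workitems` and the connection setting `AzureWebJobsStorage` are illustrative placeholders, not values taken from the attached user code:

```csharp
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BlobTriggerFunction
{
    // Fires when a new blob shows up in the "samples-workitems" container.
    // Container name and connection setting are placeholders for the attached user code.
    [FunctionName("BlobTrigger")]
    public static void Run(
        [BlobTrigger("samples-workitems/{name}", Connection = "AzureWebJobsStorage")] string blobContent,
        string name,
        ILogger log)
    {
        log.LogInformation($"BlobTrigger processed blob. Name: {name}, Size: {blobContent.Length} bytes");
    }
}
```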
#### Expected behavior
The BlobTrigger should fire any time a blob is uploaded to the storage container. In the failing case, the very first blob is added after deploying the app and waiting out the cool-down period.
#### Actual behavior
The BlobTrigger function does not execute if the first blob is added to the storage container only after waiting for some time (most likely, after waiting long enough for the first pod to terminate once the coolDownPeriod has elapsed).
#### Known workarounds
None known. A probable workaround is to make sure there is at least one blob in the storage container before deploying the application (see the sketch below).
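A minimal sketch of that workaround, using the Azure.Storage.Blobs SDK to drop a placeholder blob into the container before deploying. The connection-string setting and container name are placeholders for whatever the deployment actually uses:

```csharp
using System;
using System.IO;
using System.Text;
using Azure.Storage.Blobs;

class SeedContainer
{
    static void Main()
    {
        // Placeholders: adjust the connection-string setting and container name for your setup.
        string connectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage");
        var container = new BlobContainerClient(connectionString, "samples-workitems");

        container.CreateIfNotExists();

        // Upload a tiny placeholder so the container is never empty when the app is deployed.
        using var content = new MemoryStream(Encoding.UTF8.GetBytes("placeholder"));
        container.UploadBlob("placeholder.txt", content);
    }
}
```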
Attachments: Events_K8se_Apps.txt, exceptions.txt, Log_App_Contoller_2.txt, Log_App_Controller_1.txt, Log_first_POD.txt, Log_second_POD.txt, logs_keda_Operator.txt
#### Related information
- Logs for the first pod.
- Logs for the second pod, which comes up after adding the first blob to the storage container.
- YAML files (attached after removing the storage keys).
- User code for the BlobTrigger.
- Events for the K8se-apps namespace.
- KEDA operator logs.
- Exceptions from the logs of the pod running the function host/user code.
@pragnagopa Please help us with triaging this issue.
@AnatoliB @fabiocav FYI.
Hi, I found the following exception in the logs of the first pod. It looks like it happens here: https://github.com/Azure/azure-functions-host/blob/dev/src/WebJobs.Script/Host/Kubernetes/KubernetesClient.cs#L64. If I remember correctly, it is taking a lock and the request is refused. That smells like a configuration issue or a problem in the Azure Functions Host code. @divyagandhisethi probably has more context on it.
The listener for function 'BlobTrigger' was unable to start.
Microsoft.Azure.WebJobs.Host.Listeners.FunctionListenerException: The listener for function 'BlobTrigger' was unable to start.
---> System.Net.Http.HttpRequestException: Connection refused
---> System.Net.Sockets.SocketException (111): Connection refused
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at Microsoft.Azure.WebJobs.Script.KubernetesClient.TryAcquireLock(String lockId, String ownerId, TimeSpan lockPeriod, CancellationToken cancellationToken) in /src/azure-functions-host/src/WebJobs.Script/Host/Kubernetes/KubernetesClient.cs:line 64
at Microsoft.Azure.WebJobs.Script.KubernetesDistributedLockManager.TryLockAsync(String account, String lockId, String lockOwnerId, String proposedLeaseId, TimeSpan lockPeriod, CancellationToken cancellationToken) in /src/azure-functions-host/src/WebJobs.Script/Host/Kubernetes/KubernetesDistributedLockManager.cs:line 54
at Microsoft.Azure.WebJobs.Host.SingletonManager.TryLockAsync(String lockId, String functionInstanceId, SingletonAttribute attribute, CancellationToken cancellationToken, Boolean retry) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonManager.cs:line 113
at Microsoft.Azure.WebJobs.Host.Listeners.SingletonListener.StartAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonListener.cs:line 48
at Microsoft.Azure.WebJobs.Host.Listeners.CompositeListener.StartAsync(CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Listeners\CompositeListener.cs:line 39
at Microsoft.Azure.WebJobs.Host.Listeners.FunctionListener.StartAsync(CancellationToken cancellationToken, Boolean allowRetry) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Listeners\FunctionListener.cs:line 68
--- End of inner exception stack trace ---
We need to add a try/catch to KubernetesClient.TryAcquireLock, to be consistent with BlobLeaseDistributedLockManager.TryAcquireLeaseAsync: if we are not able to acquire the lock, the method should return null instead of throwing.
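A minimal sketch of that change, assuming the method talks to an in-cluster lock service over HTTP and returns a lock handle. The `KubernetesLockHandle` shape, the `_httpClient` field, and the lock URI format below are placeholders, not the actual host implementation; only the method parameters come from the stack trace above:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the handle type the distributed lock manager expects.
public class KubernetesLockHandle
{
    public string LockId { get; set; }
    public string Owner { get; set; }
}

public class KubernetesClientSketch
{
    private readonly HttpClient _httpClient = new HttpClient();
    private readonly Uri _lockServiceBaseUri; // assumed in-cluster endpoint that brokers the lock

    public KubernetesClientSketch(Uri lockServiceBaseUri) => _lockServiceBaseUri = lockServiceBaseUri;

    // Parameters match the stack trace. The HTTP call is wrapped in try/catch so a transport
    // failure (e.g. "Connection refused") is reported as "lock not acquired" (null) instead of
    // escaping and preventing the function listener from starting.
    public async Task<KubernetesLockHandle> TryAcquireLock(
        string lockId, string ownerId, TimeSpan lockPeriod, CancellationToken cancellationToken)
    {
        try
        {
            var uri = new Uri(_lockServiceBaseUri,
                $"lock?name={lockId}&owner={ownerId}&duration={(int)lockPeriod.TotalSeconds}");
            using var response = await _httpClient.PostAsync(uri, content: null, cancellationToken);
            if (!response.IsSuccessStatusCode)
            {
                return null; // the lease is held by another pod, or the request was rejected
            }
            return new KubernetesLockHandle { LockId = lockId, Owner = ownerId };
        }
        catch (HttpRequestException)
        {
            // Mirror BlobLeaseDistributedLockManager.TryAcquireLeaseAsync: report failure as null.
            return null;
        }
    }
}
```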
Looking into it.
This issue happens because the new pod that comes up after the scale-up operation (0 -> 1 of the BlobTrigger deployment) on adding a new blob to the container is not able to acquire a lease: a stale lease from the terminated pod is still lying around, so the new pod thinks some other pod is holding it. The old pod was not able to release the lease while it was terminating. So either the lease duration should be reduced, so that the lease is not renewed during pod termination and, because of its shorter tenure, is released automatically; or the lease could be released explicitly during pod termination by handling the SIGTERM signal (though there can be cases where the cleanup code is never reached, e.g. when the process panics).
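A rough sketch of the second option, assuming the host (or a wrapper around it) can hook process shutdown. `ReleaseLease` below is a hypothetical helper standing in for whatever actually gives up the distributed lease, and, as noted above, none of this runs if the process dies without executing its shutdown handlers:

```csharp
using System;
using System.Runtime.Loader;
using System.Threading;

class GracefulLeaseRelease
{
    static void Main()
    {
        using var shutdown = new ManualResetEventSlim(false);

        // Fires on SIGTERM when Kubernetes begins terminating the pod.
        AssemblyLoadContext.Default.Unloading += context =>
        {
            ReleaseLease();   // best-effort cleanup so the next pod can take the lease immediately
            shutdown.Set();
        };

        // Handles Ctrl+C during local testing.
        Console.CancelKeyPress += (sender, e) =>
        {
            e.Cancel = true;
            ReleaseLease();
            shutdown.Set();
        };

        Console.WriteLine("Pod running; waiting for a shutdown signal...");
        shutdown.Wait();
    }

    static void ReleaseLease()
    {
        // Hypothetical: ask the distributed lock manager to release/break the lease held by this pod.
        Console.WriteLine("Releasing the distributed lease before termination.");
    }
}
```

Reducing the lease duration (the first option) avoids relying on shutdown handlers at all, at the cost of more frequent renewals while a pod is alive.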