Function-controller's process stuck for a few minutes every few reconciled events
Description
We've observed on a few clusters that the function-controller reconciles functions with a very long delay. The interval is not deterministic: on the first cluster the controller started reconciling functions every 6 minutes, while on the last one the interval was only 2 minutes. We also found clusters where reconciliation happened every 3 or 5 minutes.
It looks as if the function-controller processes events from Kubernetes only at fixed points in time.
Example logs from a cluster where functions are reconciled every 2 minutes:
...
{"level":"INFO","timestamp":"2022-09-21T15:24:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:26:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:28:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:30:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:32:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
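To measure this interval on an affected cluster without eyeballing the logs, the timestamps of the next state entries can be diffed directly from the controller's log stream. Below is a small helper sketch (not part of Kyma) that reads the JSON log lines from stdin, for example piped from kubectl logs of the controller pod:

```go
// logdelta.go: prints the gap between consecutive "next state" log entries.
// Usage sketch: kubectl logs <function-controller pod> -n kyma-system | go run logdelta.go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// entry models only the log fields used here (see the log lines above).
type entry struct {
	Timestamp time.Time `json:"timestamp"`
	Message   string    `json:"message"`
}

func main() {
	var prev time.Time
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // JSON log lines can be long
	for sc.Scan() {
		var e entry
		if json.Unmarshal(sc.Bytes(), &e) != nil || e.Message != "next state" {
			continue // skip non-JSON or unrelated lines
		}
		if !prev.IsZero() {
			fmt.Printf("%s  +%s\n", e.Timestamp.Format(time.RFC3339), e.Timestamp.Sub(prev))
		}
		prev = e.Timestamp
	}
}
```

On an affected cluster the printed deltas cluster around one value (for example +2m0s), instead of the irregular gaps you would expect from normal event-driven reconciliation.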
Expected result
The function-controller should reconcile functions without additional delay.
Actual result
A function jumps from one condition to the next at a fixed interval (2 minutes in the example below; the interval differs between clusters):
After creating the function
kubectl get function
NAME CONFIGURED BUILT RUNNING RUNTIME VERSION AGE
function-musing-filip2 nodejs14 1 2s
After 2 minutes
kubectl get function
NAME CONFIGURED BUILT RUNNING RUNTIME VERSION AGE
function-musing-filip2 True nodejs14 1 2m
After another 2 minutes
kubectl get function
NAME CONFIGURED BUILT RUNNING RUNTIME VERSION AGE
function-musing-filip2 True True nodejs14 1 4m
And another 2 minutes
kubectl get function
NAME CONFIGURED BUILT RUNNING RUNTIME VERSION AGE
function-musing-filip2 True True True nodejs14 1 6m
Steps to reproduce
There are no reliable reproduction steps: the issue occurs randomly on a few clusters and does not depend on the cluster or Kyma configuration.
Troubleshooting
From my observations, killing the function-controller's pod fixes the problem, which means that debugging the controller on an affected cluster is almost impossible.
My observations:
I think the problem lies somewhere between our code and the library we use to create the controller and to queue incoming requests. By scraping metrics from the function-controller I observed that the controller's queue is never empty, while the workqueue_longest_running_processor_seconds metric keeps growing up to 120 (2 minutes); after that the queue depth decreases and the counter starts again from 0 towards 120. At the moment the longest-running-processor time resets to 0, the controller reconciles again and we can see changes in the cluster.
Port-forward the controller's service:
kubectl port-forward -n kyma-system svc/serverless-controller-manager 8080:8080
Curl the metrics endpoint:
curl "http://localhost:8080/metrics"
...
# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{name="function-controller"} 8
...
# HELP workqueue_longest_running_processor_seconds How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge
workqueue_longest_running_processor_seconds{name="function-controller"} 98.437811049
...
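To see the pattern described above over time rather than from single curl snapshots, the two gauges can be polled periodically. A minimal sketch (not part of Kyma) that assumes the port-forward above is running and uses the Prometheus text-format parser:

```go
// metricswatch.go: polls workqueue_depth and
// workqueue_longest_running_processor_seconds for the function-controller queue.
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/common/expfmt"
)

// gauge scrapes the endpoint and returns the gauge value of the given metric
// family for the workqueue with the given "name" label.
func gauge(url, family, queue string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	families, err := (&expfmt.TextParser{}).TextToMetricFamilies(resp.Body)
	if err != nil {
		return 0, err
	}
	mf, ok := families[family]
	if !ok {
		return 0, fmt.Errorf("metric family %s not found", family)
	}
	for _, m := range mf.GetMetric() {
		for _, l := range m.GetLabel() {
			if l.GetName() == "name" && l.GetValue() == queue {
				return m.GetGauge().GetValue(), nil
			}
		}
	}
	return 0, fmt.Errorf("no %s series for queue %q", family, queue)
}

func main() {
	const url = "http://localhost:8080/metrics"
	for {
		depth, err1 := gauge(url, "workqueue_depth", "function-controller")
		longest, err2 := gauge(url, "workqueue_longest_running_processor_seconds", "function-controller")
		if err1 != nil || err2 != nil {
			fmt.Println("scrape failed:", err1, err2)
		} else {
			// On an affected cluster: depth stays above zero while the longest
			// running processor climbs towards ~120 s and then drops back to 0.
			fmt.Printf("depth=%v longest_running_processor=%vs\n", depth, longest)
		}
		time.Sleep(10 * time.Second)
	}
}
```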
Conclusions:
- Something is blocking the process and we have to figure out what it is.
- The function-controller should have more logs, because we don't even know whether the problem occurs at the beginning, at the end, or in the middle of reconciliation.
- One possible reason is the health checking described here.
- A second possible reason is the filter for incoming requests, which may block too much, or may fail to block requests, and in doing so block the process (here).
- A third possible reason is the workqueue configuration described here.
The issue is caused by this line. It occurs when the health checker sends an event and then returns without waiting for the Reconcile loop to respond with true on the health channel (probably because of a combination of the checker's timeout and an overloaded Reconciler queue). The reconciler then tries to push true into the channel, but no one is waiting for this communication, so the send blocks.
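The blocking pattern can be reproduced in isolation with a short Go program. This is only an illustrative sketch of the interaction described above, not the actual Kyma code; the channel names, the single-worker setup, and the timing values are made up:

```go
// Sketch of the suspected failure mode: a health checker that gives up on its
// probe leaves the reconcile worker blocked on an unbuffered channel send.
package main

import (
	"fmt"
	"time"
)

func main() {
	events := make(chan struct{}) // simplified stand-in for the reconcile queue
	healthCh := make(chan bool)   // unbuffered: a send blocks until someone receives

	// Single reconcile worker: handles one event at a time.
	go func() {
		for range events {
			time.Sleep(50 * time.Millisecond) // "reconciliation work" (queue overload)
			// Answer the health probe. The checker has already timed out and
			// returned, so this send has no receiver and blocks the worker.
			healthCh <- true
		}
	}()

	// Health checker: enqueue a probe event, then wait only briefly for the answer.
	events <- struct{}{}
	select {
	case ok := <-healthCh:
		fmt.Println("healthy:", ok)
	case <-time.After(10 * time.Millisecond):
		fmt.Println("health check timed out; checker returns and stops listening")
	}

	// The worker is now stuck on `healthCh <- true` and cannot take new events.
	select {
	case events <- struct{}{}:
		fmt.Println("worker accepted the next event")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("worker is blocked; the next event is never processed")
	}
}
```

If, in the real controller, it is the next health probe's receive that eventually unblocks the worker, that would also explain the fixed-length stalls and the workqueue_longest_running_processor_seconds pattern shown above.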