
Function-controller's process stuck for a few minutes every few reconciled events

Open pPrecel opened this issue 2 years ago • 2 comments

Description

We've observed, on a few clusters, situations where the function-controller takes a really long time to reconcile functions. The interval is not deterministic: on the first cluster the controller only started reconciling functions every 6 minutes, while on the last cluster the interval was only 2 minutes. We also found clusters where reconciliation happened every 3 or 5 minutes.

It looks as if the function-controller only processes events from Kubernetes at a fixed interval.

Example logs from a cluster where functions are reconciled every 2 minutes:

...
{"level":"INFO","timestamp":"2022-09-21T15:24:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:26:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:28:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:30:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...
{"level":"INFO","timestamp":"2022-09-21T15:32:51Z","logger":"controllers.function","caller":"serverless/fsm.go:67","message":"next state","context":{"kind":"Function","name":"function-musing-filip2","namespace":"default","version":1,"stateFn":"stateFnInitialize"}}
...

Expected result

The function-controller should reconcile functions without additional delay.

Actual result

The symptom is that a Function only advances to the next condition once per some fixed interval (for example every 2 minutes):

After creating the function

kubectl get function

NAME                     CONFIGURED   BUILT   RUNNING   RUNTIME    VERSION   AGE
function-musing-filip2                                  nodejs14   1         2s

After 2 minutes

kubectl get function

NAME                     CONFIGURED   BUILT   RUNNING   RUNTIME    VERSION   AGE
function-musing-filip2   True                           nodejs14   1         2m

After another 2 minutes

kubectl get function

NAME                     CONFIGURED   BUILT   RUNNING   RUNTIME    VERSION   AGE
function-musing-filip2   True         True              nodejs14   1         4m

And another 2 minutes

kubectl get function

NAME                     CONFIGURED   BUILT   RUNNING   RUNTIME    VERSION   AGE
function-musing-filip2   True         True    True      nodejs14   1         6m

Steps to reproduce

It's not possible to reliably reproduce this issue: it occurs randomly on a few clusters and does not depend on the cluster/Kyma configuration.

Troubleshooting

From my observations, killing the function-controller's pod fixes the problem. This means that debugging the controller on an affected cluster is almost impossible.

pPrecel avatar Sep 22 '22 08:09 pPrecel

My observations:

I think the problem may be somewhere between our code and the library we use to create the controller and queue incoming requests. By scraping metrics from the function-controller, I observed that the controller's queue is never empty, and at the same time the workqueue_longest_running_processor_seconds metric keeps increasing up to a value of 120 (2 min); after that the queue depth decreases and the counter starts climbing from 0 towards 120 again. At the moment the longest-running-processor time resets to 0, the controller reconciles again and we can see changes in the cluster.

Port-forward the controller's service:

kubectl port-forward -n kyma-system svc/serverless-controller-manager 8080:8080

Curl the metrics:

curl "http://localhost:8080/metrics"

...
# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{name="function-controller"} 8
...
# HELP workqueue_longest_running_processor_seconds How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge
workqueue_longest_running_processor_seconds{name="function-controller"} 98.437811049
...
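To make the 0→120 sawtooth visible over time, a small poller can record the metric every few seconds. This is a hypothetical helper, not part of Kyma; it assumes the port-forward above is still running and that the metric name matches the output above:

// metricwatch.go — hypothetical helper, not part of Kyma. It samples the
// workqueue metric every 5 seconds so the 0→120s sawtooth becomes visible.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	const metric = `workqueue_longest_running_processor_seconds{name="function-controller"}`
	for {
		resp, err := http.Get("http://localhost:8080/metrics")
		if err != nil {
			fmt.Println("scrape failed:", err)
			time.Sleep(5 * time.Second)
			continue
		}
		scanner := bufio.NewScanner(resp.Body)
		for scanner.Scan() {
			line := scanner.Text()
			if strings.HasPrefix(line, metric) {
				// Print timestamp and the current value of the gauge.
				fmt.Println(time.Now().Format(time.RFC3339), strings.TrimSpace(strings.TrimPrefix(line, metric)))
			}
		}
		resp.Body.Close()
		time.Sleep(5 * time.Second)
	}
}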

Conclusions:

  • Something is blocking the process and we have to figure out what it is.
  • The function-controller should have more logs because we don't even know whether the problem occurs at the beginning, at the end, or during the reconciliation (see the sketch after this list).
  • A possible reason would be the health checking described here.
  • A second possible reason would be the incoming-requests filter, which may block too much or may not block requests, which in turn blocks the process (here).
  • A third possible reason would be the workqueue configuration described here.
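
Regarding the missing logs, a minimal sketch of one way to narrow this down, assuming access to the reconciler setup code. The loggingReconciler wrapper and its names are hypothetical, not the actual function-controller code; it only logs entry, exit, and duration of each Reconcile call, so it becomes visible whether the stall happens before, during, or after reconciliation:

// loggingReconciler is a hypothetical wrapper, not the actual function-controller
// code. It logs entry, exit, and duration of every Reconcile call.
package serverless

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type loggingReconciler struct {
	inner reconcile.Reconciler // the existing Function reconciler
}

func (r *loggingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)
	start := time.Now()
	logger.Info("reconcile start", "request", req.String())
	res, err := r.inner.Reconcile(ctx, req)
	logger.Info("reconcile end", "request", req.String(), "duration", time.Since(start).String(), "error", err)
	return res, err
}

Wiring it in would only require wrapping the existing reconciler where it is registered with the manager.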

pPrecel avatar Sep 22 '22 08:09 pPrecel

[Screenshot 2022-09-29 at 11:27:44]

kwiatekus avatar Sep 29 '22 10:09 kwiatekus

The issue is caused by this line. It occurs when the health checker sends an event and then returns without waiting for Reconcile to respond with true on the health channel (probably because of a combination of the checker timeout and Reconciler queue overload). The reconciler then tries to push true to the channel, but nobody is waiting on the other side, so it blocks.
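
A minimal sketch of the interaction described above, with assumed names and timings (not the actual Kyma code): healthCh is unbuffered, so once the checker times out and leaves, the reconciler's send blocks until the next health check drains the channel, which would explain why reconciliation only resumes once per checker period.

// Minimal reproduction of the suspected deadlock — assumed names, not the
// actual function-controller code. healthCh is unbuffered, so a send blocks
// until somebody receives.
package main

import (
	"fmt"
	"time"
)

func main() {
	healthCh := make(chan bool) // assumed: unbuffered health channel

	// Reconciler worker: processes the health event and reports back on healthCh.
	go func() {
		for {
			// Simulate the reconciler picking up the health event late,
			// e.g. because the workqueue is full of Function requests.
			time.Sleep(3 * time.Second)
			fmt.Println(time.Now().Format("15:04:05"), "reconciler: sending health result")
			healthCh <- true // blocks: the checker already timed out and left
			fmt.Println(time.Now().Format("15:04:05"), "reconciler: unblocked, reconciling again")
		}
	}()

	// Health checker: triggers the event, waits only briefly for the answer, leaves.
	for {
		select {
		case <-healthCh:
			// Note: this also drains a stale result from the previous round,
			// which is the moment the blocked reconciler gets unblocked.
			fmt.Println(time.Now().Format("15:04:05"), "checker: healthy")
		case <-time.After(time.Second): // checker timeout shorter than reconcile latency
			fmt.Println(time.Now().Format("15:04:05"), "checker: timeout, returning without receiving")
		}
		time.Sleep(10 * time.Second) // checker period; ~2 minutes on the affected clusters
	}
}

Running the sketch shows the reconciler goroutine blocked on the send most of the time and only making progress once per checker period, which matches the periodic reconcile bursts observed above.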

pPrecel avatar Oct 21 '22 16:10 pPrecel