function-mesh
Improve the stability of autoscaling when HPA is enabled
Fixes #444
Motivation
When HPA is enabled, FunctionMesh reacts to too many events: changes to HPA.status and StatefulSet.status keep triggering new reconciliation processes, and the updates produced by those reconciliations trigger further reconciliations in turn. Workloads are also updated even when nothing substantive has changed. Together this makes autoscaling unstable; this PR reduces that churn.
Modifications
- add `spec.minReplicas` to indicate the minimum number of replicas for the workloads
- change `spec.replicas` to an optional field
- no longer listen for `statefulSet.status` change events (see the sketch after this list)
- optimize the conditions for determining whether a resource needs to be created or updated, so that frequent updates without substantive changes are no longer issued
- update CRDs (also in the helm charts)
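To illustrate the event-filtering item above, here is a minimal sketch (not the actual diff; the Function API import path and the controller wiring are assumptions) of how controller-runtime predicates can drop status-only updates, so that StatefulSet.status changes no longer trigger reconciliation:

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Assumed import path for the Function CRD types.
	"github.com/streamnative/function-mesh/api/compute/v1alpha1"
)

func setupFunctionController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Reconcile the Function only when its spec (generation) changes,
		// not on every status update written by the controller or the HPA.
		For(&v1alpha1.Function{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		// StatefulSet.status updates do not bump metadata.generation, so this
		// predicate filters them out; only spec changes reach the reconciler.
		Owns(&appsv1.StatefulSet{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}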
Verifying this change
- [ ] Make sure that the change passes the CI checks.
Documentation
Check the box below.
Need to update docs?
- [x] doc-required (If you need help on updating docs, create a doc issue)
- [ ] no-need-doc (Please explain why)
- [ ] doc (If this PR contains doc changes)
Simple case:
- Create a Function with a configuration like the one below:
spec:
  minReplicas: 1
  maxReplicas: 10
  pod:
    autoScalingMetrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    autoScalingBehavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
        selectPolicy: Max
- Before the first period of the HPA begins, the Function's `spec.replicas` is the same as `spec.minReplicas` and is passed to the StatefulSet's `spec.replicas`.
- When HPA triggers autoscaling, such as scaling the Function to 2 replicas:
  - the Function's `spec.replicas` is changed to 2 by HPA
  - the increase of the Function's generation triggers an update of the StatefulSet in the reconciliation logic, changing its `spec.replicas` to 2
- At this point, if you apply the Function again (using the configuration above), the Function does not trigger a new reconciliation process (see the sketch below).
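Below is a minimal sketch (assumed helper and names, not the operator's actual code) of the create-or-update decision referenced above: the StatefulSet is only updated when the desired spec actually differs from what is deployed, so re-applying an unchanged Function results in no write at all.

package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createOrUpdateStatefulSet issues a CREATE when the StatefulSet is missing and
// an UPDATE only when the desired spec differs from the deployed one.
func createOrUpdateStatefulSet(ctx context.Context, c client.Client, desired *appsv1.StatefulSet) error {
	existing := &appsv1.StatefulSet{}
	err := c.Get(ctx, types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}, existing)
	switch {
	case apierrors.IsNotFound(err):
		// No StatefulSet yet: CREATE.
		return c.Create(ctx, desired)
	case err != nil:
		return err
	case equality.Semantic.DeepEqual(existing.Spec.Replicas, desired.Spec.Replicas) &&
		equality.Semantic.DeepEqual(existing.Spec.Template, desired.Spec.Template):
		// Nothing substantive changed: skip the UPDATE to avoid needless churn.
		return nil
	default:
		// Substantive change (e.g. replicas bumped by HPA): UPDATE in place.
		existing.Spec.Replicas = desired.Spec.Replicas
		existing.Spec.Template = desired.Spec.Template
		return c.Update(ctx, existing)
	}
}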
We have deployed the code in the branch; this is the error message we get:
This error means that the previous update has not been completed in this reconciliation. In most cases, this can be considered a warning.
Can you describe which resources are currently not working as expected, and how? And can you paste the current configurations?
What we experienced was that the StatefulSet replicas constantly dropped to 0, and we observed a constant cycle of pods shutting down and being re-instantiated. At first we thought it was actually processing data and that the cycle was triggered every time there was a new message in the topic the functions were subscribed to, so to test it we shut down all the traffic, but the behavior was still there. Then I checked the function-mesh operator logs and saw this error message.
This is consistent with what I observed prior to fixing this issue, which was mainly caused by:
- FunctionMesh listens to too many events (HPA, StatefulSet), including events generated by changes in HPA.status and StatefulSet.status; these constantly trigger new reconciliation processes, and the changes produced by each new reconciliation trigger further reconciliations in turn.
- There is currently an issue with FunctionMesh: when a Function/Sink/Source is automatically scaled by HPA, all pods are rescheduled every time, which causes the metrics (CPU/memory) to go up over time, and the HPA then changes the number of replicas of the Function/Sink/Source frequently.
For reason 1, I filtered the events generated by StatefulSet.status in this PR and refined the reconciliation trigger conditions for FunctionMesh.
For reason 2, I suggest increasing the `stabilizationWindowSeconds` of the HPA; see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#stabilization-window
Can you paste the configuration of `spec.pod.autoScalingBehavior` and `spec.pod.autoScalingMetrics`?
And for resources that don't need HPA, you can remove `spec.maxReplicas`.
At this point, all our functions and sinks need scaling in our function mesh.
autoScalingMetrics:
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 60
This is currently it, but we can define an autoscaling behavior if necessary. Do you have benchmark values in mind, so we can give the scale-up and scale-down parameters a shot?
My preferred configuration is as follows.
pod:
  autoScalingMetrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80 # I have observed that the CPU usage of the pod stays around 57% when idle
  autoScalingBehavior:
    scaleDown:
      stabilizationWindowSeconds: 120
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
      - type: Percent
        value: 50
        periodSeconds: 15
      - type: Pods
        value: 2
        periodSeconds: 15
      selectPolicy: Max
By default, `autoScalingBehavior.scaleUp.stabilizationWindowSeconds` has a value of 0, which means that as soon as the HPA's scaling algorithm meets a condition (e.g., CPU usage exceeds 80%), the HPA will immediately increase the number of pods. Because of this issue, every time the number of replicas of a workload changes, all pods are rescheduled (i.e., rebuilt), which causes the CPU usage metric to skyrocket, which in turn keeps increasing the HPA's desired replica count.
I suggest setting `autoScalingBehavior.scaleUp.stabilizationWindowSeconds` to 120 or another reasonable value, which will make the HPA less sensitive.
In addition, configurations like `autoScalingBehavior.scaleDown.policies` and `autoScalingBehavior.scaleUp.policies` can also be used to control the magnitude of scaling.
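To give a rough sense of how these policies bound each step (based on the Kubernetes HPA documentation, not on measurements from a real cluster): with the configuration above and 4 current replicas, scaleUp may add at most max(50% of 4 = 2 pods, 2 pods) = 2 pods per 15-second period once the 120-second stabilization window allows it, while scaleDown with Percent 100 may remove all surplus replicas in a single 15-second period after its own 120-second window.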
add `spec.minReplicas` to indicate the minimum number of replicas for the workloads; change `spec.replicas` to an optional field
why not use `spec.replicas` as the default minReplicas?
`spec.replicas` should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"
This flow is not working with ArgoCD. HPA can't scale when the StatefulSet has a fixed "replicas" count. To work with ArgoCD, the replicas field should be left empty so that HPA can scale the StatefulSet. You can refer to this documentation for more detailed information.
It seems like setting autoScalingBehavior fixed the issue mentioned above, but as you mentioned, at this time all pods are rescheduled when HPA scales down or up. I know you put a lot of effort into this work and I am grateful to you, and I'm also very excited to see this feature in an upcoming release.
@nlu90 The `targetRef` of the HPA is the Function/Sink/Source, so when HPA is enabled, `spec.replicas` of the Function/Sink/Source will be controlled by the HPA. That is why we need a new field that is not affected by the HPA, i.e. `spec.minReplicas`.
Also, the webhook will help the user maintain the rule `minReplicas <= replicas <= maxReplicas` and will set initial values for `spec.replicas` and `spec.minReplicas` when they are empty.
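For illustration, a rough sketch of that defaulting and validation rule (the type and field names below are assumptions, not the actual webhook code):

package sketch

import "fmt"

// FunctionSpec mirrors only the three replica-related fields discussed above.
type FunctionSpec struct {
	MinReplicas *int32 // optional: minimum number of replicas for the workload
	Replicas    *int32 // optional: initial number of replicas, controlled by HPA afterwards
	MaxReplicas *int32 // optional: setting this enables HPA
}

func int32Ptr(v int32) *int32 { return &v }

// SetDefaults fills empty fields so the invariant below can hold:
// replicas starts out equal to minReplicas when it is not set.
func SetDefaults(spec *FunctionSpec) {
	if spec.MinReplicas == nil {
		spec.MinReplicas = int32Ptr(1)
	}
	if spec.Replicas == nil {
		spec.Replicas = int32Ptr(*spec.MinReplicas)
	}
}

// Validate enforces minReplicas <= replicas <= maxReplicas
// (the maxReplicas bound only applies when HPA is enabled).
func Validate(spec *FunctionSpec) error {
	if *spec.MinReplicas > *spec.Replicas {
		return fmt.Errorf("minReplicas (%d) must not exceed replicas (%d)", *spec.MinReplicas, *spec.Replicas)
	}
	if spec.MaxReplicas != nil && *spec.Replicas > *spec.MaxReplicas {
		return fmt.Errorf("replicas (%d) must not exceed maxReplicas (%d)", *spec.Replicas, *spec.MaxReplicas)
	}
	return nil
}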
@alperencelik Thanks. Per the milestone, this change will be included in the mid-September release, v0.6.0.
@tpiperatgod, somehow the issue is back. The functions cannot scale; in fact they get deleted as soon as they scale. We observe this in the kube-controller-manager logs, where every successful create is followed by a successful delete. The previous error from the function controller is thrown frequently, and in the kube-scheduler we also observe that volume mounts are erroring out constantly for the new pods that are being created, as follows:
E0827 00:04:44.373395 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.373456 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.389374 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 00:33:34.582859 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:33:34.582923 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:35:21.047823 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:35:21.047877 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643549 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643628 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 09:39:13.643661 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 09:39:13.659803 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:53:04.134059 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.134104 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.171206 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:56:33.602585 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.602650 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.637281 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 10:43:47.118786 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 10:43:47.118845 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 10:43:47.118864 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 10:43:47.151607 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 11:10:05.507240 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.507312 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.542283 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 16:50:17.789041 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 17:36:47.796489 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 23:07:54.580377 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:07:54.580431 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:07:54.580447 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:07:54.606340 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:19:56.126715 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.126767 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.133013 1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-0.170f56f49cdd8c10" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0827 23:41:13.876819 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:13.876860 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:41:13.876873 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:41:13.878275 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:41:44.039050 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:44.039089 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242083 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242149 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.254107 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:06:50.788980 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.789034 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.801204 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:09:21.319015 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:21.319052 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834663 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834705 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.868887 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:15:37.074735 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.074813 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.076709 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:25:53.094399 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.094457 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.119481 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:47:17.269534 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 01:47:17.269590 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 03:41:43.951020 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.951061 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.969157 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 03:49:44.570421 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:49:44.570477 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845382 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845430 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.870361 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 04:40:50.285779 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.285822 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.298786 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 05:40:10.830900 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 05:40:10.830939 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 05:40:10.830951 1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 05:40:10.832292 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 08:08:26.497729 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.497986 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.506780 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 09:06:16.771165 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 09:06:16.771209 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929578 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929654 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.934027 1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-3.170f7b3012f49f86" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0828 10:45:10.646185 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:45:10.646229 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037715 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037777 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.955955 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.956012 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.993038 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 12:14:02.968369 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:14:02.968418 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 12:14:02.968432 1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 12:14:02.969748 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 13:00:41.608169 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:00:41.608220 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:21:23.711091 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 13:21:23.711137 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279071 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279115 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962893 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962960 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
Also, please refer to the error event.
Hi @armangurkan, can you show the status of the HPA? I have tested this in my cluster and found that the cause is still the HPA trigger condition.
And I don't think this PR will solve the root cause; it will only improve the stability of the reconciliation (it also depends on the configuration of the function's HPA).
@tpiperatgod The weird thing is that, as a result of the colliding deployments, the autoscaling worked for some reason, and I have tried every possible combination but cannot get it to a working state. Do you have a Slack channel you could invite me to? This feature is very important for us, and we are happy to contribute as a team once we get into the details of the project over the Slack channel.
Welcome, please wait for me to find a suitable Slack channel.
You can also join this: https://pulsar.apache.org/community/