function-mesh
Improve the stability of autoscaling when HPA is enabled
Fixes #444
Motivation
When HPA is enabled, FunctionMesh reacts to too many events: changes to HPA.status and StatefulSet.status keep triggering new reconciliation processes, and the updates produced by those reconciliations trigger further reconciliations in turn. Workloads are also updated even when nothing substantive has changed. Together this makes autoscaling unstable; this PR reduces that churn.
Modifications
- add `spec.minReplicas` to indicate the minimum number of replicas for the workloads
- change `spec.replicas` to an optional field
- no longer listen for `statefulSet.status` change events (see the sketch after this list)
- optimize the conditions for determining whether a resource needs to be created or updated, so that frequent updates without substantive changes are no longer issued
- update CRDs (also in the helm charts)
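To illustrate the event-filtering item above, here is a minimal sketch (not the actual diff; the Function API import path and the controller wiring are assumptions) of how controller-runtime predicates can drop status-only updates, so that StatefulSet.status changes no longer trigger reconciliation:

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Assumed import path for the Function CRD types.
	"github.com/streamnative/function-mesh/api/compute/v1alpha1"
)

func setupFunctionController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Reconcile the Function only when its spec (generation) changes,
		// not on every status update written by the controller or the HPA.
		For(&v1alpha1.Function{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		// StatefulSet.status updates do not bump metadata.generation, so this
		// predicate filters them out; only spec changes reach the reconciler.
		Owns(&appsv1.StatefulSet{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}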
Verifying this change
- [ ] Make sure that the change passes the CI checks.
Documentation
Check the box below.
Need to update docs?
- [x] doc-required (If you need help on updating docs, create a doc issue)
- [ ] no-need-doc (Please explain why)
- [ ] doc (If this PR contains doc changes)
Simple case:
- Create a Function with a configuration like the one below:
spec:
  minReplicas: 1
  maxReplicas: 10
  pod:
    autoScalingMetrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    autoScalingBehavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
        selectPolicy: Max
- Before the first period of the HPA begins, the Function's `spec.replicas` is the same as `spec.minReplicas` and is passed to the StatefulSet's `spec.replicas`.
- When HPA triggers autoscaling, such as scaling the Function to 2 replicas:
  - the Function's `spec.replicas` is changed to 2 by HPA
  - the increase of the Function's generation triggers an update of the StatefulSet in the reconciliation logic, changing its `spec.replicas` to 2
- At this point, if you apply the Function again (using the configuration above), the Function does not trigger a new reconciliation process (see the sketch below).
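Below is a minimal sketch (assumed helper and names, not the operator's actual code) of the create-or-update decision referenced above: the StatefulSet is only updated when the desired spec actually differs from what is deployed, so re-applying an unchanged Function results in no write at all.

package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createOrUpdateStatefulSet issues a CREATE when the StatefulSet is missing and
// an UPDATE only when the desired spec differs from the deployed one.
func createOrUpdateStatefulSet(ctx context.Context, c client.Client, desired *appsv1.StatefulSet) error {
	existing := &appsv1.StatefulSet{}
	err := c.Get(ctx, types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}, existing)
	switch {
	case apierrors.IsNotFound(err):
		// No StatefulSet yet: CREATE.
		return c.Create(ctx, desired)
	case err != nil:
		return err
	case equality.Semantic.DeepEqual(existing.Spec.Replicas, desired.Spec.Replicas) &&
		equality.Semantic.DeepEqual(existing.Spec.Template, desired.Spec.Template):
		// Nothing substantive changed: skip the UPDATE to avoid needless churn.
		return nil
	default:
		// Substantive change (e.g. replicas bumped by HPA): UPDATE in place.
		existing.Spec.Replicas = desired.Spec.Replicas
		existing.Spec.Template = desired.Spec.Template
		return c.Update(ctx, existing)
	}
}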
We have deployed the code in the branch; this is the error message we get:
This error means that the previous update has not been completed in this reconciliation. In most cases, this can be considered a warning.
Can you describe which resources are currently not working as expected, and how? And can you paste the current configurations?
What we experienced was that the StatefulSet replicas constantly dropped to 0, and we observed a constant cycle of pods shutting down and being re-instantiated. At first we thought it was actually processing data and that the cycle was triggered every time there was a new message in the topic the functions were subscribed to, so to test it we shut down all the traffic, but the behavior was still there. Then I checked the function-mesh operator logs and saw this error message.
This is consistent with what I observed prior to fixing this issue, which was mainly caused by:
- FunctionMesh listens to too many events (HPA, StatefulSet), including events generated by changes in HPA.status and StatefulSet.status; these constantly trigger new reconciliation processes, and the changes produced by each new reconciliation trigger further reconciliations in turn.
- There is currently an issue with FunctionMesh: when a Function/Sink/Source is automatically scaled by HPA, all pods are rescheduled every time, which causes the metrics (CPU/memory) to go up over time, and the HPA then changes the number of replicas of the Function/Sink/Source frequently.
For reason 1, I filtered the events generated by StatefulSet.status in this PR and refined the reconciliation trigger conditions for FunctionMesh.
For reason 2, I suggest increasing the `stabilizationWindowSeconds` of the HPA; see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#stabilization-window
Can you paste the configuration of `spec.pod.autoScalingBehavior` and `spec.pod.autoScalingMetrics`?
And for resources that don't need HPA, you can remove `spec.maxReplicas`.
At this point, all our functions and sinks need scaling in our function mesh.
autoScalingMetrics:
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 60
This is currently it, but we can define an autoscaling behavior if necessary. Do you have benchmark values in mind, so we can give the scale-up and scale-down parameters a shot?
My preferred configuration is as follows.
pod:
  autoScalingMetrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80 # I have observed that the CPU usage of the pod stays around 57% when idle
  autoScalingBehavior:
    scaleDown:
      stabilizationWindowSeconds: 120
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
      - type: Percent
        value: 50
        periodSeconds: 15
      - type: Pods
        value: 2
        periodSeconds: 15
      selectPolicy: Max
By default, `autoScalingBehavior.scaleUp.stabilizationWindowSeconds` has a value of 0, which means that as soon as the HPA's scaling algorithm meets a condition (e.g., CPU usage exceeds 80%), the HPA will immediately increase the number of pods. Because of this issue, every time the number of replicas of a workload changes, all pods are rescheduled (i.e., rebuilt), which causes the CPU usage metric to skyrocket, which in turn keeps increasing the HPA's desired replica count.
I suggest setting `autoScalingBehavior.scaleUp.stabilizationWindowSeconds` to 120 or another reasonable value, which will make the HPA less sensitive.
In addition, configurations like `autoScalingBehavior.scaleDown.policies` and `autoScalingBehavior.scaleUp.policies` can also be used to control the magnitude of scaling.
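To give a rough sense of how these policies bound each step (based on the Kubernetes HPA documentation, not on measurements from a real cluster): with the configuration above and 4 current replicas, scaleUp may add at most max(50% of 4 = 2 pods, 2 pods) = 2 pods per 15-second period once the 120-second stabilization window allows it, while scaleDown with Percent 100 may remove all surplus replicas in a single 15-second period after its own 120-second window.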
add `spec.minReplicas` to indicate the minimum number of replicas for the workloads; change `spec.replicas` to an optional field
why not use `spec.replicas` as the default minReplicas?
`spec.replicas` should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"
This flow is not working with ArgoCD. HPA can't scale when the StatefulSet has a fixed "replicas" count. To work with ArgoCD, the replicas field should be left empty so that HPA can scale the StatefulSet. You can refer to this documentation for more detailed information.
It seems like setting autoScalingBehavior fixed the issue mentioned above, but as you mentioned, at this time all pods are rescheduled when HPA scales down or up. I know you put a lot of effort into this work and I am grateful to you, and I'm also very excited to see this feature in an upcoming release.
@nlu90 The `targetRef` of the HPA is the Function/Sink/Source, so when HPA is enabled, `spec.replicas` of the Function/Sink/Source will be controlled by the HPA. That is why we need a new field that is not affected by the HPA, i.e. `spec.minReplicas`.
Also, the webhook will help the user maintain the rule `minReplicas <= replicas <= maxReplicas` and will set initial values for `spec.replicas` and `spec.minReplicas` when they are empty.
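For illustration, a rough sketch of that defaulting and validation rule (the type and field names below are assumptions, not the actual webhook code):

package sketch

import "fmt"

// FunctionSpec mirrors only the three replica-related fields discussed above.
type FunctionSpec struct {
	MinReplicas *int32 // optional: minimum number of replicas for the workload
	Replicas    *int32 // optional: initial number of replicas, controlled by HPA afterwards
	MaxReplicas *int32 // optional: setting this enables HPA
}

func int32Ptr(v int32) *int32 { return &v }

// SetDefaults fills empty fields so the invariant below can hold:
// replicas starts out equal to minReplicas when it is not set.
func SetDefaults(spec *FunctionSpec) {
	if spec.MinReplicas == nil {
		spec.MinReplicas = int32Ptr(1)
	}
	if spec.Replicas == nil {
		spec.Replicas = int32Ptr(*spec.MinReplicas)
	}
}

// Validate enforces minReplicas <= replicas <= maxReplicas
// (the maxReplicas bound only applies when HPA is enabled).
func Validate(spec *FunctionSpec) error {
	if *spec.MinReplicas > *spec.Replicas {
		return fmt.Errorf("minReplicas (%d) must not exceed replicas (%d)", *spec.MinReplicas, *spec.Replicas)
	}
	if spec.MaxReplicas != nil && *spec.Replicas > *spec.MaxReplicas {
		return fmt.Errorf("replicas (%d) must not exceed maxReplicas (%d)", *spec.Replicas, *spec.MaxReplicas)
	}
	return nil
}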
@alperencelik Thanks. Per the milestone, this change will be included in the mid-September release, v0.6.0.
@tpiperatgod, somehow the issue is back. The functions cannot scale; in fact they get deleted as soon as they scale. We observe this in the kube-controller-manager logs, where every successful create is followed by a successful delete. The previous error from the function controller is thrown frequently, and in the kube-scheduler we also observe that volume mounts are erroring out constantly for the new pods that are being created, as follows:
E0827 00:04:44.373395 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.373456 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.389374 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 00:33:34.582859 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:33:34.582923 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:35:21.047823 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:35:21.047877 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643549 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643628 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 09:39:13.643661 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 09:39:13.659803 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:53:04.134059 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.134104 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.171206 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:56:33.602585 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.602650 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.637281 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 10:43:47.118786 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 10:43:47.118845 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 10:43:47.118864 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 10:43:47.151607 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 11:10:05.507240 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.507312 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.542283 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 16:50:17.789041 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 17:36:47.796489 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 23:07:54.580377 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:07:54.580431 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:07:54.580447 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:07:54.606340 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:19:56.126715 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.126767 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.133013 1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-0.170f56f49cdd8c10" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0827 23:41:13.876819 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:13.876860 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:41:13.876873 1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:41:13.878275 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:41:44.039050 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:44.039089 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242083 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242149 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.254107 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:06:50.788980 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.789034 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.801204 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:09:21.319015 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:21.319052 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834663 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834705 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.868887 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:15:37.074735 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.074813 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.076709 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:25:53.094399 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.094457 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.119481 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:47:17.269534 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 01:47:17.269590 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 03:41:43.951020 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.951061 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.969157 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 03:49:44.570421 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:49:44.570477 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845382 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845430 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.870361 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 04:40:50.285779 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.285822 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.298786 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 05:40:10.830900 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 05:40:10.830939 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 05:40:10.830951 1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 05:40:10.832292 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 08:08:26.497729 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.497986 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.506780 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 09:06:16.771165 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 09:06:16.771209 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929578 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929654 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.934027 1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-3.170f7b3012f49f86" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0828 10:45:10.646185 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:45:10.646229 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037715 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037777 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.955955 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.956012 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.993038 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 12:14:02.968369 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:14:02.968418 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 12:14:02.968432 1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 12:14:02.969748 1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 13:00:41.608169 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:00:41.608220 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:21:23.711091 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 13:21:23.711137 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279071 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279115 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962893 1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962960 1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
Also, please refer to the error event.
Hi @armangurkan, can you show the status of the HPA? I have tested this in my cluster and found that the cause is still the HPA trigger condition.
And I don't think this PR will solve the root cause; it will only improve the stability of the reconciliation (it also depends on the configuration of the function's HPA).
@tpiperatgod The weird thing is that, as a result of the colliding deployments, the autoscaling worked for some reason, and I have tried every possible combination but cannot get it to a working state. Do you have a Slack channel you could invite me to? This feature is very important for us, and we are happy to contribute as a team once we get into the details of the project over the Slack channel.
Welcome, please wait for me to find a suitable Slack channel.
You can also join this: https://pulsar.apache.org/community/