Bug: scheduler has negative "buffer" value
Environment
Prod (occurred twice recently)
Steps to reproduce
Not yet clear. Here's an example:
```json
{"level":"info","ts":1709922373.111944,"logger":"autoscale-scheduler","caller":"plugin/state.go:1379","msg":"Adding VM pod to node","action":"read cluster state","virtualmachine":{"namespace":"default","name":"compute-falling-cake-a6d84vya"},"pod":{"namespace":"default","name":"compute-falling-cake-a6d84vya-dv647"},"node":"i-0d216a75a106c181d.us-west-2.compute.internal","verdict":{"cpu":"pod = 0.25/0.25 (node 14.25 -> 14.5 / 127.61, 0 -> 4.294967046e+06 buffer)","mem":"pod = 1Gi/1Gi (node 57Gi -> 58Gi / 519497968Ki, 0 -> -1Gi buffer"}}
```
I think this is entirely caused by faulty logic in `(*AutoscaleEnforcer).readClusterState()`, but I haven't looked into it thoroughly.

Honestly, it's also a little odd that `readClusterState` has its own implementation of the reserve logic, rather than using the shared version that was added in #666.
Expected result
Any buffer value from adding a VM should be non-negative.
Actual result
The memory "buffer" value was negative (see `-1Gi buffer` in the log above), and the CPU value underflowed.