autoscaling icon indicating copy to clipboard operation
autoscaling copied to clipboard

Bug: Scheduler metrics can show buffer when there isn't any

Open sharnoff opened this issue 8 months ago • 1 comments

Environment

Prod (eu-central-1)

Steps to reproduce

Unknown

Expected result

The node metrics reported by the scheduler should always match its internal state.

Actual result

The scheduler showed an extended period of non-zero buffer CPU/memory, even though the state dump showed no nodes with non-zero buffer CPU or memory (gist; collected by ntk a scheduler-state).

Screenshot of graphs in grafana, showing an initial spike in buffer CPU and memory, followed by a flat non-zero line of the buffer amount from 23:03 to 23:57

The scheduler was restarted at about 2023-12-05 23:57, at which point the issue was resolved.


Note that this was happening in at the same time as some other node-level disruptions on the same kubernetes node that the scheduler was on (slack link for more), so this there may have been some related impact there.

Other logs, links

  • ...

sharnoff avatar Dec 06 '23 03:12 sharnoff