autoscaling

agent/core: Treat failed requests as potentially successful

Open sharnoff opened this issue 7 months ago • 0 comments

Fixes #680; see there for detail on motivation. tl;dr: this fixes a known category of bugs, and AFAICT is a prerequisite for using the VM spec as a source of truth.

Brief summary of changes:

  • Introduce a new resourceBounds struct in pkg/agent/core that handles the uncertainty associated with requests that may or may not have succeeded.
  • Switch internal usage so that plugin permit, vm-monitor approved, and VM spec resources are all represented by resourceBounds.
  • Add a new test to extensively test this (TestFailuresNotAssumedSuccessful)
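
To illustrate the idea behind the first bullet, here is a minimal sketch of how bounds can capture the uncertainty left by a request that may or may not have succeeded. It is not the actual resourceBounds from pkg/agent/core: the resources type, the field names, and every method below are hypothetical and chosen only for illustration.

```go
package main

// resources is a hypothetical, simplified stand-in for a VM's resource amounts.
type resources struct {
	CPU uint // e.g. milli-CPUs
	Mem uint // e.g. memory slots
}

// resourceBounds is an illustrative sketch (not the real struct): it records
// the lowest and highest amounts the other component might currently hold.
type resourceBounds struct {
	Lower resources
	Upper resources
}

func minUint(a, b uint) uint {
	if a < b {
		return a
	}
	return b
}

func maxUint(a, b uint) uint {
	if a > b {
		return a
	}
	return b
}

// exactly returns bounds with no uncertainty.
func exactly(r resources) resourceBounds {
	return resourceBounds{Lower: r, Upper: r}
}

// confirmed reports the known value, if the bounds have collapsed to one.
func (b resourceBounds) confirmed() (resources, bool) {
	return b.Lower, b.Lower == b.Upper
}

// requestFailed widens the bounds to include the requested amounts: a failed
// request may still have taken effect, so both outcomes stay possible.
func (b resourceBounds) requestFailed(requested resources) resourceBounds {
	return resourceBounds{
		Lower: resources{
			CPU: minUint(b.Lower.CPU, requested.CPU),
			Mem: minUint(b.Lower.Mem, requested.Mem),
		},
		Upper: resources{
			CPU: maxUint(b.Upper.CPU, requested.CPU),
			Mem: maxUint(b.Upper.Mem, requested.Mem),
		},
	}
}

// requestSucceeded collapses the bounds: a confirmed response removes the
// uncertainty (this sketch ignores concurrent in-flight requests).
func (b resourceBounds) requestSucceeded(requested resources) resourceBounds {
	return exactly(requested)
}
```

In this sketch, starting from exactly(resources{CPU: 1000, Mem: 4}) and recording a failed request for resources{CPU: 2000, Mem: 2} leaves Lower at {1000, 2} and Upper at {2000, 4}. A later successful request collapses the bounds back to a single confirmed value.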

I expect we'll find bugs with this in production. Most of those should be fine, since they're resolved by restarting the pkg/agent.Runner and retrying with a fresh slate. Possible liveness issues would be more concerning (e.g. getting into a state where we stop communicating with other components). Those should hopefully be caught by the new test.


Notes for review: Keeping it marked as a draft for now; I want to first validate that this is a workable strategy for building towards #350.

sharnoff, Jan 06 '24 05:01