Improve the deletion process of the `last_error_event` from the error history of a machine

Open • simcod opened this issue 1 year ago • 1 comment

The `last_error_event` of a machine should be cleared from its error history after some time (about 6 days).

While deploying metal-stack on the new Supermicro nodes, we encountered the following problem: machines that were already allocated (integrated into a Kubernetes cluster) had a `last_error_event` of : unexpectedly received in state pxe booting.

The metal-api-liveliness is running in the metal-control-plane namespace. The logs do not show any errors for machines.

{... "msg":"machine liveliness was requested"}
{... "msg":"machine liveliness evaluated","alive":x,"dead":0,"unknown":0,"errors":0}

However, listing the machines with `metalctl machine ls` returns some allocated machines with a ⭕ crashloop issue.

Dec 19 '24 16:12 simcod

The last error event and crashloop do not depend on each other. To me it sounds like this issue is more about resetting the crashloop field, which should actually happen as soon as a machine reaches the phoned home state?

https://github.com/metal-stack/metal-api/blob/master/cmd/metal-api/internal/fsm/states/phoned-home.go#L41
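To illustrate the distinction, here is a simplified sketch in which the two fields are handled independently. It is not the code from the linked file: the `Machine` type and the `OnPhonedHome` function are hypothetical names, and the sketch only assumes that reaching the phoned home state resets the crashloop field while leaving the last error event untouched.

```go
package main

import "fmt"

// Machine is a hypothetical, simplified view of the two fields discussed here.
type Machine struct {
	CrashLoop      bool
	LastErrorEvent string
}

// OnPhonedHome models the assumed behaviour when a machine reaches the
// phoned home state: the crashloop flag is reset, while the last error
// event is deliberately left as it is.
func OnPhonedHome(m *Machine) {
	m.CrashLoop = false
}

func main() {
	m := &Machine{
		CrashLoop:      true,
		LastErrorEvent: "unexpectedly received in state pxe booting",
	}
	OnPhonedHome(m)
	fmt.Printf("crashloop=%t last_error_event=%q\n", m.CrashLoop, m.LastErrorEvent)
}
```

Under this model, an allocated machine that still shows a crashloop after phoning home would point to the reset not being triggered, independent of how long the last error event is retained.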

If a machine has a last error event, metalctl indicates this with an exclamation mark, and there is a flag for defining how far into the past it looks.

Jan 07 '25 08:01 Gerrit91