metal-api
RFC: Possibility to store operational issues to a machine
During normal operations, machines sometimes fail in ways such as:
- hard disk errors
- network card with duplicate MAC address
- cabling
- power supply failure
- etc.
It might be a good idea to track such issues per machine. To do so, we could either:
- create an issue in our private issue tracker and have an "open issues" field on the machine
- be independent of an external issue tracker, but then most of the functionality would have to be re-implemented here, so this is not an option
- more options welcome here...
To me, adding a MachineIssue type feels like the right approach.
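type MachineIssue struct {
	MachineID   string    // ID of the affected machine
	Description string    // short free-text summary of the problem
	URL         string    // link to the issue in the external tracker
	CreatedAt   time.Time // when the issue was recorded
	ClosedAt    time.Time // when the issue was resolved
}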
Then we can add the following metalctl command:
metalctl machine issue add <machineID> --description "nvme disk timeout" --issueurl "https://github.com/metal-stack/metal-api/issues/2"
And the other way round, the machine listing will flag machines that have issues:
metalctl machine issues
ID LAST EVENT WHEN AGE HOSTNAME PROJECT SIZE IMAGE PARTITION ISSUE ISSUEURL
00000000-0000-0000-0000-ac1f6b2d34a4 Preparing ↻ 4s fra-equ01 nvme disk timeout https://github.com/metal-stack/metal-api/issues/2
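The MachineIssue proposal above can be sketched as a small in-memory store. This is only an illustration under stated assumptions: `IssueStore`, `Add`, `Close`, `Open` and `IsOpen` are hypothetical names that do not exist in metal-api, and treating a zero `ClosedAt` as "still open" is an assumption, not something the RFC specifies.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// MachineIssue mirrors the struct proposed in the RFC.
type MachineIssue struct {
	MachineID   string
	Description string
	URL         string
	CreatedAt   time.Time
	ClosedAt    time.Time // assumption: the zero value means "still open"
}

// IsOpen reports whether the issue has not been closed yet.
func (i MachineIssue) IsOpen() bool { return i.ClosedAt.IsZero() }

// IssueStore is a hypothetical in-memory stand-in for an issues table.
type IssueStore struct {
	issues []MachineIssue
}

// Add records a new open issue for a machine.
func (s *IssueStore) Add(machineID, description, url string) error {
	if machineID == "" || url == "" {
		return errors.New("machine ID and issue URL are required")
	}
	s.issues = append(s.issues, MachineIssue{
		MachineID:   machineID,
		Description: description,
		URL:         url,
		CreatedAt:   time.Now(),
	})
	return nil
}

// Close marks every open issue with the given URL as closed
// and returns how many entries were closed.
func (s *IssueStore) Close(url string) int {
	n := 0
	for i := range s.issues {
		if s.issues[i].URL == url && s.issues[i].IsOpen() {
			s.issues[i].ClosedAt = time.Now()
			n++
		}
	}
	return n
}

// Open returns the open issues of one machine.
func (s *IssueStore) Open(machineID string) []MachineIssue {
	var out []MachineIssue
	for _, is := range s.issues {
		if is.MachineID == machineID && is.IsOpen() {
			out = append(out, is)
		}
	}
	return out
}

func main() {
	s := &IssueStore{}
	id := "00000000-0000-0000-0000-ac1f6b2d34a4"
	url := "https://github.com/metal-stack/metal-api/issues/2"
	if err := s.Add(id, "nvme disk timeout", url); err != nil {
		fmt.Println("add failed:", err)
		return
	}
	fmt.Println("open issues:", len(s.Open(id)))
	s.Close(url)
	fmt.Println("open issues after close:", len(s.Open(id)))
}
```

Closing by URL rather than by machine ID is a design choice in this sketch: one tracker issue could then be attached to several machines and resolved in a single step.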
/cc @Gerrit91 @mwennrich @ulrichSchreiner WDYT ?
Sorry, I do not really have a strong opinion about that. The only thing that comes to mind is that this could be added to MachineState, so that we have not only a "locked" and "reserved" state but also "maintenance" or "defect" or whatever. I think someone from operations should say whether this would help them, @mwennrich?
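A rough sketch of what extra state values could look like follows. All identifiers here (`MState`, the constants, `NewMachineState`) are hypothetical and chosen for illustration; they are not taken from the metal-api code base, and requiring a description for every non-available state is an assumption.

```go
package main

import "fmt"

// MState is a hypothetical machine state value.
type MState string

const (
	Available   MState = ""
	Locked      MState = "LOCKED"
	Reserved    MState = "RESERVED"
	Maintenance MState = "MAINTENANCE" // suggested in the comment above
	Defect      MState = "DEFECT"      // suggested in the comment above
)

// MachineState pairs a state with a free-text reason, e.g. an issue URL.
type MachineState struct {
	Value       MState
	Description string
}

// NewMachineState validates the state value. Every state except Available
// must carry a description, so operators can see why it was set.
func NewMachineState(v MState, description string) (MachineState, error) {
	switch v {
	case Available:
		return MachineState{Value: v}, nil
	case Locked, Reserved, Maintenance, Defect:
		if description == "" {
			return MachineState{}, fmt.Errorf("state %q requires a description", v)
		}
		return MachineState{Value: v, Description: description}, nil
	default:
		return MachineState{}, fmt.Errorf("unknown machine state %q", v)
	}
}

func main() {
	st, err := NewMachineState(Defect, "nvme disk timeout")
	fmt.Println(st.Value, st.Description, err)
}
```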
I'm unsure about this feature. At first it sounds good, but who creates these issues? And more importantly: how do you make sure that such issues are removed from the machine once they are resolved?
Does an issue mark the machine as defective or unusable? If not, then after some time you will have machines with many issues and will not know whether any of them has already been fixed.
This could potentially be done with an issue webhook in GitLab: https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#issue-events
GitLab issues are free text... it will be hard to connect them to a specific machine ID and make a specific REST call when an issue event happens.
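One way to narrow the free-text problem would be a convention in the issue description. The sketch below assumes a line of the form `machine: <uuid>`, which is an invented convention and not something GitLab or metal-api defines. The JSON field names (`object_kind`, `object_attributes.description`, `.action`, `.url`) follow the issue-events payload as described on the linked GitLab page and should be checked against it. `ParseIssueEvent` and `IssueUpdate` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

// issueEvent holds the few fields of a GitLab issue webhook payload used here.
type issueEvent struct {
	ObjectKind       string `json:"object_kind"`
	ObjectAttributes struct {
		Description string `json:"description"`
		Action      string `json:"action"`
		URL         string `json:"url"`
	} `json:"object_attributes"`
}

// machineRef matches the assumed "machine: <uuid>" line in the issue text.
var machineRef = regexp.MustCompile(
	`(?mi)^machine:\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s*$`)

// IssueUpdate is what a webhook receiver could pass on to metal-api.
type IssueUpdate struct {
	MachineID string
	URL       string
	Closed    bool
}

// ParseIssueEvent extracts the machine reference from an issue event payload.
func ParseIssueEvent(payload []byte) (IssueUpdate, error) {
	var ev issueEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return IssueUpdate{}, fmt.Errorf("decode payload: %w", err)
	}
	if ev.ObjectKind != "issue" {
		return IssueUpdate{}, errors.New("not an issue event")
	}
	m := machineRef.FindStringSubmatch(ev.ObjectAttributes.Description)
	if m == nil {
		return IssueUpdate{}, errors.New("no machine reference in issue description")
	}
	return IssueUpdate{
		MachineID: m[1],
		URL:       ev.ObjectAttributes.URL,
		Closed:    ev.ObjectAttributes.Action == "close",
	}, nil
}

func main() {
	payload := []byte(`{"object_kind":"issue","object_attributes":{
		"description":"nvme disk timeout\nmachine: 00000000-0000-0000-0000-ac1f6b2d34a4",
		"action":"close","url":"https://gitlab.example.invalid/ops/issues/7"}}`)
	u, err := ParseIssueEvent(payload)
	fmt.Println(u.MachineID, u.Closed, err)
}
```

Issues without the marker line are rejected rather than guessed at, so a missing reference shows up as an error instead of a wrong machine being updated.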
I'm still missing which parts of metal-api should inspect the issues table, and why. Are such machines still allocatable?
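If the answer were that machines with open issues should not be allocated, the allocation path would be one place that inspects the issues. The following is a sketch of that single option only; `Machine`, `OpenIssues` and `AllocationCandidates` are hypothetical and nothing in the thread settles on this behaviour.

```go
package main

import "fmt"

// Machine is a trimmed-down, hypothetical view of a machine record.
type Machine struct {
	ID         string
	OpenIssues int // number of MachineIssue entries that are not closed
}

// AllocationCandidates returns the machines that have no open issues.
func AllocationCandidates(machines []Machine) []Machine {
	var out []Machine
	for _, m := range machines {
		if m.OpenIssues == 0 {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	ms := []Machine{{ID: "a", OpenIssues: 0}, {ID: "b", OpenIssues: 2}}
	for _, m := range AllocationCandidates(ms) {
		fmt.Println(m.ID)
	}
}
```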
@mwennrich do you have an opinion here?
related: https://github.com/metal-stack/metal-hammer/issues/17