RFC: Possibility to store operational issues to a machine

Open majst01 opened this issue 4 years ago • 7 comments

During normal operations, machines sometimes suffer failures such as:

  • hard disk errors
  • network card with a duplicate MAC address
  • cabling
  • power supply failure
  • etc.

It might be a good idea to track such issues per machine. To do so, we could either:

  • create an issue in our private issue tracker and have an open issues field on the machine
  • stay independent of an external issue tracker; but then most of its functionality would have to be re-implemented here, so this is not an option
  • other options are welcome here...

To me, adding a MachineIssue type feels like the right approach.

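type MachineIssue struct {
    MachineID   string    // machine the issue belongs to
    Description string    // short free-text description
    URL         string    // link to the issue in the external tracker
    CreatedAt   time.Time
    ClosedAt    time.Time
}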
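
As a rough illustration of how such records could be kept per machine, here is a minimal in-memory sketch. The IssueStore type and its methods are hypothetical and not part of metal-api, and treating a zero ClosedAt as "still open" is an assumption of this sketch rather than something decided in this RFC.

```go
package main

import (
	"fmt"
	"time"
)

// MachineIssue repeats the struct proposed in this RFC.
// Assumption: a zero ClosedAt means the issue is still open.
type MachineIssue struct {
	MachineID   string
	Description string
	URL         string
	CreatedAt   time.Time
	ClosedAt    time.Time
}

// IsOpen reports whether the issue has not been closed yet.
func (i MachineIssue) IsOpen() bool { return i.ClosedAt.IsZero() }

// IssueStore is a hypothetical in-memory store keyed by machine ID.
type IssueStore struct {
	issues map[string][]MachineIssue
}

// NewIssueStore returns an empty store.
func NewIssueStore() *IssueStore {
	return &IssueStore{issues: make(map[string][]MachineIssue)}
}

// Add records a new open issue for the machine.
func (s *IssueStore) Add(machineID, description, url string) {
	s.issues[machineID] = append(s.issues[machineID], MachineIssue{
		MachineID:   machineID,
		Description: description,
		URL:         url,
		CreatedAt:   time.Now().UTC(),
	})
}

// Close marks every open issue of the machine that has the given URL
// as closed and returns how many issues were closed.
func (s *IssueStore) Close(machineID, url string) int {
	closed := 0
	list := s.issues[machineID]
	for idx := range list {
		if list[idx].URL == url && list[idx].IsOpen() {
			list[idx].ClosedAt = time.Now().UTC()
			closed++
		}
	}
	return closed
}

// Open returns the machine's issues that are still open.
func (s *IssueStore) Open(machineID string) []MachineIssue {
	var open []MachineIssue
	for _, i := range s.issues[machineID] {
		if i.IsOpen() {
			open = append(open, i)
		}
	}
	return open
}

func main() {
	s := NewIssueStore()
	id := "00000000-0000-0000-0000-ac1f6b2d34a4"
	url := "https://github.com/metal-stack/metal-api/issues/2"
	s.Add(id, "nvme disk timeout", url)
	fmt.Println("open issues:", len(s.Open(id)))
	s.Close(id, url)
	fmt.Println("open issues after close:", len(s.Open(id)))
}
```

Closed issues stay in the store, so the history of a machine remains visible while only open issues would need to show up in listings.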

Then we can add the following metalctl command:

metalctl machine issue add <machineID> --description "nvme disk timeout" --issueurl "https://github.com/metal-stack/metal-api/issues/2" 
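
To make the shape of that command concrete, the following sketch parses its arguments with Go's standard flag package. It does not reflect how metalctl is actually implemented; parseIssueAdd and issueAddArgs are hypothetical names, and making --description mandatory is an assumption.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// issueAddArgs holds the parsed arguments of the proposed
// "machine issue add" command (hypothetical type).
type issueAddArgs struct {
	MachineID   string
	Description string
	IssueURL    string
}

// parseIssueAdd expects the machine ID as the first argument, followed
// by the --description and --issueurl flags.
func parseIssueAdd(args []string) (issueAddArgs, error) {
	if len(args) == 0 || args[0] == "" || args[0][0] == '-' {
		return issueAddArgs{}, errors.New("machine ID is required as the first argument")
	}
	out := issueAddArgs{MachineID: args[0]}

	fs := flag.NewFlagSet("machine issue add", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors
	fs.StringVar(&out.Description, "description", "", "short description of the issue")
	fs.StringVar(&out.IssueURL, "issueurl", "", "link to the issue in the external tracker")
	if err := fs.Parse(args[1:]); err != nil {
		return issueAddArgs{}, err
	}
	// Assumption: an issue without a description is not useful.
	if out.Description == "" {
		return issueAddArgs{}, errors.New("--description is required")
	}
	return out, nil
}

func main() {
	a, err := parseIssueAdd([]string{
		"00000000-0000-0000-0000-ac1f6b2d34a4",
		"--description", "nvme disk timeout",
		"--issueurl", "https://github.com/metal-stack/metal-api/issues/2",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", a)
}
```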

Conversely, the machine listing would flag machines that have issues:

metalctl machine issues
ID                                    LAST EVENT   WHEN  AGE  HOSTNAME  PROJECT  SIZE  IMAGE  PARTITION  ISSUE              ISSUEURL
00000000-0000-0000-0000-ac1f6b2d34a4  Preparing ↻  4s                                         fra-equ01  nvme disk timeout  https://github.com/metal-stack/metal-api/issues/2

majst01 Mar 20 '20 07:03

/cc @Gerrit91 @mwennrich @ulrichSchreiner WDYT?

majst01 Mar 20 '20 08:03

Sorry, I do not really have a strong opinion on this. The only thing that comes to mind is that we could add this to MachineState, so that we have not only a "locked" and a "reserved" state but also "maintenance", "defect", or similar. I think someone from operations should say whether this would help them, @mwennrich?
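
A minimal sketch of what such an extended state could look like follows. The type, the constant names, the string values, and the allocation policy are all illustrative assumptions and are not taken from the metal-api code.

```go
package main

import "fmt"

// MachineState is a hypothetical operational state of a machine.
type MachineState string

const (
	StateAvailable   MachineState = ""            // no restriction
	StateLocked      MachineState = "locked"      // state mentioned in this thread
	StateReserved    MachineState = "reserved"    // state mentioned in this thread
	StateMaintenance MachineState = "maintenance" // suggested addition
	StateDefect      MachineState = "defect"      // suggested addition
)

// Allocatable reports whether a machine in this state may be handed out.
// Assumption: only a machine without any restriction is allocatable.
func (s MachineState) Allocatable() bool {
	return s == StateAvailable
}

func main() {
	states := []MachineState{
		StateAvailable, StateLocked, StateReserved, StateMaintenance, StateDefect,
	}
	for _, s := range states {
		fmt.Printf("%q allocatable: %t\n", string(s), s.Allocatable())
	}
}
```

With a state like this, the question of whether a machine with problems may still be allocated gets an explicit answer instead of being inferred from a list of issues.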

Gerrit91 Mar 23 '20 13:03

I'm unsure about this feature. At first it sounds good, but who creates these issues? And more importantly: how do you make sure that such issues are removed from the machine once they are resolved?

Does an issue mark the machine as defective or unusable? If not, then after some time you will have machines with many issues and will not know whether any of them has already been fixed.

ulrichSchreiner Mar 31 '20 12:03

This could potentially be done with an issue webhook in GitLab: https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#issue-events

majst01 Mar 31 '20 12:03

GitLab issues are free text, so it will be hard to connect them to a specific machine ID and make a specific REST call when an issue event happens. I'm still missing which parts of metal-api should inspect the issues table, and why. Are such machines allocatable?
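
To illustrate the free-text problem, here is a sketch of what a webhook receiver would have to do under one possible convention: the machine UUID must appear somewhere in the issue title or description. The payload fields read here (object_kind, object_attributes.title, description, url, action) follow my understanding of the GitLab issue event payload and should be checked against the GitLab documentation. ParseIssueEvent and MachineEvent are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// issueEvent holds the few payload fields this sketch reads.
type issueEvent struct {
	ObjectKind       string `json:"object_kind"`
	ObjectAttributes struct {
		Title       string `json:"title"`
		Description string `json:"description"`
		URL         string `json:"url"`
		Action      string `json:"action"`
	} `json:"object_attributes"`
}

// uuidRe matches a UUID such as 00000000-0000-0000-0000-ac1f6b2d34a4.
var uuidRe = regexp.MustCompile(`(?i)\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b`)

// MachineEvent is what the sketch extracts from an issue event (hypothetical type).
type MachineEvent struct {
	MachineID string
	IssueURL  string
	Action    string
}

// ParseIssueEvent returns the first machine UUID found in the issue text.
// The bool is false when the payload is not an issue event or when it
// contains no UUID, which is exactly the case that free text cannot rule out.
func ParseIssueEvent(payload []byte) (MachineEvent, bool, error) {
	var ev issueEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return MachineEvent{}, false, err
	}
	if ev.ObjectKind != "issue" {
		return MachineEvent{}, false, nil
	}
	attrs := ev.ObjectAttributes
	id := uuidRe.FindString(attrs.Title + "\n" + attrs.Description)
	if id == "" {
		return MachineEvent{}, false, nil
	}
	return MachineEvent{
		MachineID: strings.ToLower(id),
		IssueURL:  attrs.URL,
		Action:    attrs.Action,
	}, true, nil
}

func main() {
	payload := []byte(`{"object_kind":"issue","object_attributes":{` +
		`"title":"nvme disk timeout",` +
		`"description":"machine 00000000-0000-0000-0000-ac1f6b2d34a4 fails",` +
		`"url":"https://gitlab.example.com/ops/issues/1","action":"close"}}`)
	ev, ok, err := ParseIssueEvent(payload)
	fmt.Println(ev, ok, err)
}
```

The sketch only works if whoever files the issue follows the convention; an issue that omits the UUID is silently ignored, which is the weakness raised in this comment.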

ulrichSchreiner Mar 31 '20 13:03

@mwennrich, do you have an opinion on this?

majst01 Apr 16 '20 05:04

related: https://github.com/metal-stack/metal-hammer/issues/17

majst01 May 11 '20 06:05