Hardware profile assertion checks
Expected Behaviour
In Tinkerbell, workflows are scheduled to specific hardware. The hardware a workflow runs on should be guaranteed to match the profile defined for it.
Current Behaviour
If the memory, CPU, or networking profile of the physical (or virtual) device does not match the expectations defined in Tinkerbell, workflows may fail in unpredictable ways late in the process.
Possible Solution
Within OSIE or within an independent or stacked workflow, assertions should be made that the booted hardware matches the hardware profile that was defined.
If the hardware profile does not match, the workflow could be failed early in the process due to a misconfiguration fault.
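As a rough illustration of the kind of assertion meant here, the sketch below compares a declared profile against what was observed on the booted machine and fails with a message naming each mismatch. Everything in it is hypothetical: `HardwareFacts`, `HardwareMismatch`, `find_mismatches`, `assert_profile`, and the three fields are made up for this example and are not Tinkerbell's or OSIE's actual schema or API. How the observed values are collected is also left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HardwareFacts:
    """Hypothetical, simplified view of a machine (not Tinkerbell's schema)."""
    cpu_count: int
    memory_mib: int
    nic_count: int


class HardwareMismatch(Exception):
    """Raised when the booted machine does not match its declared profile."""


def find_mismatches(expected: HardwareFacts, observed: HardwareFacts) -> List[str]:
    """Return one human-readable message per field that differs."""
    problems = []
    for field in ("cpu_count", "memory_mib", "nic_count"):
        want = getattr(expected, field)
        got = getattr(observed, field)
        if want != got:
            problems.append(f"{field}: expected {want}, observed {got}")
    return problems


def assert_profile(expected: HardwareFacts, observed: HardwareFacts) -> None:
    """Fail early, with every mismatch listed, instead of late in the workflow."""
    problems = find_mismatches(expected, observed)
    if problems:
        raise HardwareMismatch("; ".join(problems))
```

Run as an early step, a check like this would turn a confusing late failure into an immediate error that points at the misconfigured field.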
Steps to Reproduce (for bugs)
- create a hardware profile with more CPU, RAM, or network interfaces than are physically available
- configure a workflow that takes advantage of these resources
- boot the workflow on the under-resourced hardware
- when the workflow fails, try to diagnose why it failed :-) (this failure could have been avoided)
Context
Physical servers can be installed with a missing or extra network card, the wrong amount of RAM, or the wrong disks.
Server BIOS / firmware settings may not match expectations (such as virtualization options). These settings can be defined in the hardware profile and asserted at an OSIE stage of boot-up.
These are not hypothetical problems; they have happened.
This may be better discussed as a proposal or as an OSIE issue. I don't have enough background to offer this as a proposal, so I am opening this as an issue for early discussion. Some level of hardware assertion may already be present that I do not know about; perhaps we could extend that?
@displague there's some code lying around in OSIE that does this for disks, but it's just a warning at the moment. This hasn't been fleshed out because we didn't really have confidence in the underlying data stored in the DB. It's poor form to fail a provision because hardware data doesn't match reality when it would have succeeded had it continued.
@mikemrm is about done working on a "discovery OS" as a non-workflow OSIE action. Once that's workflowized we can use it as part of a server discovery step and then be more confident in the data stored in Hardware. This way we'd be more comfortable making these checks more of a hard requirement.
I'm not sure how strict to be about requiring an exact match. Besides RAM and maybe disk size, I'm not so sure that extra disks, NICs, or CPUs would make a lot of sense to handle gracefully.
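One way to frame the strictness question is a per-field policy: some fields pass when the machine has at least what was declared, others must match exactly. The sketch below is only one possible reading of that idea. The `Match` enum, the `POLICY` table, the field names, and the choice of which fields are exact are all assumptions made for illustration, not existing OSIE behaviour.

```python
from enum import Enum
from typing import Dict, List


class Match(Enum):
    EXACT = "exact"        # observed value must equal the declared value
    AT_LEAST = "at_least"  # observed value may exceed the declared value


# Hypothetical policy: tolerate extra RAM and disk capacity, nothing else.
POLICY: Dict[str, Match] = {
    "memory_mib": Match.AT_LEAST,
    "disk_gib": Match.AT_LEAST,
    "disk_count": Match.EXACT,
    "nic_count": Match.EXACT,
    "cpu_count": Match.EXACT,
}


def check(expected: Dict[str, int], observed: Dict[str, int],
          policy: Dict[str, Match] = POLICY) -> List[str]:
    """Return a message for each field that violates its match mode."""
    failures = []
    for name, mode in policy.items():
        want, got = expected[name], observed[name]
        ok = got >= want if mode is Match.AT_LEAST else got == want
        if not ok:
            failures.append(f"{name} ({mode.value}): expected {want}, observed {got}")
    return failures
```

Under this policy a machine with more RAM than declared passes, while one with an extra NIC is reported. Whether a reported failure should abort the provision or only warn is a separate decision that this sketch does not make.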
@mmlb, I think you would have more experience with the contexts I provided than I do.
Do you think those are valid concerns?
From what I have witnessed, users can go hours to days with misconfigured hardware and be left to wonder why software isn't working. In the meantime, if they are leasing this hardware, they are not getting what they paid for (they are under- or overcharged).