[bug] use of kube-prometheus-stack helm template in cluster template breaks Omni
UPDATE: READ THE COMMENT BELOW FIRST
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
If you create a cluster using `omnictl cluster template sync` with a machine class and machine labels:
- the provisioning will fail and the nodes will go into "unknown" state
- the nodes will stop responding entirely. Rebooting either the nodes or Omni will not fix it. Destroying won't fix it.
- the only way to recover is to wipe the nodes entirely, from both the GUI and the nodes' storage, and start over from scratch.
The nodes are stuck forever with:
and whenever they receive a new command they will print
Expected Behavior
Provision the cluster based on the selected nodes' machineClass.
Steps To Reproduce
- On a clean Omni, generate an ISO with both amd extensions, drbd and zfs. Boot the 3 machines.
- Once the machines have joined Omni, add a bootstrap patch with the hostname, and a bootstrap2 patch with basic extension config and certificate rotation.
- In the Machine menu, add a label `o0`.
- Go to Machine classes, create a machineClass `o0` with filter `o0`.
- Create a template with, amongst other patches and configurations, the following:

```yaml
machineClass:
  name: o0
  size: 3
```

- Run:

```
omnictl cluster template sync --file o0
```

- Watch your machines burn.
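For context, here is roughly where that `machineClass` block sits inside a full cluster template. This is a minimal sketch from memory of the Omni template format, not the template from the report; the cluster name and version numbers are placeholders:

```yaml
# Hypothetical minimal cluster template (names/versions are placeholders)
kind: Cluster
name: example-cluster
kubernetes:
  version: v1.29.0
talos:
  version: v1.7.0
---
# machineClass selects machines by the label-based filter defined in the GUI
kind: ControlPlane
machineClass:
  name: o0
  size: 3
```

The actual template in the report also carried additional patches and configuration alongside this block.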
What browsers are you seeing the problem on?
No response
Anything else?
Tested: happens in both Omni 0.33 and 0.34, with Talos 1.7.0.
Actually, after investigating the issue further, this is not due to the label facility. It is due to the kube-prometheus-stack patch, generated with:

```
helm template --include-crds -n monitoring -f apps/kube-prometheus-stack.helm.yaml kps prometheus-community/kube-prometheus-stack --create-namespace | yq -i 'with(.cluster.inlineManifests.[] | select(.name=="monitoring-stack"); .contents=load_str("/dev/stdin"))' patches/monitoring-stack.yaml
```

and then adding that patch to your cluster template. The missing pre/post hooks (which are not run by `helm template ...`, as opposed to `helm install ...`) somehow break the cluster nodes, before the install even starts. It doesn't matter what you put in the Helm values, I think, but I can provide a sample if needed.
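One quick way to see which rendered resources were meant to be hooks is to grep the `helm template` output for the `helm.sh/hook` annotation: any document carrying it gets applied by Omni's inline manifests as a plain resource, without Helm's hook ordering. A self-contained sketch, using a tiny stand-in manifest in place of the real chart output (against the real chart, you would pipe `helm template ...` into the grep instead):

```shell
# Stand-in for `helm template ...` output: one plain resource, one hook.
cat > /tmp/rendered.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: plain-resource
---
apiVersion: batch/v1
kind: Job
metadata:
  name: admission-patch
  annotations:
    helm.sh/hook: post-install
EOF

# Show each hook annotation with a few lines of context, so the
# resource kind and name it belongs to are visible.
grep -B 3 'helm.sh/hook' /tmp/rendered.yaml
```

For kube-prometheus-stack specifically, the admission-webhook patch Jobs are shipped as hooks, so they are a likely place for the template/install behavior to diverge.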
That is as far as I managed to debug this, so I disabled the patch and it works. I edited this bug to reflect this, but if you feel it is out of scope for Omni, feel free to close it.