[bug] use of kube-prometheus-stack helm template in cluster template breaks Omni

Open bernardgut opened this issue 1 year ago • 1 comments

UPDATE: READ THE COMMENT BELOW FIRST

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

If you create a cluster using omnictl cluster template sync with a machine class and machine labels:

  • the provisioning will fail and the nodes will go into the "unknown" state
  • the nodes will stop responding entirely. Rebooting any of the nodes or Omni will not fix it. Destroying the cluster won't fix it either.
  • the only way to recover is to nuke the nodes entirely, from both the GUI and the nodes' storage, and start over from scratch.

The nodes are stuck forever with: (image), and whenever they receive a new command they print: (image)

Expected Behavior

The cluster is provisioned based on the selected nodes' machineClass.

Steps To Reproduce

  1. On a clean Omni, generate an ISO with both amd extensions, drbd and zfs. Boot the 3 machines.
  2. Once the machines have joined Omni, add a bootstrap patch with the hostname, and a bootstrap2 patch with basic extension config and certificate rotation.
  3. In the Machine menu, add a label o0.
  4. Go to Machine classes and create a machineClass o0 with filter o0.
  5. Create a template with, amongst other patches and configurations, the following:

     machineClass:
       name: o0
       size: 3

  6. Run omnictl cluster template sync --file o0

watch your machines burn.

What browsers are you seeing the problem on?

No response

Anything else?

Tested: happens in both Omni 0.33 and 0.34, with Talos 1.7.0.

bernardgut avatar Apr 26 '24 15:04 bernardgut

Actually, after investigating the issue further, this is not due to the label facility. It is due to rendering the kube-prometheus-stack with:

helm template --include-crds -n monitoring -f apps/kube-prometheus-stack.helm.yaml kps prometheus-community/kube-prometheus-stack --create-namespace | yq -i 'with(.cluster.inlineManifests.[] | select(.name=="monitoring-stack"); .contents=load_str("/dev/stdin"))' patches/monitoring-stack.yaml

and then adding that patch to your cluster template. The missing pre/post hooks (which are not included in helm template ..., as opposed to helm install ...) somehow break the cluster nodes (before the install even starts). I don't think it matters what you put in the helm values, but I can provide a sample if you need one.
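
Whether hook resources are actually absent from the rendered output can be checked rather than assumed. Below is a minimal sketch, assuming the rendered manifests arrive on stdin. The function name count_hook_annotations is hypothetical. It only counts lines carrying the helm.sh/hook annotation that Helm uses to mark hook resources, and it does not parse the YAML.

```shell
#!/bin/sh
# Hypothetical helper (not part of helm, yq or omnictl):
# reads rendered manifests on stdin and prints the number of lines that
# carry a "helm.sh/hook" annotation key, quoted or unquoted.
# Related keys such as helm.sh/hook-weight are deliberately not counted.
count_hook_annotations() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the function usable under "set -e".
  grep -cE '"?helm\.sh/hook"?:' || true
}

# Usage sketch (same chart and values file as in the command above,
# without the yq stage):
#   helm template --include-crds -n monitoring \
#     -f apps/kube-prometheus-stack.helm.yaml \
#     kps prometheus-community/kube-prometheus-stack | count_hook_annotations
```

A count of 0 would support the theory that the hooks never make it into the patch. A non-zero count would suggest that the hook resources are rendered into the inline manifest like any other resource, which would point the investigation in a different direction.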

That is as far as I managed to debug this, so I disabled the patch and it works. I edited this bug to reflect this, but if you feel it is out of scope for Omni, feel free to close it.

bernardgut avatar Apr 27 '24 09:04 bernardgut