cluster-api-provider-proxmox

add MachineHealthCheck

Open 3deep5me opened this issue 1 year ago • 5 comments

Hi @sp-yduck,

this adds a MachineHealthCheck covering all Machines of a new cluster. It can help when a node never reaches the Running state, e.g. #145, network problems during startup, or other causes.

3deep5me avatar Nov 14 '23 12:11 3deep5me
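For reference, a MachineHealthCheck of the kind this PR describes might look roughly like the following. This is a hedged sketch based on the upstream Cluster API v1beta1 API, not the PR's actual diff; the name and timeout values are placeholders:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-mhc     # hypothetical name
spec:
  clusterName: ${CLUSTER_NAME}
  maxUnhealthy: 100%            # remediate even if every Machine is unhealthy
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: ${CLUSTER_NAME}
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
    - type: Ready
      status: Unknown
      timeout: 300s
```

With a selector on the cluster-name label, one MachineHealthCheck covers all Machines of the cluster, matching the intent stated above.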

Thank you for the PR! As you may know, this template file is used for the Quick Start, and I want to keep the Quick Start with a minimal setup. So if you want to include this, the option is:

  1. create a new template for it, so that users can choose a specific template to try specific features
     ref: https://cluster-api.sigs.k8s.io/clusterctl/commands/generate-cluster.html?highlight=flavor#flavors
     ref: https://cluster-api.sigs.k8s.io/clusterctl/commands/generate-cluster.html?highlight=flavor#alternative-source-for-cluster-templates

sp-yduck avatar Nov 15 '23 01:11 sp-yduck

I also changed the location of the quick start template; I hope that is fine for you. I don't know how clusterctl finds the cluster templates. Do I have to change anything else to make this work?

3deep5me avatar Nov 16 '23 09:11 3deep5me

clusterctl checks the assets of the release, so the file changes are OK. The thing is, I am using make release to output these assets for each release.

  1. make release-templates https://github.com/sp-yduck/cluster-api-provider-proxmox/blob/bbcdd56993d21f9e3581eed558806fc909b71cec/Makefile#L219-L220

  2. make generate-e2e-templates https://github.com/sp-yduck/cluster-api-provider-proxmox/blob/bbcdd56993d21f9e3581eed558806fc909b71cec/Makefile#L111-L113

I think you can use kustomize build template/base (or something similar) for both of them.

sp-yduck avatar Nov 17 '23 03:11 sp-yduck
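One way the two Makefile targets above could pick up the new manifest is a kustomize overlay on top of the shared base. This is only a sketch: the directory names templates/base and templates/mhc and the file name machinehealthcheck.yaml are assumptions, not the repo's actual layout:

```yaml
# templates/mhc/kustomization.yaml (hypothetical overlay)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                    # shared quick-start manifests
  - machinehealthcheck.yaml    # the MachineHealthCheck added by this PR
```

Rendering it with kustomize build templates/mhc > cluster-template-mhc.yaml would then produce a separate flavor asset without touching the minimal Quick Start template.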

btw, I believe MHC does not help with

E.g. https://github.com/sp-yduck/cluster-api-provider-proxmox/issues/145 or network problems during startup or other reasons.

since MHC checks the Machine and Node objects to confirm that the Node (in the workload cluster) is ready. So in a case like issue #145, if the VM becomes unhealthy before it joins the k8s cluster, MHC cannot find a Node associated with that unhealthy VM and cannot remediate it.

sp-yduck avatar Nov 20 '23 01:11 sp-yduck

I'm not sure about the detailed mechanics, but I can confirm that if a VM does not boot, it is deleted and recreated. (Though at the moment no VM boots, because I get the error every time 😢)

I think MHC also checks the status.conditions[].type: Ready field by default. That is the only way I can explain the behavior.

3deep5me avatar Nov 23 '23 17:11 3deep5me
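The delete-and-recreate behavior for VMs that never boot may be explained by the MachineHealthCheck controller's nodeStartupTimeout rather than the Ready condition: upstream Cluster API also treats a Machine as unhealthy when no Node registers within that window (10 minutes by default), which covers machines that fail before joining the cluster. A hedged sketch of setting it explicitly (names and values are placeholders):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: ${CLUSTER_NAME}-mhc     # hypothetical name
spec:
  clusterName: ${CLUSTER_NAME}
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: ${CLUSTER_NAME}
  # Remediate Machines whose Node never registers, e.g. a VM that fails to boot.
  nodeStartupTimeout: 10m
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
```

If that is what is happening here, the observed recreation of never-booting VMs and the earlier point that MHC cannot match a Node to such a VM are both consistent: the Node-condition checks don't fire, but the startup timeout does.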