Move Piraeus Operator Helm chart to helm-charts repository
The Helm chart to deploy the Piraeus Operator is still located at https://github.com/piraeusdatastore/piraeus-operator/tree/v2/charts/piraeus. It cannot be installed without cloning the whole piraeus-operator repository. Instead, it should be moved to the helm-charts repository.
Furthermore, merging it with the linstor-cluster Helm chart should be considered; that chart contains the missing resources to configure the controller, satellites, etc.
You may want to take some inspiration from the chart we built here. It helps the user both install the operator and build a cluster, including control over the configuration of the different components.
We found that there's a risk with doing this through Helm: any hiccup during installation (such as a webhook not being online yet) can cause the initial chart install to fail, prompting an uninstall and reinstall. If that deletes the LinstorCluster object, it will trigger the deletion of the secret containing the master passphrase, causing at best a cluster disruption and at worst data loss. The charts for 2.8.1 and 2.9.0 now ensure that those resources are always kept even if the chart is uninstalled, to prevent that problem.
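For reference, the standard Helm mechanism for this kind of protection is the `helm.sh/resource-policy: keep` annotation. Below is a minimal sketch; the secret name, namespace, and values key are assumptions, not what the actual charts use:

```yaml
# Sketch only: name, namespace, and the values key are placeholders, not the
# identifiers used by the real Piraeus charts. The annotation itself is
# standard Helm: resources marked "keep" are left in the cluster when the
# release is uninstalled or rolled back.
apiVersion: v1
kind: Secret
metadata:
  name: linstor-passphrase              # hypothetical secret name
  namespace: piraeus-datastore
  annotations:
    helm.sh/resource-policy: keep       # never delete on helm uninstall/rollback
type: Opaque
stringData:
  MASTER_PASSPHRASE: {{ .Values.masterPassphrase | quote }}
```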
How do you deal with installing the CRDs, waiting for the webhook and only then installing the actual CRs?
This used to be an issue, but with this chart it can survive a chart failure + chart reinstall, as the cluster config stays intact.
> a chart failure + chart reinstall
Ideally, users will never have that. I guess that is fine if you are using some automation that simply retries the install, but it would be a horrible first experience if the install command fails because `helm install ...` returns some weird error.
Now I looked at another project that has similar issues, kube-prometheus-stack. Their solution is to:
- Install the CRDs using the `crds` subdirectory in the chart.
- Install a webhook, but set the failure policy to `Ignore` (sketched below).
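To illustrate the second point, here is a rough sketch of a webhook registration with the relaxed failure policy; every name, service, and rule below is a placeholder, not taken from either chart:

```yaml
# Sketch only: names, service, and rules are placeholders.
# failurePolicy: Ignore lets API requests through even while the webhook
# backend is still starting, so CRs created by the same chart do not fail.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-operator-validation           # hypothetical name
webhooks:
  - name: validate.example.io                 # hypothetical webhook name
    failurePolicy: Ignore                     # do not block requests while the webhook is unreachable
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: example-operator-webhook        # hypothetical service
        namespace: example-system
        path: /validate
    rules:
      - apiGroups: ["example.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["examples"]
```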
The second part we could take inspiration from, but the first part is an issue. The problems are summarized here. While our CRDs are stable in the sense that old stuff continues to work, as we gain new features there are also additional keys on the CRDs. If some new feature does not work because Helm did not upgrade the CRDs, that is again a pretty bad user experience.
That's the reason for the two separate charts currently: one installs the CRDs, operator, and webhook; the other applies the "workload resources" in the form of CRs.
@WanzenBug Usually we use `helm install --include-crds`. Since we use ArgoCD for deployment and commit the whole manifest, CRD updates are not a big deal.
We are looking at something like crd-bootstrap to help with CRD upgrades. Some projects have moved CRDs out of `/crds` and into `/templates` so that they can be managed by Helm, but that creates challenges with deployment dependencies, and there's no clean way to move from one approach to the other.
Specifically with the Piraeus Operator though, we've seen initial chart install failures coming from the fact that the webhook needs some time to come online. Creating a LinstorCluster before the webhook is up will cause an error. You could look at deploying the webhook as a pre-install hook in the chart, with a job that waits for the webhook to be healthy.
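One possible shape for that wait step, purely as a sketch: a Helm pre-install/pre-upgrade hook Job that polls the webhook service until it answers before the rest of the chart is applied. The service name, namespace, and health path are assumptions.

```yaml
# Sketch only: the service name, namespace, and /healthz path are assumptions.
# As a pre-install/pre-upgrade hook, this Job must succeed before Helm applies
# the non-hook resources, so the CRs are not created while the webhook is
# still coming online.
apiVersion: batch/v1
kind: Job
metadata:
  name: wait-for-webhook                      # hypothetical name
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
spec:
  backoffLimit: 30
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: wait
          image: curlimages/curl:8.8.0
          command:
            - sh
            - -c
            - |
              # Poll the (assumed) webhook service until it responds over TLS.
              until curl -ksf https://piraeus-operator-webhook.piraeus-datastore.svc/healthz; do
                echo "webhook not ready yet, retrying..."
                sleep 5
              done
```

For the initial install this assumes the webhook Deployment and Service are themselves created as earlier hooks (e.g. with a lower `helm.sh/hook-weight`), as the suggestion above describes.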
Our pack also configures the StorageClass for Piraeus. I found that changing the StorageClass parameters (which isn't supported) would also cause a Helm failure, leading to an uninstall & reinstall of the whole thing. That was the initial reason we adjusted the pack to ensure it would not cause loss of the master passphrase. We now deploy the StorageClass as a post-install & post-upgrade hook, which allows the StorageClass resource to be automatically replaced if you make a change to it.
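Roughly what that looks like in a template, as a sketch; the name, parameters, and values key are placeholders rather than what our pack actually ships:

```yaml
# Sketch only: name and parameters are placeholders (check the LINSTOR CSI
# docs for the exact parameter keys). StorageClass parameters are immutable,
# so a plain upgrade of a changed object fails; as a post-install/post-upgrade
# hook with the before-hook-creation delete policy, the old StorageClass is
# deleted and recreated on every install/upgrade instead.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-storage                        # hypothetical name
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: {{ .Values.storagePool | quote }}   # assumed values key
  linstor.csi.linbit.com/placementCount: "2"                              # example parameter
```

Existing volumes are unaffected by the recreation, since PVs only reference the StorageClass by name.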
Some more info on the kube-prometheus-stack chart; they describe their approach like this:
How the Chart Configures the Hooks
A validating and mutating webhook configuration requires the endpoint to which the request is sent to use TLS. It is possible to set up custom certificates to do this, but in most cases, a self-signed certificate is enough. The setup of this component requires some more complex orchestration when using helm. The steps are created to be idempotent and to allow turning the feature on and off without running into helm quirks.
- A pre-install hook provisions a certificate into the same namespace using a format compatible with provisioning using end user certificates. If the certificate already exists, the hook exits.
- The prometheus operator pod is configured to use a TLS proxy container, which will load that certificate.
- Validating and Mutating webhook configurations are created in the cluster, with their failure mode set to Ignore. This allows rules to be created by the same chart at the same time, even though the webhook has not yet been fully set up - it does not have the correct CA field set.
- A post-install hook reads the CA from the secret created by step 1 and patches the Validating and Mutating webhook configurations. This process will allow a custom CA provisioned by some other process to also be patched into the webhook configurations. The chosen failure policy is also patched into the webhook configurations.
Alternatives
It should be possible to use `jetstack/cert-manager` if a more complete solution is required, but it has not been tested. You can enable automatic self-signed TLS certificate provisioning via cert-manager by setting the `prometheusOperator.admissionWebhooks.certManager.enabled` value to `true`.
So I think this is a workable approach we can use. Setting the failure policy initially to `Ignore` will allow creation to succeed, and a post-install hook can then update the failure policy.
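A minimal sketch of that post-install step, assuming the webhook configuration name, namespace, TLS secret, and service account below (none of them taken from the actual charts):

```yaml
# Sketch only: all names are placeholders; the service account needs RBAC that
# allows patching webhook configurations. The Job runs after install/upgrade
# and flips the registration created with failurePolicy: Ignore over to Fail
# once the webhook is known to be serving, also injecting the CA bundle.
apiVersion: batch/v1
kind: Job
metadata:
  name: patch-webhook-policy                   # hypothetical name
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: webhook-patcher      # hypothetical service account
      containers:
        - name: patch
          image: bitnami/kubectl:1.30          # any image providing kubectl works
          command:
            - sh
            - -c
            - |
              # Read the CA from the (assumed) TLS secret and patch it, along
              # with the final failure policy, into the webhook configuration.
              CA=$(kubectl -n piraeus-datastore get secret webhook-tls -o jsonpath='{.data.ca\.crt}')
              kubectl patch validatingwebhookconfiguration piraeus-operator-validation \
                --type=json -p "[
                  {\"op\": \"replace\", \"path\": \"/webhooks/0/clientConfig/caBundle\", \"value\": \"$CA\"},
                  {\"op\": \"replace\", \"path\": \"/webhooks/0/failurePolicy\", \"value\": \"Fail\"}
                ]"
```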