Feature: Avoid using `latest` templates by default and consider pinning via CRD

Open bernardhalas opened this issue 6 months ago • 10 comments

Motivation

terraformer template selection currently defaults to the latest git tag. This is undesirable behavior: new template changes may have a Claudie upgrade as a dependency, so the switch to a new template needs to happen in a controlled manner.

Moreover, for users who override templates for several providers, this issue proposes a new CRD object that avoids copy/pasting code across the InputManifest.spec.providers blocks and also improves the UX of migrating all cluster manifests to a new version; in that case only one modification is required, in the Custom Resource.

Description

By default, this issue proposes that Claudie always take templates from the git tag identical to the Claudie release (e.g. v0.9.10). The repo/tag selection could be overridden via the following API: https://docs.claudie.io/latest/input-manifest/external-templates/; or alternatively via the CRD below, should that proposal be accepted.

Now, as part of this topic, we propose moving the template reference to a custom CRD:

apiVersion: claudie.io/v1beta1
kind: TemplateReference
metadata:
  name: private-templates
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  repository: "https://git.somehost.io/claudie-templates"
  tag: "v0.9.10"
  path: "production"

The path is further expected to follow the layout production/terraformer/<provider> (e.g. production/terraformer/genesiscloud).

On top of this, we suggest always deploying the following custom resource as part of the install guide:

apiVersion: claudie.io/v1beta1
kind: TemplateReference
metadata:
  name: upstream-templates
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  repository: "https://github.com/berops/claudie-config"
  tag: "v0.9.10"
  path: "templates"

Such a TemplateReference will be referenced in the InputManifest the following way:

spec:
  providers:
    - name: genesiscloud
      providerType: genesiscloud
      templateRef: private-templates # optional, defaults to `upstream-templates`
      secretRef:
        name: genesiscloud-secret
        namespace: secrets

Should the CRD exist, it will take precedence over the templates: stanza.
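For reference, a per-provider override via the existing external-templates API looks roughly like the following. Field names are taken from the linked documentation page; treat the exact shape as indicative rather than authoritative:

```yaml
spec:
  providers:
    - name: genesiscloud
      providerType: genesiscloud
      templates:                    # existing per-provider override stanza
        repository: "https://github.com/berops/claudie-config"
        tag: "v0.9.10"              # optional in the current API
        path: "templates/terraformer/genesiscloud"
      secretRef:
        name: genesiscloud-secret
        namespace: secrets
```

Under this proposal, a TemplateReference CRD referenced by the provider would win over this stanza.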

Exit criteria

  • [ ] Optionally, migrate the template version pinning API to a CRD
  • [ ] By default, always consume templates from the tag that matches the Claudie release tag

bernardhalas avatar Jun 06 '25 10:06 bernardhalas

The TemplateReference CRD could be useful if you have many InputManifests referencing the same templates, which would skip some boilerplate and possible typos and copy/paste errors. That's the only pro I can think of compared to the current solution.

Additionally, for future use cases where anyone could define their own version of, say, Ansible playbooks/manifests to be deployed, we would also need to implement another CRD that would be referenced per cluster; for this I would like to use the Settings CRD introduced in the Envoy PR.

It allows you to reference a Setting custom resource at the Role level of the loadbalancer, which overwrites the default configs that Claudie ships with using custom ones.

...
  loadBalancers:
    roles:
      - name: apiserver
        protocol: tcp
        port: 6443
        targetPort: 6443
        targetPools:
            - control-pools
        settings:
            proxyProtocol: true
            stickySessions: false
+        settingsRef:
+          name: custom-envoy
+          namespace: claudie

This could then be further extended to also be used at the Kubernetes cluster level for custom Ansible playbooks/manifests:

  kubernetes:
    clusters:
      - name: dev-cluster
        version: 1.27.0
        network: 192.168.2.0/24
+     settingsRef:
+        name: custom-manifests
+        namespace: claudie
        pools:
          control:
            - control-htz
            - control-gcp
          compute:
            - compute-htz
            - compute-gcp
            - compute-azure
            - htz-autoscaled

and the Settings CRD would look like

apiVersion: claudie.io/v1beta1
kind: Setting
metadata:
  name: custom-envoy
  namespace: claudie
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  envoy:
    lds: |
         ...
    cds: |
         ...
---
apiVersion: claudie.io/v1beta1
kind: Setting
metadata:
  name: custom-manifests
  namespace: claudie
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  ansible:
    playbooks: ["list of items to download and apply"]

There could be multiple Settings created, or simply a single giant one referenced at different levels in the InputManifest; the context relevant to where it was referenced would always be extracted.

So if you referenced the custom-envoy Setting at the Kubernetes cluster level, it would do nothing, as it would extract the spec.ansible or spec.manifests overrides, which would be empty.
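A single combined Setting, as described above, might look like the following. This is a hypothetical sketch merging the two examples; only the key relevant to the reference site would be extracted:

```yaml
apiVersion: claudie.io/v1beta1
kind: Setting
metadata:
  name: combined-settings
  namespace: claudie
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  envoy:        # extracted when referenced at a loadbalancer Role level
    lds: |
      ...
    cds: |
      ...
  ansible:      # extracted when referenced at the kubernetes cluster level
    playbooks: ["list of items to download and apply"]
```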

Despire avatar Jun 06 '25 14:06 Despire

Note that having a pre-defined version of templates for each release would result in a rolling update of the infrastructure whenever Claudie itself is updated.

Despire avatar Jun 06 '25 14:06 Despire

@Despire

For clarification, would these Settings objects be included in the data passed to the terraform templates? If so, and if this Settings CRD can contain any structure users want, this would complement #1756 perfectly, as users would now be able to create any logic they want for provider-specific and/or environment-specific behavior/constraints within templates, using the custom data they need.

As of right now, you would have to use an annotation in the cluster/node pool definition with a JSON string to then parse in the template (which isn't possible at the moment, and is why I created #1756) in order to pass arbitrary data to templates. That is really not something I'd like to actually deploy, both because annotations are absolutely not meant for this, and because of the security implications of having (possibly critical) configuration data stored in an annotation as plaintext.

torsina avatar Jun 06 '25 17:06 torsina

When will this set of FRs be reviewed and, if approved, a PR opened so work on these can start?

torsina avatar Jun 08 '25 22:06 torsina

Now that #1756 (and by extension #1768) are merged, all kinds of computation can be done in templates, but the issue remains that users can't pass arbitrary data to templates other than via annotations/labels on Cluster/NodePool CRDs. Should this FR be worked on next? @Despire @bernardhalas

torsina avatar Jun 13 '25 19:06 torsina

We are still discussing how exactly we will implement this. As for whether this FR will be worked on next, I would say so, once we agree on an implementation; until then it will be on hold.

but the issue remains that users can't pass arbitrary data to templates

We'll also try to consider passing user-defined data to the templates if the scope of this FR does not require too many changes; if it does, we might handle it as another FR after this one.

Despire avatar Jun 16 '25 04:06 Despire

This issue is the tip of the iceberg of a larger set of issues, all linked together.

  1. How to template manifests/playbooks/configs not just for terraformer, but also for ansibler (Ansible plays), kubeEleven (KubeOne), and kuber (K8s manifests). The concept is going to be designed as part of this issue. The implementation for terraformer will also be a goal of this issue.
  2. We will design how to pass a custom dataset to the templating engine, so that the user has full transparency over which variables and variable keys can be referenced in the user templates.
  3. We will design an interface for custom cloud provider creation by passing the full provider block from the InputManifest, including referenced secrets.

From here on, this issue will deal just with item 1 from this list, and just for the terraformer service. The implementation for the other services (ansibler (VM config management in Ansible), kuber (Claudie default K8s manifests), kubeEleven (K8s provisioning), LB config override) will be treated via individual issues.

bernardhalas avatar Jun 21 '25 12:06 bernardhalas

There's a need to template several sets of manifests, for different purposes, in different formats, languages, and sizes. The general template interpolation syntax is Go templates, as Go is the language Claudie is written in.
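As an illustration of the interpolation syntax, a user template might embed Go-template actions directly in a manifest. The variable names below are purely hypothetical, since the dataset exposed to templates is exactly what item 2 in the previous comment sets out to design:

```yaml
# Hypothetical user template; .ClusterName and .Provider.Name are illustrative
# placeholders, not a confirmed part of Claudie's template dataset.
apiVersion: v1
kind: ConfigMap
metadata:
  name: "{{ .ClusterName }}-addons"
data:
  provider: "{{ .Provider.Name }}"
```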

Now, there are the following options for how to organize the templates.

1. A new CRD type that would contain a full template (e.g. one CRD per category), or a CRD with several templates differentiated by keys; e.g.

spec:
  playbooks: |
    ...
  k8s-manifests: |
    ...

This CR would be referenced in the InputManifest like:

spec:
  kubernetes:
    clusters:
      - name: dev-cluster
+       settingsRef: regionalNfsServers

The downside is the requirement for a full unwrap of the set of manifests that a user would like to override (e.g. for the kuber service). There's little to no difference whether we separate the template categories by keys (e.g. playbooks, k8s-manifests, ...) or create a different CRD for each (PlaybookSettings, K8sManifestSettings, ...).

2. Store templates in a separate Git repo

Define a TemplateGitReference CRD that would be a pointer to a git repo with templates.

apiVersion: claudie.io/v1beta1
kind: TemplateGitReference
metadata:
  name: raspberry-optimized-template
spec:
  hostname: git.mycompany.com
  commitRef: feature/privateEdgeRegistry
  secretRef: gitToken
  method: https
  path: 'rpi-gen5/'

This template would then be referenced at various places in the InputManifest like:

spec:
  kubernetes:
    clusters:
    - name: development
+     templateRef: raspberry-optimized-template
...
  loadbalancers:
    roles:
    - name: denver-03-hospital
+     templateRef: raspberry-optimized-template

The downside is the additional overhead of the TemplateGitReference CRD pointing to the git repo with the templates. The advantage is that if the same template reference is used multiple times, upgrading to a new version needs to happen in just one place.

Currently, we're in favor of option no. 2.

Next, the various reference points for the templates; the preferred option is listed first:

  1. terraformer OpenTofu/Terraform templates: spec.nodepools.dynamic.[*].terraformRef, alternatively spec.providers.[*].terraformRef
  2. LB configs: spec.loadbalancers.roles.[*].lbconfigsRef, alternatively spec.loadbalancers.clusters.[*].lbconfigsRef
  3. ansibler playbooks: spec.nodepools.[dynamic|static].[*].playbooksRef and spec.loadbalancers.roles.[*].playbooksRef, alternatively spec.loadbalancers.clusters.[*].playbooksRef
  4. kuber manifests: spec.kubernetes.clusters.[*].addonsRef
  5. kubeEleven manifest: spec.kubernetes.clusters.[*].provisionRef
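Put together, the preferred reference points would sit in the InputManifest roughly as follows (a sketch of option 1 from each item above, not a final API; names reuse examples from earlier in this thread):

```yaml
spec:
  nodepools:
    dynamic:
      - name: compute-htz
        terraformRef: raspberry-optimized-template
  loadbalancers:
    roles:
      - name: apiserver
        lbconfigsRef: raspberry-optimized-template
        playbooksRef: raspberry-optimized-template
  kubernetes:
    clusters:
      - name: development
        addonsRef: raspberry-optimized-template
        provisionRef: raspberry-optimized-template
```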

It's expected that a single template reference will provide templates for both kuber and kubeEleven, because the structure of the git repo with templates will look like:

repo
├── addons
├── k8s
├── lbconfigs
├── playbooks
└── terraform

Next, special attention will need to be given to the kuber manifests (the addons folder), which will contain manifests for deploying and customizing Longhorn, Grafana, Prometheus, and other Claudie addons; but that will come in a different GH issue, which will be referenced here.

Feedback would be greatly welcome here.

bernardhalas avatar Jun 21 '25 13:06 bernardhalas

My opinion is that option 2 would be better here, as it extends a kind of behavior Claudie is already using with the external git template repo.

As for where the refs linking to this new CRD should be put in the InputManifest, I agree with the first option you mentioned for each. However, requiring those refs in every NodePool (for the ones concerned) could become tedious. To fix this, a Provider > Cluster > NodePool default-inheritance system could be implemented, where Cluster and Provider would also have fields for the refs that NodePools carry (terraform, ansible, etc.), and each level up would act as the default if the field isn't present in the level below.

This would mean that when NodePools belonging to a cluster don't have the ref but the cluster does, the NodePools without a ref of their own would inherit it from the cluster; the same applies one level up with providers.

It would also mean that someone wanting to use a single instance of this new CRD for a single provider in charge of a lot of clusters and pools would only have to set these fields once, at the provider level, and be done with it.
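The proposed inheritance could be sketched like this (hypothetical field placement; `company-default-templates` is an illustrative name, and resolution falls back from NodePool to Cluster to Provider):

```yaml
spec:
  providers:
    - name: genesiscloud
      templateRef: company-default-templates       # default for everything below
  kubernetes:
    clusters:
      - name: dev-cluster
        templateRef: raspberry-optimized-template  # overrides the provider-level default
        pools:
          control:
            - control-htz   # no ref of its own, inherits raspberry-optimized-template
```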

torsina avatar Jun 23 '25 21:06 torsina

While the proposed approach gives the most power and control to users, we feel it's quite complex from the UX perspective (e.g. a user might be confused about where a particular template is sourced from when it comes to hierarchical overrides).

Let's take it from the completely opposite angle: is there a use case for having multiple template locations per single InputManifest? Would the user need to refer to multiple sets of TemplateReferences in a single manifest, such that one reference would not suffice?

bernardhalas avatar Jul 04 '25 14:07 bernardhalas