
migrate storage from premium to standard for jenkins-infra, jenkins-weekly and jenkins-release

Open smerle33 opened this issue 2 years ago • 4 comments

Service(s)

infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io

Summary

As checked with the metrics, standard SSD ZRS disks will be enough to handle the workload for those 3 controllers; let's try to save some money.

This will be the occasion to handle the volumes/disks with Terraform and remove the Datasource annotation from the Helm chart values for the controllers.

We will need to create a new Storage Class (on publick8s and privatek8s).
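
For reference, a minimal sketch of what such a storage class could look like with the Azure Disk CSI driver (the class name is illustrative, not necessarily the one actually deployed):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-standard-zrs # illustrative name
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS # zone-redundant standard SSD managed disks
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer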

Sidenote: we will have to handle the bootstrap permissions for the Terraform-managed volumes.
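
For instance (a sketch, not necessarily the exact mechanism retained), the pods consuming a Terraform-created disk can rely on a pod security context so the freshly formatted volume is writable by the Jenkins user, as the migration pods below do with uid/gid 1000:

securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000 # kubelet chowns the mounted volume to this group at mount time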

Reproduction steps

No response

smerle33 avatar Apr 16 '24 13:04 smerle33

WIP (infra and release)

(7 screenshots attached, captured on 2024-04-11 and 2024-04-15.)

smerle33 avatar Apr 16 '24 13:04 smerle33

  • [x] Add a new storage class for standard ssd ZRS on private and public - https://github.com/jenkins-infra/azure/pull/672

WEEKLY.CI first

  • [x] take a snapshot for safety - jenkins-weekly-snapshot-20240515-0907
  • [x] add PV/PVC/DISK for weekly.ci with terraform (https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/persistent_volume and https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/persistent_volume_claim). ~~only creating the PVC should be enough: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic~~ see https://github.com/jenkins-infra/helpdesk/issues/4044#issuecomment-2121849985
  • [x] create a temporary pod with both the new PVC as RW and the Weekly PVC as ReadOnly
  • [x] do a timed rsync test between disks

once the rsync time is acceptable:

  • [x] take a new snapshot for security (remove the one done above)
  • [x] disable kubernetes-management builds
  • [x] remove the weekly statefulset
  • [x] on the temporary pod re-run the rsync (should be fast)
  • [x] change the Helm values for weekly to use the new PVC: set persistence.existingClaim to the new PVC name (see the values sketch after this list)
  • [x] enable kubernetes-management builds
  • [x] merge the PR (changing the chart)
  • [x] check that the pod starts with the correct PV/PVC
  • [x] remove the OLD PV/PVC/DISK (keep the snapshot for a few days)
  • [x] remove the snapshot
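
For illustration, the kind of change this implies in the weekly controller Helm values (a sketch, assuming the new Terraform-managed claim is named jenkins-weekly-data, as it is later in this issue):

persistence:
  existingClaim: jenkins-weekly-data # the new Terraform-managed PVC instead of the dynamically provisioned one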

if all goes well, redo for infra.ci/release.ci

smerle33 avatar Apr 19 '24 07:04 smerle33

Update:

  • New storage classes added on the 2 clusters in https://github.com/jenkins-infra/azure/pull/672
  • weekly.ci planned to be migrated next week (Monday 29/Tuesday 30)

dduportal avatar Apr 24 '24 06:04 dduportal

Update: on hold until after the 15th of May 2024

dduportal avatar May 07 '24 13:05 dduportal

The aim is to be able to change the disk type without recreating everything next time. We chose to create the PV/PVC/Disk from Terraform instead of just the PVC (so not following "only creating the PVC should be enough": kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic). The disk size can be changed in both scenarios.
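
For illustration, the Terraform-managed trio is roughly equivalent to pre-creating an Azure managed disk plus the following Kubernetes objects (disk resource ID, size and class name are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins-weekly-pv
spec:
  capacity:
    storage: 8Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi-standard-zrs # illustrative class name
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/jenkins-weekly # Azure disk resource ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-weekly-data
  namespace: jenkins-weekly
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi-standard-zrs
  volumeName: jenkins-weekly-pv # bind explicitly to the pre-created PV
  resources:
    requests:
      storage: 8Gi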

smerle33 avatar May 21 '24 06:05 smerle33

Current state of the temporary migration pod test:

Events:
    Type     Reason                  Age    From                     Message
    Normal   Scheduled               2m11s  default-scheduler        Successfully assigned jenkins-weekly/migrate-volume to aks-arm64small2-30051376-vmss00001i
    Warning  FailedAttachVolume      2m11s  attachdetach-controller  Multi-Attach error for volume "pvc-<redacted>" Volume is already used by pod(s) jenkins-weekly-0
    Normal   SuccessfulAttachVolume  119s   attachdetach-controller  AttachVolume.Attach succeeded for volume "jenkins-weekly-pv"
    Warning  FailedMount             8s     kubelet                  Unable to attach or mount volumes: unmounted volumes=[jenkins-home-source], unattached volumes=[jenkins-home-source], failed to process volumes=[] timed out waiting for the condition

infos: https://medium.com/@golusstyle/demystifying-the-multi-attach-error-for-volume-causes-and-solutions-595a19316a0c

EDIT: we may need to change the current PVC to RWX (ReadWriteMany) to be able to mount it on a second pod for the migration

smerle33 avatar Jun 05 '24 16:06 smerle33

I tried with pod affinity:

affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: app.kubernetes.io/instance   

without any luck

0/8 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match pod affinity rules, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..
pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod affinity rules

will try with node selector directly

smerle33 avatar Jun 10 '24 12:06 smerle33

(quoting the pod affinity attempt and scheduling error above)

Don't forget that, in any case, you need the tolerations to schedule on arm64 nodes such as this one (ref. https://github.com/jenkins-infra/kubernetes-management/blob/899229e1620277d3750ed261417703a073a4736d/config/jenkins_weekly.ci.jenkins.io.yaml#L29-L35), which might explain why pod affinity was necessary but not sufficient.

dduportal avatar Jun 10 '24 13:06 dduportal

temporary pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  containers:
  - image: debian
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-a", "/var/jenkins_home", "/mnt/"]
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
    resources:
      requests:
        memory: "1Gi"
        cpu: "1000m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: app.kubernetes.io/instance
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-weekly
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-weekly-data

first try:

  • forcing the temporary pod onto the same node as jenkins-weekly via pod affinity fails with:
0/8 nodes are available: 1 Insufficient cpu, 2 node(s) had untolerated taint {CriticalAddonsOnly:
  true}, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..

so I thought the "1 Insufficient cpu" was the node hosting jenkins-weekly

second try:

  • having the temporary pod pending with affinity
  • delete the jenkins-weekly pod to watch whether a new node gets spawned and both pods start on it

The jenkins-weekly pod stayed on the same node.

third try:

  • having the temporary pod pending with affinity
  • manually spawn a new arm node in the node pool
  • delete the jenkins-weekly pod to watch whether both pods start on the new node

The jenkins-weekly pod started on the new node (aks-arm64small2-30051376-vmss00001p/10.245.0.13), but the temporary pod stayed pending with:

0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
  true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
  match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
  not helpful for scheduling..

smerle33 avatar Jun 26 '24 13:06 smerle33

New try:

- manually spawn a new node in the armsmall nodepool
- delete the jenkins-weekly pod to see if it starts again on the new node
- start the temporary pod once weekly is on the new node.

smerle33 avatar Jun 26 '24 15:06 smerle33

(same plan as in the previous comment)

still the same behavior:

  0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
  true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
  match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
  not helpful for scheduling..

smerle33 avatar Jun 26 '24 15:06 smerle33

Found it: the affinity was WRONG. topologyKey needs to be kubernetes.io/hostname, not app.kubernetes.io/instance, since it refers to a node label:

  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: kubernetes.io/hostname

smerle33 avatar Jun 26 '24 16:06 smerle33

So the process to follow is :

  • spawn a new compatible node for both pods (arm64)
  • delete the jenkins-weekly pod and check that it re-spawns on the new node
  • create the temporary pod, which will land on the new node thanks to the affinity

smerle33 avatar Jun 27 '24 12:06 smerle33

When removing the resource requests, the temporary pod can be scheduled on the existing node already hosting jenkins-weekly; no need to use a brand new node 🎉

    resources:
      requests:
        memory: "1Gi"
        cpu: "1000m"
      limits:
        memory: "1Gi"
        cpu: "1000m"

smerle33 avatar Jun 27 '24 14:06 smerle33

The final version of the migration pod for jenkins-weekly rsyncs data

  • from /var/jenkins_home on PVC jenkins-weekly
  • to /mnt/ on PVC jenkins-weekly-data

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
  - image: jenkinsciinfra/packaging:latest
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-a", "--delete", "/var/jenkins_home", "/mnt/"] #will create the destination folder within /mnt/
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: kubernetes.io/hostname
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-weekly
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-weekly-data

smerle33 avatar Jun 28 '24 13:06 smerle33

jenkins-weekly cleanup

Premium disk removed; PV pvc-d6f52829-b9a9-4d0e-bd38-777005a77d08 removed.

Old snapshot jenkins-weekly-snapshot-20240515-0907 removed; new snapshot done, to be removed later.

smerle33 avatar Jul 03 '24 09:07 smerle33

migration for release.ci

  • [x] create PV/PVC/DISK for release.ci on privatek8s - https://github.com/jenkins-infra/azure/pull/768
  • [x] run migration pod to prepare new disk - details
  • [x] make a snapshot of old volume (premium) - Snapshot.backup-release-ci-20240703-1120-20240703112120
  • [x] prepare PR of migration to have the checks - https://github.com/jenkins-infra/kubernetes-management/pull/5379
  • [x] prepare release.ci for shutdown to lock new builds
  • [x] when all builds are done
  • [x] disable kubernetes management
  • [x] scale the StatefulSet down to 0 (to lock jenkins_home)
  • [x] launch migration pod again (should be fast)
  • [x] merge PR of migration - https://github.com/jenkins-infra/kubernetes-management/pull/5379
  • [x] enable kubernetes management (launch build) - https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/job/main/37565/pipeline-graph/
  • [x] check release.ci use new PV/PVC/DISK

--- something was wrong in the UI (missing data), SEE POSTMORTEM below

  • [x] cleanup old volume/PV (24h later) - pvc-4d27fa9e-2d4f-4a44-88fa-862996ca3706
  • [x] cleanup snapshot (1 week later) -

------------------------------ POSTMORTEM: the rsync command was wrong:

command: ["rsync"]
    args: ["-v", "-a", "--delete", "/var/jenkins_home", "/mnt/"] #will create the destination folder within /mnt/

We don't want the jenkins_home folder to be created inside /mnt/: without a trailing slash rsync copies the source directory itself (creating /mnt/jenkins_home), whereas with a trailing slash it copies only the directory's contents into /mnt/. The correct command is:

command: ["rsync"]
    args: ["-v", "-a", "--delete", "/var/jenkins_home/", "/mnt/"]

smerle33 avatar Jul 03 '24 09:07 smerle33

pod template for migration for jenkins-release

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-release
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
  - image: jenkinsciinfra/packaging:latest
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-v", "-a", "--delete", "/var/jenkins_home/", "/mnt/"] #will create the destination folder within /mnt/
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
    kubernetes.azure.com/agentpool: releacictrl
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
    - key: "jenkins"
      operator: "Equal"
      value: "release.ci.jenkins.io"
      effect: "NoSchedule"
    - key: "jenkins-component"
      operator: "Equal"
      value: "controller"
      effect: "NoSchedule"
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-release
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-release-data

smerle33 avatar Jul 03 '24 09:07 smerle33

migration for infra.ci

  • [x] create PV/PVC/DISK for infra.ci on privatek8s - https://github.com/jenkins-infra/azure/pull/771
  • [x] run migration pod to prepare new disk - details
  • [x] make a snapshot of old volume (premium) - jenkins-infra-data-20240704-15h30Z
  • [x] prepare PR of migration to have the checks - https://github.com/jenkins-infra/kubernetes-management/pull/5388
  • [x] prepare infra.ci for shutdown to lock new builds
  • [x] when all builds are done
  • [x] disable kubernetes management
  • [x] scale the StatefulSet down to 0 (to lock jenkins_home)
  • [x] launch migration pod again (should be fast)
  • [x] merge PR of migration -
  • [x] enable kubernetes management (launch build) -
  • [x] check infra.ci use new PV/PVC/DISK
  • [x] cleanup old volume/PV (24h later) -
  • [x] cleanup snapshot (1 week later) -

smerle33 avatar Jul 04 '24 08:07 smerle33

pod template for migration for jenkins-infra

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-infra
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
  - image: jenkinsciinfra/packaging:latest
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-v", "-a", "--delete", "/var/jenkins_home/", "/mnt/"]
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
    kubernetes.azure.com/agentpool: infracictrl
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
    - key: "jenkins"
      operator: "Equal"
      value: "infra.ci.jenkins.io"
      effect: "NoSchedule"
    - key: "jenkins-component"
      operator: "Equal"
      value: "controller"
      effect: "NoSchedule"
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-infra
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-infra-data

smerle33 avatar Jul 04 '24 08:07 smerle33

Update:

  • Cleaned left over PVCs (unbound ones)
  • No more PV to clean up
  • Cleaned up left over disks (unmounted ones)
  • Cleaned up snapshots (unattached ones)

dduportal avatar Jul 09 '24 11:07 dduportal