Migrate storage from premium to standard for jenkins-infra, jenkins-weekly and jenkins-release
Service(s)
infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io
Summary
After checking the metrics, standard ZRS HDD will be enough to handle the workload for those 3 controllers. Let's try to save some money.
This will also be the occasion to manage the volumes/disks with Terraform and remove the Datasource annotation from the helm chart values for the controllers.
We will need to create a new Storage Class (on publick8s and privatek8s).
Sidenote: we will have to handle the bootstrap permissions for the Terraform-managed volumes.
Reproduction steps
No response
WIP (infra and release)
- [x] Add a new storage class for standard ssd ZRS on private and public - https://github.com/jenkins-infra/azure/pull/672
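For reference, such a storage class for the Azure Disk CSI driver might look like the following sketch (the class name and exact parameter set here are assumptions, not the manifest merged in the PR):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-standard-ssd-zrs # hypothetical name
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS # zone-redundant Standard SSD
reclaimPolicy: Retain # keep the Azure disk when the PV is deleted
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```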
WEEKLY.CI first
- [x] take a snapshot for safety - jenkins-weekly-snapshot-20240515-0907
- [x] add PV/PVC/DISK for weekly.ci with terraform (https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/persistent_volume and https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/persistent_volume_claim). ~~only creating the PVC should be enough: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic~~ see https://github.com/jenkins-infra/helpdesk/issues/4044#issuecomment-2121849985
- [x] create a temporary pod with both the new PVC as RW and the Weekly PVC as ReadOnly
- [x] do a timed rsync test between disk
when the rsync time is acceptable:
- [x] take a new snapshot for security (remove the one done above)
- [x] disable kubernetes-management builds
- [x] remove the weekly statefulset
- [x] on the temporary pod re-run the rsync (should be fast)
- [x] change the helm values for weekly to use the new PVC: set persistence.existingClaim to PVC_NAME
- [x] enable kubernetes-management builds
- [x] merge the PR (changing the chart)
- [x] check that the pod starts with the correct PV/PVC
- [x] remove the OLD PV/PVC/DISK (keep the snapshot for a few days)
- [x] remove the snapshot
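The helm values change in that step boils down to pointing the chart at the pre-created claim, roughly like this (the claim name is illustrative):

```yaml
persistence:
  existingClaim: jenkins-weekly-data # pre-created, Terraform-managed PVC
```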
If all goes well, redo for infra.ci/release.ci.
Update:
- New storage classes added on the 2 clusters in https://github.com/jenkins-infra/azure/pull/672
- weekly.ci planned to be migrated next week (Monday 29/Tuesday 30)
Update: on hold until after the 15th of May 2024
The aim is to be able to change the disk type without recreating everything next time. For that, we chose to create the PV/PVC/Disk from Terraform instead of just the PVC (so not following "only creating the PVC should be enough": kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic). The disk size can be changed in both scenarios.
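As a sketch of that choice (resource names, sizes, location and resource group are illustrative assumptions, not the actual jenkins-infra/azure code), the disk, PV and PVC can be chained in Terraform roughly like this:

```hcl
resource "azurerm_managed_disk" "jenkins_weekly" {
  name                 = "jenkins-weekly-data" # hypothetical
  location             = "East US 2"
  resource_group_name  = "example-rg"
  storage_account_type = "StandardSSD_ZRS"
  create_option        = "Empty"
  disk_size_gb         = 8 # can be grown later without recreating anything
}

resource "kubernetes_persistent_volume" "jenkins_weekly" {
  metadata {
    name = "jenkins-weekly-pv"
  }
  spec {
    capacity = {
      storage = "8Gi"
    }
    access_modes                     = ["ReadWriteOnce"]
    storage_class_name               = "managed-csi-standard-ssd-zrs" # hypothetical
    persistent_volume_reclaim_policy = "Retain"
    persistent_volume_source {
      csi {
        driver        = "disk.csi.azure.com"
        volume_handle = azurerm_managed_disk.jenkins_weekly.id
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "jenkins_weekly" {
  metadata {
    name      = "jenkins-weekly-data"
    namespace = "jenkins-weekly"
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "managed-csi-standard-ssd-zrs" # hypothetical
    volume_name        = kubernetes_persistent_volume.jenkins_weekly.metadata[0].name
    resources {
      requests = {
        storage = "8Gi"
      }
    }
  }
}
```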
current state of test of the temporary migration pod:
Events:
Type Reason Age From Message
Normal Scheduled 2m11s default-scheduler Successfully assigned jenkins-weekly/migrate-volume to aks-arm64small2-30051376-vmss00001i
Warning FailedAttachVolume 2m11s attachdetach-controller Multi-Attach error for volume "pvc-<redacted>" Volume is already used by pod(s) jenkins-weekly-0
Normal SuccessfulAttachVolume 119s attachdetach-controller AttachVolume.Attach succeeded for volume "jenkins-weekly-pv"
Warning FailedMount 8s kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home-source], unattached volumes=[jenkins-home-source], failed to process volumes=[] timed out waiting for the condition
infos: https://medium.com/@golusstyle/demystifying-the-multi-attach-error-for-volume-causes-and-solutions-595a19316a0c
EDIT :
may need to change the current PVC to RWX (ReadWriteMany) to be able to mount it on a second pod for migration
I tried with pod affinity:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
            - key: app.kubernetes.io/instance
              operator: In
              values:
                - jenkins-weekly
        topologyKey: app.kubernetes.io/instance
without any luck
0/8 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match pod affinity rules, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..
pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod affinity rules
will try with node selector directly
Don't forget that, in any case, you need the tolerations to schedule on arm64 nodes such as this one (ref. https://github.com/jenkins-infra/kubernetes-management/blob/899229e1620277d3750ed261417703a073a4736d/config/jenkins_weekly.ci.jenkins.io.yaml#L29-L35), which might explain why pod affinity was necessary but not sufficient.
temporary pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  namespace: jenkins-weekly
  labels:
    name: migrate-volume
spec:
  containers:
    - image: debian
      name: migrate-volume-script
      command: ["rsync"]
      args: ["-a", "/var/jenkins_home", "/mnt/"]
      volumeMounts:
        - mountPath: /var/jenkins_home
          name: jenkins-home-source
          readOnly: true
        - mountPath: /mnt
          name: jenkins-home-destination
      resources:
        requests:
          memory: "1Gi"
          cpu: "1000m"
        limits:
          memory: "1Gi"
          cpu: "1000m"
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - jenkins-weekly
          topologyKey: app.kubernetes.io/instance
  restartPolicy: Never
  volumes:
    - name: jenkins-home-source
      persistentVolumeClaim:
        claimName: jenkins-weekly
    - name: jenkins-home-destination
      persistentVolumeClaim:
        claimName: jenkins-weekly-data
First try:
- forcing the temporary pod on the same node as jenkins-weekly with affinity fails with:
0/8 nodes are available: 1 Insufficient cpu, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..
so I thought the "1 Insufficient cpu" was the node hosting jenkins-weekly
Second try:
- have the temporary pod pending with affinity
- delete the jenkins-weekly pod to watch if it spawns a new node and starts both pods on the new node
The jenkins-weekly pod stayed on the same node.
Third try:
- have the temporary pod pending with affinity
- manually spawn a new arm64 node in the node pool
- delete the jenkins-weekly pod to watch if both pods start on the new node
The jenkins-weekly pod started on the new node: aks-arm64small2-30051376-vmss00001p/10.245.0.13
But the temporary pod stayed pending with :
0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
not helpful for scheduling..
New try:
- manually spawn a new node in the armsmall node pool
- delete the jenkins-weekly pod to see if it starts again on the new node
- start the temporary pod once weekly is on the new node.
still the same behavior:
0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
not helpful for scheduling..
Found it: the affinity was WRONG. topologyKey needs to be kubernetes.io/hostname, not app.kubernetes.io/instance, as it refers to the node:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/instance
              operator: In
              values:
                - jenkins-weekly
        topologyKey: kubernetes.io/hostname
So the process to follow is:
- spawn a new compatible node for both pods (arm64)
- delete the jenkins-weekly pod and check that it re-spawns on the new node
- create the temporary pod, which will spawn on the new node because of the affinity
When removing the resources requests, the temporary pod can be scheduled on the existing node of jenkins-weekly; no need to use a brand new node 🎉
resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
The final version of the migration pod for jenkins-weekly will rsync data:
- from /var/jenkins_home on PVC jenkins-weekly
- to /mnt/ on PVC jenkins-weekly-data
apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  namespace: jenkins-weekly
  labels:
    name: migrate-volume
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - image: jenkinsciinfra/packaging:latest
      name: migrate-volume-script
      command: ["rsync"]
      args: ["-a", "--delete", "/var/jenkins_home", "/mnt/"] # will create the destination folder within /mnt/
      volumeMounts:
        - mountPath: /var/jenkins_home
          name: jenkins-home-source
          readOnly: true
        - mountPath: /mnt
          name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - jenkins-weekly
          topologyKey: kubernetes.io/hostname
  restartPolicy: Never
  volumes:
    - name: jenkins-home-source
      persistentVolumeClaim:
        claimName: jenkins-weekly
    - name: jenkins-home-destination
      persistentVolumeClaim:
        claimName: jenkins-weekly-data
jenkins-weekly cleanup
- [x] premium disk removed - pvc-d6f52829-b9a9-4d0e-bd38-777005a77d08
- [x] PV removed
- [x] old snapshot jenkins-weekly-snapshot-20240515-0907 removed
- [x] new snapshot done, to remove later
migration for release.ci
- [x] create PV/PVC/DISK for release.ci on privatek8s - https://github.com/jenkins-infra/azure/pull/768
- [x] run migration pod to prepare new disk - details
- [x] make a snapshot of old volume (premium) - Snapshot.backup-release-ci-20240703-1120-20240703112120
- [x] prepare PR of migration to have the checks - https://github.com/jenkins-infra/kubernetes-management/pull/5379
- [x] prepare release.ci for shutdown to lock new builds
- [x] when builds all done
- [x] disable kubernetes management
- [x] scaleset to 0 (to lock jenkins_home)
- [x] launch migration pod again (should be fast)
- [x] merge PR of migration - https://github.com/jenkins-infra/kubernetes-management/pull/5379
- [x] enable kubernetes management (launch build) - https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/job/main/37565/pipeline-graph/
- [x] check release.ci use new PV/PVC/DISK
--- something went wrong in the UI (missing data), SEE POSTMORTEM below
- [x] cleanup old volume/PV (24h later) - pvc-4d27fa9e-2d4f-4a44-88fa-862996ca3706
- [x] cleanup snapshot (1 week later) -
------------------------------ POSTMORTEM
The rsync was wrong:
command: ["rsync"]
args: ["-v", "-a", "--delete", "/var/jenkins_home", "/mnt/"] # will create the destination folder within /mnt/
We don't need the jenkins_home folder to be created in /mnt/; the correct command is:
command: ["rsync"]
args: ["-v", "-a", "--delete", "/var/jenkins_home/", "/mnt/"]
pod template for migration for jenkins-release
apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  namespace: jenkins-release
  labels:
    name: migrate-volume
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - image: jenkinsciinfra/packaging:latest
      name: migrate-volume-script
      command: ["rsync"]
      args: ["-v", "-a", "--delete", "/var/jenkins_home/", "/mnt/"] # trailing slash: copy the contents of jenkins_home into /mnt/
      volumeMounts:
        - mountPath: /var/jenkins_home
          name: jenkins-home-source
          readOnly: true
        - mountPath: /mnt
          name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
    kubernetes.azure.com/agentpool: releacictrl
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
    - key: "jenkins"
      operator: "Equal"
      value: "release.ci.jenkins.io"
      effect: "NoSchedule"
    - key: "jenkins-component"
      operator: "Equal"
      value: "controller"
      effect: "NoSchedule"
  restartPolicy: Never
  volumes:
    - name: jenkins-home-source
      persistentVolumeClaim:
        claimName: jenkins-release
    - name: jenkins-home-destination
      persistentVolumeClaim:
        claimName: jenkins-release-data
migration for infra.ci
- [x] create PV/PVC/DISK for infra.ci on privatek8s - https://github.com/jenkins-infra/azure/pull/771
- [x] run migration pod to prepare new disk - details
- [x] make a snapshot of old volume (premium) - jenkins-infra-data-20240704-15h30Z
- [x] prepare PR of migration to have the checks - https://github.com/jenkins-infra/kubernetes-management/pull/5388
- [x] prepare infra.ci for shutdown to lock new builds
- [x] when builds all done
- [x] disable kubernetes management
- [x] scaleset to 0 (to lock jenkins_home)
- [x] launch migration pod again (should be fast)
- [x] merge PR of migration -
- [x] enable kubernetes management (launch build) -
- [x] check infra.ci use new PV/PVC/DISK
- [x] cleanup old volume/PV (24h later) -
- [x] cleanup snapshot (1 week later) -
pod template for migration for jenkins-infra
apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  namespace: jenkins-infra
  labels:
    name: migrate-volume
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - image: jenkinsciinfra/packaging:latest
      name: migrate-volume-script
      command: ["rsync"]
      args: ["-v", "-a", "--delete", "/var/jenkins_home/", "/mnt/"]
      volumeMounts:
        - mountPath: /var/jenkins_home
          name: jenkins-home-source
          readOnly: true
        - mountPath: /mnt
          name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
    kubernetes.azure.com/agentpool: infracictrl
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
    - key: "jenkins"
      operator: "Equal"
      value: "infra.ci.jenkins.io"
      effect: "NoSchedule"
    - key: "jenkins-component"
      operator: "Equal"
      value: "controller"
      effect: "NoSchedule"
  restartPolicy: Never
  volumes:
    - name: jenkins-home-source
      persistentVolumeClaim:
        claimName: jenkins-infra
    - name: jenkins-home-destination
      persistentVolumeClaim:
        claimName: jenkins-infra-data
Update:
- Cleaned left over PVCs (unbound ones)
- No more PV to clean up
- Cleaned up left over disks (unmounted ones)
- Cleaned up snapshots (unattached ones)