postgres-operator

Affinity for Backups Jobs

Open maxsivkov opened this issue 2 years ago • 7 comments

Postgres Operator: 5.0.3

Title

Backup Jobs do not respect affinity settings in the repoHost section (should they?)

Description

In my config I have two nodes labeled with ckrole=db. The master and replica pods run on the db nodes, each on a different node (thanks to pod anti-affinity). The repo-host StatefulSet also scheduled its pod to a db node, but its backup Job did not:

NAME                                           READY   STATUS              RESTARTS   AGE   IP             NODE                            NOMINATED NODE   READINESS GATES
pod/postgres-cluster-lh-vl-backup-qgkj-fthws   0/1     ContainerCreating   0          62m   <none>         win-worker-01                   <none>           <none>

(We have one Windows worker in the cluster.)

Deployment

postgres/postgres.yaml file from the postgres-operator-examples repo

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: postgres-cluster-lh-vl
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-13.4-1
  postgresVersion: 13
  users:
  - name: postgres
  - name: testuser-cl1
    databases:
    - testuser_cl1_db
  - name: testuser-cl2
    databases:
    - testuser_cl2_db
  instances:
  - name: instance1
    replicas: 2
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
              - key: ckrole
                operator: In
                values:
                - db
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              postgres-operator.crunchydata.com/cluster: postgres-cluster-lh-vl
              postgres-operator.crunchydata.com/instance-set: instance1
    dataVolumeClaimSpec:
      accessModes:
      - "ReadWriteOnce"
      volumeMode: Filesystem
      storageClassName: longhorn-cluster
      resources:
        requests:
          storage: 1Gi
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.35-0
      repoHost:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                  - key: ckrole
                    operator: In
                    values:
                    - db
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            volumeMode: Filesystem
            storageClassName: longhorn-cluster
            resources:
              requests:
                storage: 1Gi

I didn't find affinity settings for the backup jobs, so I think they should respect repoHost settings. Should they?

maxsivkov, Nov 03 '21 16:11

Backups are taken from the primary, so it is generally better for the backup Job to be a bit closer to the primary host.

The repoHost affinity rules are specifically for the repository, not any of the Jobs.

That all said, we do have it in our roadmap to add affinity rules for Jobs around the backup system.

jkatz, Nov 03 '21 18:11

> That all said, we do have it in our roadmap to add affinity rules for Jobs around the backup system.

Thanks for the response. Do you have an estimate for when this enhancement will be available?

It seems that for now the only way out is to taint the Windows nodes...

maxsivkov, Nov 04 '21 13:11
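
The taint workaround mentioned above could look roughly like the sketch below. The node name comes from the pod listing in the issue; the taint key and value (`os=windows`) are hypothetical and chosen only for illustration. In practice the same effect is usually achieved with `kubectl taint nodes win-worker-01 os=windows:NoSchedule` rather than by editing the Node object.

```yaml
# Sketch only: a NoSchedule taint on the Windows worker.
# The key/value pair "os=windows" is a hypothetical choice.
apiVersion: v1
kind: Node
metadata:
  name: win-worker-01
spec:
  taints:
  - key: os
    value: windows
    effect: NoSchedule
```

Pods that carry no matching toleration, which would include the backup Job pod, are then not scheduled onto that node.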

Restore already supports affinity and tolerations. It would be good if scheduled and manual backup Jobs supported them too, so we can schedule them on infra nodes while the db resides on worker nodes, whose resources are more tightly budgeted.

QuingKhaos, May 16 '22 09:05

> Restore already supports affinity and tolerations. It would be good if scheduled and manual backup Jobs supported them too, so we can schedule them on infra nodes while the db resides on worker nodes, whose resources are more tightly budgeted.

especially if the volume holding the data for the database is only available on one node (where the database is actually running). So the backup job MUST run on the same node as the db, as otherwise the backup will fail!

When will the backup affinity setting be available?

cdaller, May 23 '22 09:05
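
For the "same node as the db" case described above, a pod affinity rule is one way to express the constraint. The fragment below is a sketch under two assumptions that are not confirmed in this thread: that backup Jobs accept an `affinity` block under `spec.backups.pgbackrest.jobs`, and that the primary pod carries the label `postgres-operator.crunchydata.com/role: master`. Check both against the CRD reference for your operator version.

```yaml
# Sketch only: co-locate the backup Job pod with the primary pod.
# Field path (jobs.affinity) and the role label value are assumptions.
spec:
  backups:
    pgbackrest:
      jobs:
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: postgres-cluster-lh-vl
                  postgres-operator.crunchydata.com/role: master
```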

Is there any indication of when the 5.2.0 version of the operator will be pushed to OperatorHub?

https://operatorhub.io/operator/postgresql

At the moment my backups get scheduled to arm64 nodes (hybrid cluster with arm64 + amd64 nodes) and fail, so this feature would solve my problem and be much appreciated.

darktempla, Sep 09 '22 17:09

@darktempla In case you missed it, Crunchy PGO 5.2.0 is now available on OperatorHub.

tjmoore4, Sep 09 '22 21:09

> @darktempla In case you missed it, Crunchy PGO 5.2.0 is now available on OperatorHub.

@tjmoore4 - It literally must have dropped just after I checked and before I commented on this issue ;) otherwise a webpage issue got the better of me. Happy chappy, thanks for letting me know. I will take it for a spin.

darktempla, Sep 11 '22 13:09

@maxsivkov Affinity for backup Jobs has been added with #3260, so closing this issue.

tjmoore4, Sep 30 '22 14:09
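
With the feature referenced in the closing comment, the original request (keep backup Jobs on the `ckrole=db` nodes) could be expressed roughly as follows. This is a sketch, assuming the affinity block for backup Jobs lives under `spec.backups.pgbackrest.jobs.affinity`; the thread does not state the exact field path, so verify it against the CRD reference for your operator version. Only the relevant fragment of the PostgresCluster manifest is shown; the other fields from the original manifest are omitted.

```yaml
# Sketch only: node affinity for pgBackRest backup Jobs.
# The field path (jobs.affinity) is an assumption; the ckrole=db
# label is the one used in the original issue.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: postgres-cluster-lh-vl
spec:
  backups:
    pgbackrest:
      jobs:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: ckrole
                  operator: In
                  values:
                  - db
```

For the mixed arm64/amd64 cluster mentioned earlier in the thread, the same shape should work with the standard node label `kubernetes.io/arch` as the key and `amd64` as the value.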