Add affinity rules feature gate on job resource

Open blong14 opened this issue 4 years ago • 4 comments

This PR adds:

  • Feature gate check for affinity rules on job resource

Why:

  • While testing the operator in a home lab setup (a 4-node k3s cluster, 3 nodes of which are Raspberry Pis), the version check job did not respect the affinity rules and was scheduled on the control plane node. This caused issues because I'm running a custom build of CockroachDB for ARM. Adding the feature gate check allows the version check container to be scheduled on one of the Raspberry Pi worker nodes.

Notes:

  • I did try to add a test but ran into issues getting Bazel to recognize the new test. I'm happy to keep looking into that, but would love a nudge here and there on how to do it properly.
  • I have done my best to tear down the cluster and reinstall to make sure this does what I think it should, mainly by using kubectl get pod <cockroach v check container> -o yaml to confirm that the affinity rules are specified.
  • I also recognize that my setup isn't currently supported, so there may be reasons this change shouldn't be added. Looking forward to the conversation nonetheless.

Below is my cluster.yaml

apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  # this translates to the name of the statefulset that is created
  name: cockroachdb
spec:
  dataStore:
    pvc:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "60Gi"
        volumeMode: Filesystem
  resources:
    requests:
      #cpu: "2"
      memory: "1Gi"
    limits:
      #cpu: "2"
      memory: "1Gi"
  tlsEnabled: true
# You can set either a version of the db or a specific image name
# cockroachDBVersion: v21.1.5
  image:
    name: blong14/cockroachdb:v20.2.2
    pullPolicy: Always
  # nodes refers to the number of crdb pods that are created
  # via the statefulset 
  nodes: 3
  # affinity is a new API field that is behind a feature gate that is
  # disabled by default. To enable it, see operator.yaml.
  # The affinity field will accept any podSpec affinity rule.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-01
            - worker-02
            - worker-03

blong14 avatar Aug 07 '21 21:08 blong14

CLA assistant check
All committers have signed the CLA.

cockroach-teamcity avatar Aug 07 '21 21:08 cockroach-teamcity

Hi, thank you for the PR. I'm a bit surprised about the jobs scheduling on the control plane. Do you have some added details? I'm worried this might be a more general problem. I just want to check on when we generate jobs and if adding the same affinity rules could cause issues.

udnay avatar Aug 09 '21 22:08 udnay

Hi, sorry, I'll clean up some of my bad terminology by describing what I was seeing differently.

I have 4 schedulable nodes. After setting the node affinity rules in the CrdbCluster definition, I noticed that the version check pod was scheduled on a node outside the cluster's affinity match, while the actual database pods were scheduled correctly. I dug in a little and noticed that the job resource didn't set any affinity rules when creating the version check pod.

This fix, if enabled, will pull the affinity rules off the cluster definition and will use those rules for all pods created by the operator. I think that brings up an interesting question. Should the job resource always default to the cluster affinity rules or have its own configuration?
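As a rough illustration of the behavior described above, here is a minimal Go sketch. It is not the operator's actual code: `Affinity`, `ClusterSpec`, `PodSpec`, and `applyJobAffinity` are hypothetical stand-ins (the real operator would work with the Kubernetes `corev1` types), and the gate is passed in as a plain boolean.

```go
package main

import "fmt"

// Affinity is a hypothetical, simplified stand-in for a pod affinity rule.
type Affinity struct {
	RequiredHostnames []string
}

// ClusterSpec is a hypothetical stand-in for the CrdbCluster spec.
type ClusterSpec struct {
	Affinity *Affinity
}

// PodSpec is a hypothetical stand-in for the version check job's pod spec.
type PodSpec struct {
	Affinity *Affinity
}

// applyJobAffinity copies the cluster's affinity rules onto the job's pod
// spec, but only when the feature gate is enabled and rules are present.
func applyJobAffinity(gateEnabled bool, cluster ClusterSpec, pod *PodSpec) {
	if !gateEnabled || cluster.Affinity == nil {
		return // gate off or nothing to copy: leave the pod spec untouched
	}
	// Copy the slice so later changes to the cluster spec do not leak into the pod.
	hosts := append([]string(nil), cluster.Affinity.RequiredHostnames...)
	pod.Affinity = &Affinity{RequiredHostnames: hosts}
}

func main() {
	cluster := ClusterSpec{Affinity: &Affinity{
		RequiredHostnames: []string{"worker-01", "worker-02", "worker-03"},
	}}

	var pod PodSpec
	applyJobAffinity(true, cluster, &pod)
	fmt.Println(pod.Affinity.RequiredHostnames)
}
```

With the gate disabled, the job pod keeps whatever affinity it had before (none, in this sketch), which matches the scheduling behavior I was originally seeing.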

benlong-transloc avatar Aug 10 '21 14:08 benlong-transloc

Maybe we should add a feature gate around the job affinity rules, so that people are not surprised when they enable affinity rules for CRDB pods.
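One way to picture a separate gate is the Go sketch below. It assumes gates arrive as a comma-separated `Name=bool` string; `parseGates` is a hypothetical helper, and `JobAffinityRules` is a made-up gate name used only for illustration, not an existing operator gate.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGates turns a string such as "AffinityRules=true,JobAffinityRules=false"
// into a map of gate name to enabled state. Hypothetical helper for illustration.
func parseGates(s string) (map[string]bool, error) {
	gates := map[string]bool{}
	for _, pair := range strings.Split(s, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue // tolerate empty input and trailing commas
		}
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(parts[1]))
		if err != nil {
			return nil, fmt.Errorf("invalid value in %q: %w", pair, err)
		}
		gates[strings.TrimSpace(parts[0])] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseGates("AffinityRules=true,JobAffinityRules=false")
	if err != nil {
		panic(err)
	}
	// CRDB pods and job pods are controlled independently; a missing gate reads as false.
	fmt.Println("pods:", gates["AffinityRules"], "jobs:", gates["JobAffinityRules"])
}
```

Keeping the two gates independent means that enabling affinity for the CRDB pods does not silently change where job pods are scheduled.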

udnay avatar Aug 11 '21 15:08 udnay