cockroach-operator icon indicating copy to clipboard operation
cockroach-operator copied to clipboard

Version check job does not inherit tolerations and nodeSelector from CrdbCluster

Open kuznero opened this issue 3 years ago • 1 comments

I am running v2.4.0 of the operator installed like this:

  • crds v2.4.0 without any modifications
  • operator v2.4.0 with the following modifications:
    • added linkerd.io/inject: enabled to a namespace annotation
    • changed namespace: cockroach-operator-system -> cockroach-system
    • added tolerations and nodeSelector to deployment

My cluster is defined like this:

apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  name: cockroachdb
  namespace: cockroachdb
spec:
  dataStore:
    pvc:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "10Gi"
        volumeMode: Filesystem
  tlsEnabled: true
  image:
    name: cockroachdb/cockroach:v21.1.11
  nodes: 3
  additionalLabels:
    crdb: is-cool
  tolerations:
    - key: tier
      operator: Equal
      value: platform
      effect: NoSchedule
  nodeSelector:
    tier: platform

My kubernetes setup is a Kind cluster with the following configuration:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: 192.168.0.0/16
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          taints:
            - key: tier
              value: platform
              effect: NoSchedule
          kubeletExtraArgs:
            node-labels: "tier=platform"
    extraPortMappings:
      # private ingress controller
      - containerPort: 30080
        hostPort: 7080
        protocol: TCP
      - containerPort: 30443
        hostPort: 7443
        protocol: TCP
      # public ingress controller
      - containerPort: 31080
        hostPort: 8080
        protocol: TCP
      - containerPort: 31443
        hostPort: 8443
        protocol: TCP
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          taints:
            - key: tier
              value: application
              effect: NoSchedule
          kubeletExtraArgs:
            node-labels: "tier=application"

This cluster runs tigera operator to install Calico CNI that supports network policies (though this information is irrelevant to the issue).

Problem

Default scheduler cannot find a node to schedule a pod for a version check job (cockroachdb-vcheck-...) with the following error:

Warning FailedScheduling 15s (x2 over 93s) default-scheduler 0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {tier: application}, that the pod didn't tolerate, 1 node(s) had taint {tier: platform}, that the pod didn't tolerate.

kuznero avatar Nov 21 '21 21:11 kuznero

@kuznero try adding the feature flag as documented. You should be good to go then

    - args:
       - -zap-log-level
       - info
       - -feature-gates=TolerationRules=true,AffinityRules=true,TopologySpreadRules=true

lin-crl avatar Apr 26 '23 23:04 lin-crl