
anti affinity missing on synced pod if namespace selector matches all namespaces

Open ytizhang opened this issue 1 year ago • 3 comments

What happened?

We install multiple charts in different namespaces. Some pods need pod anti-affinity across all namespaces, so that no two pods with the same role are scheduled onto the same node.

affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: sfcn.cisco.com/enforcement-point
                operator: Exists
          topologyKey: kubernetes.io/hostname
          namespaceSelector: {}

The "sfcn.cisco.com/enforcement-point" label marks a pod as having the enforcer role, and no two enforcer pods should be scheduled onto the same node, regardless of namespace. However, when this pod is synced to the host cluster, the selector on the "sfcn.cisco.com/enforcement-point" label is missing. Instead, we see the following anti-affinity rule, which matches every pod installed in the corresponding vcluster.

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              vcluster.loft.sh/managed-by: <vcluster-name>
          topologyKey: kubernetes.io/hostname
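
To make the difference between the two rules concrete, here is a minimal Go sketch that evaluates both selectors against two sets of pod labels. The `Selector` and `Requirement` types and the `Matches` method are simplified stand-ins written for this illustration, not the Kubernetes or vcluster types, and only the `Exists` operator is modeled. The label keys are left in their virtual-cluster form and the vcluster name `my-vcluster` is a placeholder; the sketch assumes any label translation on the host would apply equally to the selector and to the pod labels.

```go
package main

import "fmt"

// Requirement is a simplified stand-in for one matchExpressions entry.
// Only the "Exists" operator is modeled in this sketch.
type Requirement struct {
	Key      string
	Operator string
}

// Selector is a simplified stand-in for a Kubernetes label selector.
type Selector struct {
	MatchLabels      map[string]string
	MatchExpressions []Requirement
}

// Matches reports whether the given labels satisfy every clause of the selector.
func (s Selector) Matches(labels map[string]string) bool {
	for k, want := range s.MatchLabels {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	for _, r := range s.MatchExpressions {
		if r.Operator == "Exists" {
			if _, ok := labels[r.Key]; !ok {
				return false
			}
		}
	}
	return true
}

func main() {
	// Selector written in the virtual cluster.
	intended := Selector{MatchExpressions: []Requirement{
		{Key: "sfcn.cisco.com/enforcement-point", Operator: "Exists"},
	}}
	// Selector observed on the synced pod (placeholder vcluster name).
	synced := Selector{MatchLabels: map[string]string{
		"vcluster.loft.sh/managed-by": "my-vcluster",
	}}

	enforcer := map[string]string{
		"vcluster.loft.sh/managed-by":      "my-vcluster",
		"sfcn.cisco.com/enforcement-point": "true",
	}
	other := map[string]string{
		"vcluster.loft.sh/managed-by": "my-vcluster",
	}

	fmt.Println(intended.Matches(enforcer), intended.Matches(other)) // true false
	fmt.Println(synced.Matches(enforcer), synced.Matches(other))     // true true
}
```

Under these assumptions, the synced rule repels every pod of the vcluster rather than only enforcer pods, which is stricter than what the virtual pod asked for.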

What did you expect to happen?

I expect the matchExpressions to be preserved in the synced pod:

matchExpressions:
    - key: sfcn.cisco.com/enforcement-point
      operator: Exists

How can we reproduce it (as minimally and precisely as possible)?

Create a pod in a vcluster using the following spec:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: aws-cli
      image: amazon/aws-cli:latest
      command:
        - /bin/bash
        - '-c'
        - sleep infinity
      imagePullPolicy: Always
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: sfcn.cisco.com/enforcement-point
                operator: Exists
          topologyKey: kubernetes.io/hostname
          namespaceSelector: {}

Go to the host cluster and read the synced pod's spec. It will contain the following affinity config:

affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              vcluster.loft.sh/managed-by: <vcluster-name>
          topologyKey: kubernetes.io/hostname

Anything else we need to know?

No response

Host cluster Kubernetes version

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.16", GitCommit:"c5f43560a4f98f2af3743a59299fb79f07924373", GitTreeState:"clean", BuildDate:"2023-11-15T22:39:12Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.16-eks-8cb36c9", GitCommit:"3a3ea80e673d7867f47bdfbccd4ece7cb5f4a83a", GitTreeState:"clean", BuildDate:"2023-11-22T21:53:22Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Host cluster Kubernetes distribution

EKS

vcluster version

$ vcluster --version
vcluster version 0.18.1

vcluster Kubernetes distribution (k3s (default), k8s, k0s)

vcluster-eks

OS and Arch

OS: N/A
Arch: N/A

ytizhang, Jan 29 '24 22:01

@ytizhang thanks for creating this issue, mhh I think that should work, can you send the rewritten pod yaml from the host cluster?

The relevant implementation is here: https://github.com/loft-sh/vcluster/blob/be19d090ebc5d6bc7275cca68d062bbd608ca097/pkg/controllers/resources/pods/translate/translator.go#L772

FabianKramm, Jan 31 '24 10:01

Here is the rewritten pod yaml

apiVersion: v1
kind: Pod
metadata:
  name: test-pod-x-default-x-yitzhang-veks
  namespace: yitzhang
  uid: 5fa326da-f0b0-4d8e-82bd-9182698e085c
  resourceVersion: '948445374'
  creationTimestamp: '2024-02-02T22:31:02Z'
  labels:
    vcluster.loft.sh/managed-by: yitzhang-veks
    vcluster.loft.sh/namespace: default
    vcluster.loft.sh/ns-label-yitzhang-veks-x-cf1227b7b2: default
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'
    cni.projectcalico.org/containerID: 36404db2362f9160e61f414984be7592d9830baba782fe545bceec457fc27398
    cni.projectcalico.org/podIP: 192.168.59.146/32
    cni.projectcalico.org/podIPs: 192.168.59.146/32
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"test-pod","namespace":"default"},"spec":{"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"sfcn.cisco.com/enforcement-point","operator":"Exists"}]},"namespaceSelector":{},"topologyKey":"kubernetes.io/hostname"}]}},"containers":[{"command":["/bin/bash","-c","sleep
      infinity"],"image":"amazon/aws-cli:latest","imagePullPolicy":"Always","name":"aws-cli"}]}}
    vcluster.loft.sh/labels: ''
    vcluster.loft.sh/managed-annotations: kubectl.kubernetes.io/last-applied-configuration
    vcluster.loft.sh/name: test-pod
    vcluster.loft.sh/namespace: default
    vcluster.loft.sh/object-name: test-pod
    vcluster.loft.sh/object-namespace: default
    vcluster.loft.sh/object-uid: bd2375e4-c239-4bdc-a4c8-740a3a301378
    vcluster.loft.sh/service-account-name: default
    vcluster.loft.sh/uid: bd2375e4-c239-4bdc-a4c8-740a3a301378
spec:
  volumes:
    - name: kube-api-access-w4lw5
      projected:
        sources:
          - downwardAPI:
              items:
                - path: token
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.annotations['vcluster.loft.sh/token-kfpixaor']
                  mode: 420
          - configMap:
              name: kube-root-ca.crt-x-default-x-yitzhang-veks
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.annotations['vcluster.loft.sh/namespace']
        defaultMode: 420
  containers:
    - name: aws-cli
      image: amazon/aws-cli:latest
      command:
        - /bin/bash
        - '-c'
        - sleep infinity
      env:
        - name: KUBERNETES_PORT
          value: tcp://172.20.144.56:443
        - name: KUBERNETES_PORT_443_TCP
          value: tcp://172.20.144.56:443
        - name: KUBERNETES_PORT_443_TCP_ADDR
          value: 172.20.144.56
        - name: KUBERNETES_PORT_443_TCP_PORT
          value: '443'
        - name: KUBERNETES_PORT_443_TCP_PROTO
          value: tcp
        - name: KUBERNETES_SERVICE_HOST
          value: 172.20.144.56
        - name: KUBERNETES_SERVICE_PORT
          value: '443'
        - name: KUBERNETES_SERVICE_PORT_HTTPS
          value: '443'
      resources: {}
      volumeMounts:
        - name: kube-api-access-w4lw5
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: Always
  restartPolicy: Always
  terminationGracePeriodSeconds: 30
  dnsPolicy: None
  serviceAccountName: default-x-default-x-yitzhang-veks
  serviceAccount: default-x-default-x-yitzhang-veks
  automountServiceAccountToken: false
  nodeName: ip-10-216-64-90.us-west-2.compute.internal
  securityContext: {}
  hostname: test-pod
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              vcluster.loft.sh/managed-by: yitzhang-veks
          topologyKey: kubernetes.io/hostname
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  hostAliases:
    - ip: 172.20.144.56
      hostnames:
        - kubernetes
        - kubernetes.default
        - kubernetes.default.svc
  priority: 0
  dnsConfig:
    nameservers:
      - 172.20.241.77
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: '5'
  enableServiceLinks: false
  preemptionPolicy: PreemptLowerPriority

ytizhang, Feb 02 '24 22:02

@FabianKramm From reading the code at https://github.com/loft-sh/vcluster/blob/be19d090ebc5d6bc7275cca68d062bbd608ca097/pkg/controllers/resources/pods/translate/translator.go#L772, when the vPod's affinity term has a NamespaceSelector, the label selector translated from it overrides the one set at https://github.com/loft-sh/vcluster/blob/be19d090ebc5d6bc7275cca68d062bbd608ca097/pkg/controllers/resources/pods/translate/translator.go#L735
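
The override described above can be sketched roughly as follows. This is a simplified illustration and not the vcluster source: the types `Selector` and `AffinityTerm` and the functions `scopeToVCluster`, `translateOverriding`, `translateMerging`, and `merge` are hypothetical names made up for this example, and the vcluster name is a placeholder. `translateOverriding` mimics the observed behavior, while `translateMerging` shows one possible way to keep both sets of clauses; whether the project would fix it this way is an assumption.

```go
package main

import "fmt"

// Selector and AffinityTerm are simplified, hypothetical stand-ins for the
// Kubernetes LabelSelector and PodAffinityTerm types.
type Selector struct {
	MatchLabels      map[string]string
	MatchExpressions []string // label keys that must exist
}

type AffinityTerm struct {
	LabelSelector     *Selector
	NamespaceSelector *Selector
	TopologyKey       string
}

// scopeToVCluster builds the selector that limits matches to pods managed by
// one vcluster, as seen in the synced pod for an empty namespaceSelector.
func scopeToVCluster(vcluster string) *Selector {
	return &Selector{MatchLabels: map[string]string{
		"vcluster.loft.sh/managed-by": vcluster,
	}}
}

// translateOverriding mimics the reported behavior: the pod label selector is
// set first, then replaced outright when a namespace selector is present.
func translateOverriding(vcluster string, term AffinityTerm) AffinityTerm {
	out := AffinityTerm{LabelSelector: term.LabelSelector, TopologyKey: term.TopologyKey}
	if term.NamespaceSelector != nil {
		out.LabelSelector = scopeToVCluster(vcluster) // original clauses are lost here
	}
	return out
}

// translateMerging keeps the original clauses and adds the vcluster scope.
func translateMerging(vcluster string, term AffinityTerm) AffinityTerm {
	out := AffinityTerm{LabelSelector: term.LabelSelector, TopologyKey: term.TopologyKey}
	if term.NamespaceSelector != nil {
		out.LabelSelector = merge(term.LabelSelector, scopeToVCluster(vcluster))
	}
	return out
}

// merge combines the clauses of two selectors; nil selectors are skipped.
func merge(a, b *Selector) *Selector {
	out := &Selector{MatchLabels: map[string]string{}}
	for _, s := range []*Selector{a, b} {
		if s == nil {
			continue
		}
		for k, v := range s.MatchLabels {
			out.MatchLabels[k] = v
		}
		out.MatchExpressions = append(out.MatchExpressions, s.MatchExpressions...)
	}
	return out
}

func main() {
	term := AffinityTerm{
		LabelSelector:     &Selector{MatchExpressions: []string{"sfcn.cisco.com/enforcement-point"}},
		NamespaceSelector: &Selector{}, // corresponds to `namespaceSelector: {}`
		TopologyKey:       "kubernetes.io/hostname",
	}
	fmt.Printf("overriding: %+v\n", *translateOverriding("my-vcluster", term).LabelSelector)
	fmt.Printf("merging:    %+v\n", *translateMerging("my-vcluster", term).LabelSelector)
}
```

In the overriding variant the Exists clause on "sfcn.cisco.com/enforcement-point" disappears, which is consistent with the rewritten pod yaml above; in the merging variant it survives alongside the managed-by label.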

ytizhang, Feb 02 '24 23:02