
PV nodeAffinity matchExpressions problem with array items and `in` operator since 1.27.0

Open dsgnr opened this issue 1 year ago • 18 comments

What happened?

Since v1.27.0, the nodeAffinity for PersistentVolumes appears to have gained extra validation on whether the values actually exist, despite the In operator being used. This works as expected on 1.26.14 and no longer works from 1.27.0 onwards (tested up to 1.29.2).

I am unsure whether this is an intentional breaking change in 1.27.0 or something wrong in our manifests that we've just happened to get away with for a few years. The only thing that stands out to me in the changelog for 1.27.0 is some changes to the scheduler's Filter plugin, but this may well be unrelated.

What did you expect to happen?

When using the In operator, I'd expect the expression to apply only to nodes where the match is truthy.

How can we reproduce it (as minimally and precisely as possible)?

Note: using kind for easy reproduction.

v1.26.14 - working

creating a kind cluster

cat <<EOF > kind_conf
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.26.14
- role: worker
  image: kindest/node:v1.26.14
- role: worker
  image: kindest/node:v1.26.14
EOF

$ kind create cluster --name 12614-test --config kind_conf
$ k get no
NAME                       STATUS   ROLES           AGE   VERSION
12614-test-control-plane   Ready    control-plane   8h    v1.26.14
12614-test-worker          Ready    <none>          8h    v1.26.14
12614-test-worker2         Ready    <none>          8h    v1.26.14
# labelling a node to ensure pod is scheduled on specific worker
$ k label node 12614-test-worker nginx=nginx
cat <<EOF > manifest.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nginx
                operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /var/lib/nginx
          name: default-nginx-data
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: default-nginx-data
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: my_sc
      volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: default-nginx-data-nginx-0
  namespace: default
  labels:
    app: default-nginx-data-nginx-0
spec:
  storageClassName: my_sc
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /srv/k8s/default/nginx-0
  claimRef:
    namespace: default
    name: default-nginx-data-nginx-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
            - 12614-test-worker
            - 12614-prod-worker
EOF
# Apply the manifest
$ k apply -f manifest.yaml

Since 12614-test-worker is listed in the hostname values and exists in this cluster, the pod schedules as expected. The 12614-prod-worker node does not exist in this environment and so is ignored.

$ k get po
NAME      READY   STATUS    RESTARTS   AGE
nginx-0   1/1     Running   0          6m2s
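
As an extra check (using the same k alias for kubectl), the wide output should show the pod placed on the labelled worker:

$ k get po nginx-0 -o wide
# NODE should be 12614-test-worker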

v1.27.0 - not working

creating a kind cluster

cat <<EOF > kind_conf
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.27.0
- role: worker
  image: kindest/node:v1.27.0
- role: worker
  image: kindest/node:v1.27.0
EOF

$ kind create cluster --name 1270-test --config kind_conf
# labelling a node to ensure pod is scheduled on specific worker
$ k label node 1270-test-worker nginx=nginx
cat <<EOF > manifest.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nginx
                operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /var/lib/nginx
          name: default-nginx-data
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: default-nginx-data
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: my_sc
      volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: default-nginx-data-nginx-0
  namespace: default
  labels:
    app: default-nginx-data-nginx-0
spec:
  storageClassName: my_sc
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /srv/k8s/default/nginx-0
  claimRef:
    namespace: default
    name: default-nginx-data-nginx-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
            - 1270-test-worker
            - 1270-prod-worker
EOF
# Apply the manifest
$ k apply -f manifest.yaml

Same expectations as 1.26.14, but in this case the pod stays stuck in the Pending state due to the following:

$ k get no
NAME                      STATUS   ROLES           AGE     VERSION
1270-test-control-plane   Ready    control-plane   5m1s    v1.27.0
1270-test-worker          Ready    <none>          4m34s   v1.27.0
1270-test-worker2         Ready    <none>          4m39s   v1.27.0
$ k get po
NAME      READY   STATUS    RESTARTS   AGE
nginx-0   0/1     Pending   0          5m56s
$ kubectl events --for pod/nginx-0
LAST SEEN   TYPE      REASON             OBJECT        MESSAGE
3m6s        Warning   FailedScheduling   Pod/nginx-0   nodeinfo not found for node name "1270-prod-worker"

I have tested this all the way up to 1.29.2.

Anything else we need to know?

No response

Kubernetes version

v1.27.0 through to v1.29.2

Cloud provider

OS version

[root@stage1 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[root@stage1 ~]# uname -a
Linux stage1.domain.com 3.10.0-1160.108.1.el7.x86_64 #1 SMP Thu Jan 25 16:17:31 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

dsgnr avatar Feb 23 '24 08:02 dsgnr

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 23 '24 08:02 k8s-ci-robot

/sig storage

vaibhav2107 avatar Feb 23 '24 09:02 vaibhav2107

In a similar vein, we have noticed the same issue in K8s v1.27 when the kubernetes.io/hostname label doesn't match the node name. We see nodeinfo not found for <hostname>, despite the nodeinfo lookup (I think) being expected to use a node name.
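
For reference, one way to spot that mismatch with standard kubectl is to print the label alongside the node name:

$ kubectl get nodes -L kubernetes.io/hostname
# If the added label column differs from NAME, hostname values in a PV nodeAffinity
# will not resolve to an existing node name from the scheduler's point of view.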

msarfaty avatar Feb 23 '24 17:02 msarfaty

/sig scheduling

PrasanaaV avatar Feb 26 '24 20:02 PrasanaaV

This is because the non-existent node 1270-prod-worker is looked up in the scheduler's cache; this has been fixed by #119779.

Before #119779, getting the node would return an error: https://github.com/kubernetes/kubernetes/blob/bd2cb7205ee631b7d6c32ba5d7514ef75f011622/pkg/scheduler/schedule_one.go#L434-L444

After #119779 it is fixed: https://github.com/kubernetes/kubernetes/blob/da6be3b71826ca100d3aeeeea289a1a3272d8aca/pkg/scheduler/schedule_one.go#L485-L497

Since version 1.30, it runs normally.

@sanposhiho PTAL, do we need to cherry-pick this to the previous versions?

chengjoey avatar Apr 18 '24 07:04 chengjoey

Looks like my PR https://github.com/kubernetes/kubernetes/pull/119779 accidentally fixed this issue. 😅

https://github.com/kubernetes/kubernetes/commit/380c7f248e4ecc2a104b11652b6171ab65501cba is the trigger of this issue, introduced in v1.27; the volumebinding plugin just returns node names in PreFilterResult based on the nodeAffinity in the PV, and thus PreFilterResult could contain node names that don't actually exist.


@kubernetes/sig-scheduling-leads

One-line summary: if the nodeAffinity in a PV has a non-existent node name in an In expression on kubernetes.io/hostname, the scheduling always fails. And this bug has already been coincidentally fixed by #119779.

Questions:

  • In the first place, is it a bug worth cherry-picking?
  • If yes, should we cherry-pick #119779, which actually isn't a PR motivated by this issue? Or should we go another way?

sanposhiho avatar Apr 20 '24 05:04 sanposhiho

If it's a regression, we should consider a cherry-pick.

🤔 https://github.com/kubernetes/kubernetes/pull/119779 looks big, but the actual code changes are about 10 lines (the rest are comments and tests), so I think it would be ok to cherry-pick.

Can you explain why your PR fixes the problem?

alculquicondor avatar Apr 24 '24 20:04 alculquicondor

Can you explain why your PR fixes the problem?

Before my PR, we took the NodeInfo of the nodes in preFilterResult via sched.nodeInfoSnapshot.NodeInfos().Get(n) (ref). Consequently, we could get "not found" errors from it when preFilterResult had non-existent node name(s).

And after my PR, we are no longer using sched.nodeInfoSnapshot.NodeInfos().Get(n).
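
Roughly, with simplified stand-in types (not the real scheduler structures; the actual lookup is the Get(n) call referenced above), the difference looks like this:

package main

import "fmt"

// snapshot stands in for the scheduler's node snapshot (simplified; the real
// scheduler uses framework.NodeInfo objects).
var snapshot = map[string]bool{
	"1270-test-worker":  true,
	"1270-test-worker2": true,
}

// beforeFix mirrors the pre-#119779 flow: every node name in the PreFilterResult
// is looked up in the snapshot, and a missing entry fails the scheduling cycle.
func beforeFix(preRes []string) ([]string, error) {
	var nodes []string
	for _, name := range preRes {
		if !snapshot[name] {
			return nil, fmt.Errorf("nodeinfo not found for node name %q", name)
		}
		nodes = append(nodes, name)
	}
	return nodes, nil
}

// afterFix mirrors the post-#119779 flow: iterate the snapshot and keep only the
// nodes named in the PreFilterResult, so unknown names are simply ignored.
func afterFix(preRes []string) []string {
	wanted := make(map[string]bool, len(preRes))
	for _, name := range preRes {
		wanted[name] = true
	}
	var nodes []string
	for name := range snapshot {
		if wanted[name] {
			nodes = append(nodes, name)
		}
	}
	return nodes
}

func main() {
	// Node names taken from the PV's nodeAffinity; 1270-prod-worker does not exist.
	preRes := []string{"1270-test-worker", "1270-prod-worker"}

	if _, err := beforeFix(preRes); err != nil {
		fmt.Println("before:", err) // nodeinfo not found for node name "1270-prod-worker"
	}
	fmt.Println("after:", afterFix(preRes)) // [1270-test-worker]
}

The minimum fix quoted further down takes the same approach: iterate the real nodes and skip the ones not named in preRes, so names that do not correspond to any node are never looked up.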

sanposhiho avatar Apr 25 '24 02:04 sanposhiho

The pitfall is that when PreFilterResult was introduced, the scheduler directly fetched the nodeNames returned by PreFilter() from its node snapshot, so technically speaking, when a plugin (in-tree or out-of-tree) returns non-existent/illegal nodes, the pod's scheduling flow aborts immediately.

And the PR #109877 happened to stomp on this pitfall.

Huang-Wei avatar Apr 25 '24 17:04 Huang-Wei

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

Since this is already released code, I would suggest cherry-picking as-is.

Also, worth testing E2E whether the issue reproduces in v1.30.

@dsgnr do you think you can give it a try?

alculquicondor avatar Apr 25 '24 18:04 alculquicondor

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

yup, i'm in favor of picking the minimum fix, along with a minimum test #124539:

		for _, n := range allNodes {
			if !preRes.NodeNames.Has(n.Node().Name) {
				// We consider Nodes that are filtered out by PreFilterResult as rejected via UnschedulableAndUnresolvable.
				// We have to record them in NodeToStatusMap so that they won't be considered as candidates in the preemption.
				diagnosis.NodeToStatusMap[n.Node().Name] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "node is filtered out by the prefilter result")
				continue
			}
			nodes = append(nodes, n)
		}

b/c in addition to the "fix" part, #119779 actually introduced a slight behavior change. @sanposhiho do you think the above ^^ minimum fix + #124539 makes sense?

Huang-Wei avatar Apr 25 '24 18:04 Huang-Wei

Also, worth testing E2E whether the issue reproduces in v1.30.

#124539 can guard it.

Huang-Wei avatar Apr 25 '24 18:04 Huang-Wei

yup, i'm in favor of picking the minimum fix, along with a minimum test #124539 […]

@Huang-Wei @alculquicondor @sanposhiho Could you let me try it? I'll pick the minimum fix and add an e2e test.

chengjoey avatar Apr 26 '24 05:04 chengjoey

A simplified version of the sample is used for testing, manifest.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /tmp/data
  claimRef:
    namespace: default
    name: data
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - 1280-test-worker
            - 1280-prod-worker

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  volumeName: data
---
apiVersion: v1
kind: Pod
metadata:
  name: pause
  labels:
    app: pause
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
    volumeMounts:
    - mountPath: /tmp
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data

I also get this FailedScheduling error from kubectl describe pod pause.
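
For completeness, the reproduction with this manifest can follow the same pattern as above (assuming a kind cluster named 1280-test on an affected version, so that 1280-test-worker exists and 1280-prod-worker does not):

$ kind create cluster --name 1280-test --config kind_conf   # kind_conf as above, with an affected image such as kindest/node:v1.28.0
$ kubectl apply -f manifest.yaml
$ kubectl describe pod pause
# The Events section should show something like:
#   Warning  FailedScheduling  ...  nodeinfo not found for node name "1280-prod-worker"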

chengjoey avatar Apr 26 '24 09:04 chengjoey

@sanposhiho do you think the above ^^ minimum fix + https://github.com/kubernetes/kubernetes/pull/124539 makes sense?

@Huang-Wei That small fix should work. I don't have a strong opinion on whether we should go with a minimum fix vs. cherry-picking it as-is. I'll defer to you both @alculquicondor @Huang-Wei on that point.

sanposhiho avatar Apr 26 '24 09:04 sanposhiho

Let's move ahead with the small fix, since we all agree on it.

alculquicondor avatar Apr 26 '24 13:04 alculquicondor

/reopen

Cherry-pick PR(s) aren't done

sanposhiho avatar Apr 28 '24 12:04 sanposhiho

@sanposhiho: Reopened this issue.

In response to this:

/reopen

Cherry-pick PR(s) aren't done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 28 '24 12:04 k8s-ci-robot

FYI on a related crash #124930

alculquicondor avatar May 17 '24 15:05 alculquicondor

Is there any news on backporting the solution to older versions (v1.27+)?

0xdiba avatar Jun 04 '24 18:06 0xdiba

This has already been backported and released. But the fix for the crash reported in https://github.com/kubernetes/kubernetes/issues/124930 has not been released yet.

Let's close this issue for now, as it is completed.

alculquicondor avatar Jun 04 '24 18:06 alculquicondor

/close

alculquicondor avatar Jun 04 '24 18:06 alculquicondor

@alculquicondor: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 04 '24 18:06 k8s-ci-robot

The issue here stems from the observation by Mike Sarfaty above: volume-binding's prefilter assumes that it can equate the matchConstraint hostname labels with node names. This doesn't have to be the case; see #125336. Prefiltering will always fail in this situation. AFAICT this issue persists with the current head of the master branch.
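
Purely as an illustration (hypothetical names, not taken from this issue), a Node can legitimately carry a kubernetes.io/hostname label that differs from its object name:

apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal        # node name known to the scheduler
  labels:
    kubernetes.io/hostname: worker-a     # label value a PV nodeAffinity might target

A PV whose In values list worker-a would then yield a PreFilterResult entry for a node name the scheduler's snapshot has never seen, which is exactly the situation described above.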

jan-g avatar Jun 05 '24 11:06 jan-g