ControlPlane node is not ready in scalability tests when run on GCE
In scalability tests, the control-plane node never becomes ready. We usually don't suffer from this, as almost all our tests run 100+ nodes and we tolerate 1% of nodes not being initialized correctly. But it is problematic for tests like https://testgrid.k8s.io/sig-scalability-experiments#watchlist-off
Looking into the kubelet logs, the reason seems to be:
May 11 09:09:13.886270 bootstrap-e2e-master kubelet[2782]: E0511 09:09:13.886233 2782 kubelet.go:2753] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
FWIW, it seems to be related to some of our preset settings, as e.g. https://testgrid.k8s.io/sig-scalability-node#node-containerd-throughput doesn't suffer from it.
@kubernetes/sig-scalability @mborsz @Argh4k @p0lyn0mial - FYI
The only suspicious one that I see in our preset is this one:
- name: KUBE_GCE_PRIVATE_CLUSTER
value: "true"
containerd logs from master:
May 12 08:43:21.379251 bootstrap-e2e-master containerd[650]: time="2023-05-12T08:43:21.379201176Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
On nodes, the CNI config comes from a template: NetworkPluginConfTemplate:/home/kubernetes/cni.template
On the master it is empty. In the master logs I can see that setup-containerd is called from configure-helper and it should set the template path. My guess is that https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L3181 is executed, even though it should not be.
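For context, that template path comes from containerd's CRI plugin config; on a healthy node, /etc/containerd/config.toml contains roughly something like the following (a sketch using the usual GCE defaults, not copied from this particular run):

```toml
# Sketch of the CRI CNI section that setup-containerd is expected to write.
# Paths are the typical GCE/GCI defaults and are illustrative only.
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/home/kubernetes/bin"
  conf_dir = "/etc/cni/net.d"
  conf_template = "/home/kubernetes/cni.template"
```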
I have SSHed onto the master and it looks like all CNI-related configuration files are in place. kubectl describe node on the master shows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal RegisteredNode 21m node-controller Node bootstrap-e2e-master event: Registered Node bootstrap-e2e-master in Controller
Normal CIDRAssignmentFailed 26s (x56 over 21m) cidrAllocator Node bootstrap-e2e-master status is now: CIDRAssignmentFailed
Kube controller manager logs:
E0512 13:12:32.119653 11 cloud_cidr_allocator.go:315] "Failed to update the node PodCIDR after multiple attempts" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" node="bootstrap-e2e-master" cidrStrings=["10.64.0.0/24","10.40.0.2/32"]
E0512 13:12:32.119671 11 cloud_cidr_allocator.go:178] "Error updating CIDR" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" workItem="bootstrap-e2e-master"
E0512 13:12:32.119682 11 cloud_cidr_allocator.go:187] "Exceeded retry count, dropping from queue" workItem="bootstrap-e2e-master"
I0512 13:12:32.119755 11 event.go:307] "Event occurred" object="bootstrap-e2e-master" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRAssignmentFailed" message="Node bootstrap-e2e-master status is now: CIDRAssignmentFailed"
Wojtek's gut feeling was right.
@p0lyn0mial if you want, we can create a PR to add:
- --env=KUBE_GCE_PRIVATE_CLUSTER=false
to the tests, and they should work just fine. In the meantime I will try to understand why KUBE_GCE_PRIVATE_CLUSTER makes the master node get two CIDRs.
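Concretely, that would just mean adding the flag to the job's arguments in test-infra, roughly like this (only the added line matters; the surrounding job fields are illustrative):

```yaml
# Illustrative fragment of the prow job spec; job name and other fields omitted.
spec:
  containers:
  - args:
    - --env=KUBE_GCE_PRIVATE_CLUSTER=false
```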
Does it have cloud NAT enabled?
If not, the private network may have issues fetching e.g. from registry.k8s.io, which, unlike GCR, isn't a first-party GCP service.
cc @aojea re: GCE cidr allocation :-)
E0512 13:12:32.119671 11 cloud_cidr_allocator.go:178] "Error updating CIDR" err="failed to patch node CIDR: Node "bootstrap-e2e-master" is invalid: spec.podCIDRs: Invalid value: []string{"10.64.0.0/24", "10.40.0.2/32"}: may specify no more than one CIDR for each IP family" workItem="bootstrap-e2e-master"
https://github.com/kubernetes/test-infra/issues/29500#issuecomment-1545732863 @basantsa1989 we have a bug in the allocator https://github.com/kubernetes/kubernetes/commit/a013c6a2db54c59b78de974b181586723e088246
If we receive multiple CIDRs before patching for dual-stack, we should validate that they really are dual-stack.
We have to fix it both in k/k and in cloud-provider-gcp: https://github.com/kubernetes/cloud-provider-gcp/blob/67d1fd9f7255629fac3adfc956d0c8b2ac5f50f0/pkg/controller/nodeipam/ipam/cloud_cidr_allocator.go#L341-L344
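To make the idea concrete, here is a minimal sketch of that kind of pre-check (a hypothetical standalone helper, not the actual allocator code):

```go
// Hypothetical sketch of the pre-check suggested above: only keep multiple
// CIDRs if they really form a dual-stack pair (different IP families);
// otherwise fall back to the first CIDR seen for each family.
package main

import (
	"fmt"
	"net"
)

func pickNodeCIDRs(cidrStrings []string) ([]string, error) {
	var picked []string
	seen := map[bool]bool{} // keyed by "is IPv6"
	for _, s := range cidrStrings {
		ip, _, err := net.ParseCIDR(s)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %v", s, err)
		}
		isIPv6 := ip.To4() == nil
		if seen[isIPv6] {
			// A second CIDR of the same family is not dual-stack; skip it
			// instead of trying to patch both onto the node.
			continue
		}
		seen[isIPv6] = true
		picked = append(picked, s)
	}
	return picked, nil
}

func main() {
	// The two IPv4 ranges from this issue: the pod CIDR and the /32
	// master-internal-IP alias added by KUBE_GCE_PRIVATE_CLUSTER.
	fmt.Println(pickNodeCIDRs([]string{"10.64.0.0/24", "10.40.0.2/32"}))
	// Output: [10.64.0.0/24] <nil>
}
```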
FYI: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/util.sh#L3008 is the place where we add the master internal IP as a second alias range when KUBE_GCE_PRIVATE_CLUSTER is used.
This second IP is then picked up by kube-controller-manager (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/gce/gce_instances.go#L496), so the allocator thinks we have a dual-stack setup and tries to apply both CIDRs, which fails because a node can have at most one IPv4 CIDR.
Kube controller manager logs:
E0512 13:12:32.119653 11 cloud_cidr_allocator.go:315] "Failed to update the node PodCIDR after multiple attempts" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" node="bootstrap-e2e-master" cidrStrings=["10.64.0.0/24","10.40.0.2/32"]
E0512 13:12:32.119671 11 cloud_cidr_allocator.go:178] "Error updating CIDR" err="failed to patch node CIDR: Node \"bootstrap-e2e-master\" is invalid: spec.podCIDRs: Invalid value: []string{\"10.64.0.0/24\", \"10.40.0.2/32\"}: may specify no more than one CIDR for each IP family" workItem="bootstrap-e2e-master"
E0512 13:12:32.119682 11 cloud_cidr_allocator.go:187] "Exceeded retry count, dropping from queue" workItem="bootstrap-e2e-master"
I0512 13:12:32.119755 11 event.go:307] "Event occurred" object="bootstrap-e2e-master" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRAssignmentFailed" message="Node bootstrap-e2e-master status is now: CIDRAssignmentFailed"
@Argh4k do you have the entire logs?
@aojea https://gcsweb.k8s.io/gcs/sig-scalability-logs/ci-kubernetes-e2e-gci-gce-scalability-watch-list-off/1658029086385115136/bootstrap-e2e-master/ has all the logs from the master
/sig network
Based on @basantsa1989's comment https://github.com/kubernetes/kubernetes/pull/118043#issuecomment-1553661135, the allocator is working as expected and the problem is that this is not supported:
https://github.com/kubernetes/kubernetes/blob/8db4d63245a89a78d76ff5916c37439805b11e5f/cluster/gce/util.sh#L3008
Can we configure the cluster in a different way so that we don't pass two CIDRs?
I hope we can. Unfortunately I haven't had much time to look into this, and the other work was unblocked by running the tests in a small public cluster.
@Argh4k Hey, a friendly reminder to work on this issue :)
It looks like having a private cluster would increase the available egress bandwidth. A higher egress bandwidth would allow us to generate larger test traffic. Currently, we had to reduce the test traffic because it seems to be throttled by the limited egress bandwidth, which shows up as increased latency.
See https://github.com/kubernetes/perf-tests/issues/2287
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I think that this issue still hasn't been resolved
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I think that this issue still hasn't been resolved
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@aojea thoughts on this?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.