
IRSA not working for kops 1.22.4


/kind support

1. What kops version are you running? The command kops version will display this information.

1.22.4

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.21.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

serviceAccountIssuerDiscovery:
    discoveryStore: s3://bucket-name
    enableAWSOIDCProvider: true

was added to the cluster spec, along with

useServiceAccountExternalPermissions: true
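
For context, here is roughly where these fields sit in our cluster manifest (bucket name is a placeholder; as far as I understand, useServiceAccountExternalPermissions goes under spec.iam):

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
spec:
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://bucket-name
    enableAWSOIDCProvider: true
  iam:
    useServiceAccountExternalPermissions: true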

Service account and deployment were created through kubectl.

The pod is running with the correct env variables (AWS_DEFAULT_REGION, AWS_REGION, AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE, and AWS_STS_REGIONAL_ENDPOINTS).

However, when running kubectl exec -it -n default pod-identity-webhook-test -- aws sts get-caller-identity or any other aws command I get the error:

Unable to locate credentials. You can configure credentials by running "aws configure".
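
For reference, this is roughly how we checked what the webhook injected (the token path is the standard one the webhook projects; pod and namespace as above):

kubectl exec -n default pod-identity-webhook-test -- env | grep AWS
kubectl exec -n default pod-identity-webhook-test -- ls -l /var/run/secrets/eks.amazonaws.com/serviceaccount/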

5. Anything else we need to know?

Not sure if the issue is simply that pod-identity-webhook is not installed, even though we are using the newer useServiceAccountExternalPermissions: true option.

I can share specific lines of the cluster manifest or deployment if needed.

The AWS role is also essentially the same as https://github.com/kubernetes/kops/blob/master/tests/integration/update_cluster/many-addons-ccm-irsa24/data/aws_iam_role_aws-load-balancer-controller.kube-system.sa.minimal.example.com_policy

jakebyrd avatar Jul 12 '22 17:07 jakebyrd

I don't believe that Kops 1.22.x is currently supported. You should upgrade to at least 1.23.2 and test there (Kops 1.23.x is compatible with Kubernetes 1.22.x).

Note also that the bucket you use for this cannot have any periods in its name or it won't work. You also need to use a different bucket than your state bucket, and you should make sure that after you run a kops update the files are actually created in the OIDC bucket.
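
A quick sanity check is listing the bucket after kops update (bucket name is a placeholder; exact key names can vary by kops version, but you should at least see a discovery document and a JWKS/signing-keys file):

aws s3 ls s3://your-oidc-bucket --recursive

You would expect entries like .well-known/openid-configuration plus the keys document it points at.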

Kubernetes 1.22.1

Not that I think this is the culprit or anything, but you should probably upgrade to the latest 1.22 release (1.22.11, I believe) just because there are a few known vulnerabilities in that version.

ReillyBrogan avatar Jul 12 '22 18:07 ReillyBrogan

So I was actually wrong about the kubernetes version. We are running kubernetes 1.21.9. As for the bucket, we have tried multiple buckets and none of them have periods in their names. We can upgrade kops, but how can we know that will fix the issue? Also let me know if you need more info to troubleshoot this better.

jakebyrd avatar Jul 13 '22 17:07 jakebyrd

We can upgrade kops, but how can we know that will fix the issue?

How do you know it won't? There have been a huge number of bugfixes between Kops 1.22 and Kops 1.23/1.24, and IRSA has seen several of them as the feature has continued to mature. Kops 1.23/1.24 still supports Kubernetes 1.21.

We are running kubernetes 1.21.9

Kubernetes 1.21 is end of life. You should strongly consider upgrading.

Also let me know if you need more info to troubleshoot this better.

Like I said in my previous comment, you should check that kops update cluster is actually populating your OIDC bucket with the appropriate files and that the bucket has the right permissions on it (the bucket MUST be world-readable).
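
An easy way to test world-readability is an unauthenticated fetch of the discovery document (bucket name is a placeholder):

curl -s https://your-oidc-bucket.s3.amazonaws.com/.well-known/openid-configuration

If that returns an S3 AccessDenied error instead of a JSON document, STS will not be able to validate tokens against the issuer.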

ReillyBrogan avatar Jul 13 '22 20:07 ReillyBrogan

Did you install pod-identity-webhook manually? And is it working fine?
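
Something like this should show whether it is up (the label selector is a guess, adjust to however it was installed):

kubectl -n kube-system get pods -l app=pod-identity-webhook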

h3poteto avatar Jul 14 '22 01:07 h3poteto

You may be able to see the error in the pod-identity-webhook logs as well.
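
Something like this, assuming the usual deployment name:

kubectl -n kube-system logs deployment/pod-identity-webhook --tail=100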

ReillyBrogan avatar Jul 14 '22 19:07 ReillyBrogan

So we have updated kops to 1.23.2; however, when using our pipeline we run into the following issue.

│ The plugin returned an unexpected error from
│ plugin.(*GRPCProvider).PlanResourceChange: rpc error: code =
│ ResourceExhausted desc = grpc: received message larger than max (4304578
│ vs. 4194304)

This error occurs with a certmanager-addons bucket object. It appears to be a terraform limitation (the provider response exceeds terraform's 4 MiB gRPC message limit, per the 4304578 vs. 4194304 in the error), but we wanted to check whether you all think it could be something else.

jakebyrd avatar Jul 19 '22 15:07 jakebyrd

It seems we are now having issues with the pod-identity-webhook. Here are the kubelet logs.

-- Logs begin at Thu 2022-02-17 16:52:04 UTC. --
Jul 19 20:47:32 ip-10-15-4-159 kubelet[5318]: E0719 20:47:32.011164    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:47:36 ip-10-15-4-159 kubelet[5318]: E0719 20:47:36.319480    5318 kubelet.go:1683] "Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"pod-identity-webhook.amazonaws.com\": Post \"https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s\": dial tcp 100.66.148.239:443: connect: connection refused" pod="kube-system/kube-controller-manager-ip-10-15-4-159.us-west-2.compute.internal"
Jul 19 20:47:37 ip-10-15-4-159 kubelet[5318]: E0719 20:47:37.012028    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:47:42 ip-10-15-4-159 kubelet[5318]: E0719 20:47:42.013117    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:47:42 ip-10-15-4-159 kubelet[5318]: E0719 20:47:42.308427    5318 kubelet.go:1683] "Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"pod-identity-webhook.amazonaws.com\": Post \"https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s\": dial tcp 100.66.148.239:443: connect: connection refused" pod="kube-system/etcd-manager-main-ip-10-15-4-159.us-west-2.compute.internal"
Jul 19 20:47:47 ip-10-15-4-159 kubelet[5318]: E0719 20:47:47.014484    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:47:47 ip-10-15-4-159 kubelet[5318]: E0719 20:47:47.295964    5318 kubelet.go:1683] "Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"pod-identity-webhook.amazonaws.com\": Post \"https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s\": dial tcp 100.66.148.239:443: connect: connection refused" pod="kube-system/kube-proxy-ip-10-15-4-159.us-west-2.compute.internal"
Jul 19 20:47:51 ip-10-15-4-159 kubelet[5318]: E0719 20:47:51.296391    5318 kubelet.go:1683] "Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"pod-identity-webhook.amazonaws.com\": Post \"https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s\": dial tcp 100.66.148.239:443: connect: connection refused" pod="kube-system/etcd-manager-events-ip-10-15-4-159.us-west-2.compute.internal"
Jul 19 20:47:52 ip-10-15-4-159 kubelet[5318]: E0719 20:47:52.015670    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:47:57 ip-10-15-4-159 kubelet[5318]: E0719 20:47:57.016882    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:02 ip-10-15-4-159 kubelet[5318]: E0719 20:48:02.017397    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:07 ip-10-15-4-159 kubelet[5318]: E0719 20:48:07.017993    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:12 ip-10-15-4-159 kubelet[5318]: E0719 20:48:12.019221    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:17 ip-10-15-4-159 kubelet[5318]: E0719 20:48:17.019799    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:22 ip-10-15-4-159 kubelet[5318]: E0719 20:48:22.021300    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:24 ip-10-15-4-159 kubelet[5318]: E0719 20:48:24.288016    5318 kubelet.go:1683] "Failed creating a mirror pod for" err="Internal error occurred: failed calling webhook \"pod-identity-webhook.amazonaws.com\": Post \"https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s\": dial tcp 100.66.148.239:443: connect: connection refused" pod="kube-system/kube-scheduler-ip-10-15-4-159.us-west-2.compute.internal"
Jul 19 20:48:26 ip-10-15-4-159 kubelet[5318]: I0719 20:48:26.030440    5318 kubelet_getters.go:176] "Pod status updated" pod="kube-system/kube-proxy-ip-10-15-4-159.us-west-2.compute.internal" status=Running
Jul 19 20:48:26 ip-10-15-4-159 kubelet[5318]: I0719 20:48:26.030486    5318 kubelet_getters.go:176] "Pod status updated" pod="kube-system/kube-scheduler-ip-10-15-4-159.us-west-2.compute.internal" status=Running
Jul 19 20:48:26 ip-10-15-4-159 kubelet[5318]: I0719 20:48:26.030500    5318 kubelet_getters.go:176] "Pod status updated" pod="kube-system/etcd-manager-events-ip-10-15-4-159.us-west-2.compute.internal" status=Running
Jul 19 20:48:26 ip-10-15-4-159 kubelet[5318]: I0719 20:48:26.030514    5318 kubelet_getters.go:176] "Pod status updated" pod="kube-system/kube-apiserver-ip-10-15-4-159.us-west-2.compute.internal" status=Running
Jul 19 20:48:26 ip-10-15-4-159 kubelet[5318]: I0719 20:48:26.030527    5318 kubelet_getters.go:176] "Pod status updated" pod="kube-system/kube-controller-manager-ip-10-15-4-159.us-west-2.compute.internal" status=Running
Jul 19 20:48:26 ip-10-15-4-159 kubelet[5318]: I0719 20:48:26.030540    5318 kubelet_getters.go:176] "Pod status updated" pod="kube-system/etcd-manager-main-ip-10-15-4-159.us-west-2.compute.internal" status=Running
Jul 19 20:48:27 ip-10-15-4-159 kubelet[5318]: E0719 20:48:27.022458    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:32 ip-10-15-4-159 kubelet[5318]: E0719 20:48:32.023770    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:37 ip-10-15-4-159 kubelet[5318]: E0719 20:48:37.024581    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jul 19 20:48:42 ip-10-15-4-159 kubelet[5318]: E0719 20:48:42.025196    5318 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

When cert-manager is enabled and set to managed: true, we get the above gRPC error.

But when cert-manager is set to managed: false, we instead run into the current issue, and the nodes and masters never come up.
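
For reference, the part of the cluster spec we are toggling looks roughly like this:

spec:
  certManager:
    enabled: true
    managed: false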

Any help would be appreciated. Thanks

jakebyrd avatar Jul 21 '22 15:07 jakebyrd

Are you manually installing the webhook? Or custom static pods? It seems like the webhook is blocking pods that it shouldn't care about.

olemarkus avatar Jul 21 '22 15:07 olemarkus

We installed it using the default installation method that kops provides.

jakebyrd avatar Jul 21 '22 18:07 jakebyrd

Can you get the mutating webhook object and paste it here? It looks like it is matching things it never should. I am also confused about the webhook being installed before kops 1.23.

I also suggest updating to kops 1.24.
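
Something like this should dump the webhook object (the object name is what kops usually creates; list them all first if yours differs):

kubectl get mutatingwebhookconfigurations
kubectl get mutatingwebhookconfiguration pod-identity-webhook -o yaml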

olemarkus avatar Jul 21 '22 18:07 olemarkus

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 19 '22 19:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 18 '22 20:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 18 '22 20:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 18 '22 20:12 k8s-ci-robot