
calico-kube-controller pod is restarting : : Unauthorized

Open cpsrujana opened this issue 3 years ago • 13 comments

The calico-kube-controllers pod is going into CrashLoopBackOff and restarting.

Error

2022-05-19 13:33:46.822 [INFO][1] main.go 92: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0519 13:33:46.828542 1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2022-05-19 13:33:46.831 [INFO][1] main.go 113: Ensuring Calico datastore is initialized
2022-05-19 13:33:46.856 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=connection is unauthorized: Unauthorized
2022-05-19 13:33:46.856 [FATAL][1] main.go 118: Failed to initialize Calico datastore error=connection is unauthorized: Unauthorized

Context

Kubernetes is installed with the Calico CNI. All pods were up and running. The system was left idle for 7 days, after which we observed calico-kube-controllers restarting multiple times with the following error:

[ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=connection is unauthorized: Unauthorized
[FATAL][1] main.go 118: Failed to initialize Calico datastore error=connection is unauthorized: Unauthorized

Environment

  • Calico version: v3.19.2
  • K8s Version: v1.21.5
  • Operating System and version: Ubuntu

cpsrujana avatar May 20 '22 10:05 cpsrujana

2022-05-19 13:33:46.856 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=connection is unauthorized: Unauthorized

I'd recommend checking your API server logs to see if they contain more information as to why the connection was deemed unauthorized.

caseydavenport avatar May 20 '22 18:05 caseydavenport

2022-05-19 13:33:46.856 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=connection is unauthorized: Unauthorized

I'd recommend checking your API server logs to see if they contain more information as to why the connection was deemed unauthorized.

Thanks a lot for the response. Below is the error in the apiserver log:

E0519 13:34:15.705018 1 claims.go:126] unexpected validation error: *errors.errorString
E0519 13:34:15.705174 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token could not be validated.]"

Could you please give me some pointers on when this can happen? The cluster was in this state for some time. Later the calico-kube-controllers pod came back up and running. What might have gone wrong during that time?

cpsrujana avatar May 23 '22 07:05 cpsrujana

It's hard to say, Calico kube-controllers just uses the token provided to it by Kubernetes, so if we were using an invalid token it's probably because Kubernetes failed to update the token mounted into the pod or because the kubernetes client code failed to notice that the token was updated.

caseydavenport avatar May 23 '22 15:05 caseydavenport

It's hard to say, Calico kube-controllers just uses the token provided to it by Kubernetes, so if we were using an invalid token it's probably because Kubernetes failed to update the token mounted into the pod or because the kubernetes client code failed to notice that the token was updated.

We brought up this setup using kubespray and left it idle for a few days. Suddenly, we observed the calico-kube-controllers pod going down with this error and returning to Running after some time.

cpsrujana avatar May 24 '22 11:05 cpsrujana

I think kubespray might provide its own version of the Calico manifests? Could you try reinstalling Calico (either the same version or a later one) with our manifests from https://projectcalico.docs.tigera.io/about/about-calico and see if that might change any instability?

mgleung avatar May 31 '22 16:05 mgleung

@mgleung @caseydavenport,

When the calico binary starts, it initializes the cluster information using a calicoclient object. That object is built from the Calico environment variables and the kubeconfig (if none is present, it uses the in-cluster config). What is this in-cluster config? Where is it defined?

Using that object, I can see it fetches the default cluster configuration with clusterInfo, err := c.ClusterInformation().Get(ctx, globalClusterInfoName, options.GetOptions{}), with datastoreType set to kubernetes, and creates it if it does not exist.

if err != nil {
    // Create the default config if it doesn't already exist.
    if _, ok := err.(cerrors.ErrorResourceDoesNotExist); ok {
        newClusterInfo := v3.NewClusterInformation()
        newClusterInfo.Name = globalClusterInfoName
        newClusterInfo.Spec.CalicoVersion = calicoVersion
        newClusterInfo.Spec.ClusterType = clusterType
        newClusterInfo.Spec.ClusterGUID = fmt.Sprintf("%s", hex.EncodeToString(uuid.NewV4().Bytes()))
        datastoreReady := true
        newClusterInfo.Spec.DatastoreReady = &datastoreReady
        _, err = c.ClusterInformation().Create(ctx, newClusterInfo, options.SetOptions{})
        if err != nil {
            if _, ok := err.(cerrors.ErrorResourceAlreadyExists); ok {
                log.Info("Failed to create global ClusterInformation; another node got there first.")
                time.Sleep(1 * time.Second)
                continue
            }
            log.WithError(err).WithField("ClusterInformation", newClusterInfo).Errorf("Error creating cluster information config")
            return err
        }
    } else {
        log.WithError(err).WithField("ClusterInformation", globalClusterInfoName).Errorf("Error getting cluster information config")
        return err
    }
    break
}

It is failing in the else branch.

I have some questions about how the default cluster configuration is created:

  1. Does it contact the API server for some parameters?
  2. How are the keys/certs used by the controller to connect to the API server?
  3. Are those certs rotated or renewed after some time?
  4. Where are those certs defined and used?

akshaysharama avatar Jun 08 '22 10:06 akshaysharama

The key / certs / token used by kube-controllers are all injected into the container by the kubelet on the system, and then interpreted by the Kubernetes golang client when creating a connection. The go client automatically handles any rotation of those credentials once the projection into the container is updated.

caseydavenport avatar Jun 10 '22 09:06 caseydavenport

You should check the token within the pod to make sure that it's valid. We use in-cluster configuration as described here: https://github.com/kubernetes/client-go/tree/master/examples/in-cluster-client-configuration#authenticating-inside-the-cluster

Given the API server logs are complaining about the token, it might be worth verifying that the token mounted at /var/run/secrets/kubernetes.io/serviceaccount is actually valid, since that's the token the Calico code will use.
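For reference, in-cluster configuration is derived from the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables that Kubernetes sets in every pod, plus the token and ca.crt files under the service account mount. The sketch below mirrors that lookup with the standard library only; `inClusterSettings` and `resolveInCluster` are hypothetical names, not client-go APIs.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"path"
)

// Directory where the kubelet mounts the service account credentials.
const saDir = "/var/run/secrets/kubernetes.io/serviceaccount"

// inClusterSettings is a hypothetical struct holding what an in-cluster
// client needs: the API server address and the credential file paths.
type inClusterSettings struct {
	Host      string
	TokenFile string
	CAFile    string
}

// resolveInCluster builds the settings from the environment. getenv is
// passed in so the lookup can be exercised outside a cluster.
func resolveInCluster(getenv func(string) string) (inClusterSettings, error) {
	host := getenv("KUBERNETES_SERVICE_HOST")
	port := getenv("KUBERNETES_SERVICE_PORT")
	if host == "" || port == "" {
		return inClusterSettings{}, errors.New("KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be set")
	}
	return inClusterSettings{
		Host:      "https://" + net.JoinHostPort(host, port),
		TokenFile: path.Join(saDir, "token"),
		CAFile:    path.Join(saDir, "ca.crt"),
	}, nil
}

func main() {
	s, err := resolveInCluster(os.Getenv)
	if err != nil {
		fmt.Println("not running inside a cluster:", err)
		return
	}
	fmt.Println("API server:", s.Host)
	fmt.Println("token file:", s.TokenFile)
	fmt.Println("CA file:   ", s.CAFile)
}
```

Run inside the calico-kube-controllers container, this prints the address and files the client would rely on, which is a quick way to confirm the mount is present.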

caseydavenport avatar Jun 10 '22 09:06 caseydavenport

@cpsrujana Any updates?

song-jiang avatar Jun 14 '22 16:06 song-jiang

I'm going to close this for now due to inactivity but please feel free to reopen this issue if you're still experiencing this issue and have diags to provide.

lmm avatar Jun 28 '22 16:06 lmm

@lmm, can you please reopen the issue? We are still debugging and still facing the issue.

cpsrujana avatar Jun 30 '22 11:06 cpsrujana

@caseydavenport,

Hi, I've tried checking the token validity using the path "/var/lib/kubelet/pods/5cc9962f-818f-4ff1-9824-801963d15893/../token".

This is the host path backing the pod's /var/run/secrets/kubernetes.io/serviceaccount mount.

I can see the token is valid for 1 year from the day it was created. So why is the token reported as invalid or unauthorized when it is valid for a year? Our deployment has been running for the past 48 days and we are seeing 172 restarts.

Below is the decoded token.

{
  "aud": [ "https://kubernetes.default.svc.cluster.local" ],
  "exp": 1688132695,
  "iat": 1656596695,
  "iss": "https://kubernetes.default.svc.cluster.local",
  "kubernetes.io": {
    "namespace": "kube-system",
    "pod": {
      "name": "calico-kube-controllers-76758f8d4c-6l2dr",
      "uid": "5cc9962f-818f-4ff1-9824-801963d15893"
    },
    "serviceaccount": {
      "name": "calico-kube-controllers",
      "uid": "6d8ba93f-ceaf-47f3-a90b-02e7c1d55745"
    },
    "warnafter": 1656600302
  },
  "nbf": 1656596695,
  "sub": "system:serviceaccount:kube-system:calico-kube-controllers"
}

cpsrujana avatar Jun 30 '22 11:06 cpsrujana

@cpsrujana it's hard to say with just the token what the issue could be here. I would say we probably want to try a few basic things to validate the token:

  • Quick check if the token is valid by adding it to a curl request to the K8s API: curl --cacert ca.crt -H "Authorization: Bearer {token}" https://kubernetes.default/api/v1/namespaces/{namespace}/pods
  • Check the token against the token request API (https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/)
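The quick curl check from the first bullet can also be scripted. This is a sketch under assumptions: it is meant to run inside the pod where the service account files are mounted, it requests the pod list path /api/v1/namespaces/{namespace}/pods, and `newAuthRequest` and `newClient` are hypothetical helper names. It uses only the Go standard library.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

// Directory where the kubelet mounts the service account credentials.
const saDir = "/var/run/secrets/kubernetes.io/serviceaccount"

// newAuthRequest builds a GET for the pod list in namespace and attaches
// the token as a bearer credential, like curl -H "Authorization: Bearer ...".
func newAuthRequest(apiServer, namespace, token string) (*http.Request, error) {
	url := fmt.Sprintf("%s/api/v1/namespaces/%s/pods", strings.TrimRight(apiServer, "/"), namespace)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(token))
	return req, nil
}

// newClient returns an HTTP client that trusts only the given CA bundle,
// the equivalent of curl --cacert ca.crt.
func newClient(caPEM []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA bundle")
	}
	return &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}},
	}, nil
}

func main() {
	token, err := os.ReadFile(saDir + "/token")
	if err != nil {
		fmt.Println("cannot read token:", err)
		return
	}
	ca, err := os.ReadFile(saDir + "/ca.crt")
	if err != nil {
		fmt.Println("cannot read CA bundle:", err)
		return
	}
	client, err := newClient(ca)
	if err != nil {
		fmt.Println("cannot build client:", err)
		return
	}
	req, err := newAuthRequest("https://kubernetes.default", "kube-system", string(token))
	if err != nil {
		fmt.Println("cannot build request:", err)
		return
	}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("API server answered:", resp.Status)
}
```

When reading the result, a 401 Unauthorized means the token itself was rejected, whereas a 403 Forbidden means the token was accepted but the service account lacks permission for that request.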

Hopefully those will turn up some more details. On another note, could you describe your environment a bit more? I know that you mentioned earlier that this is a kubespray setup but are there other details about your environment that you could share with us?

mgleung avatar Jul 26 '22 16:07 mgleung