
Make controller clients more resilient to token changes

Open liggitt opened this issue 9 years ago • 19 comments

Controllers are using service account tokens to make API calls for builds/deployments/replication controllers.

There are circumstances where their API tokens are no longer valid:

  1. Token is deleted/rotated
  2. Signing key is changed

Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.

liggitt avatar Jun 29 '15 19:06 liggitt

I question that we view this as p2 1.5 years later :)

smarterclayton avatar Jan 23 '17 03:01 smarterclayton

More things are using controllers now; I'd call it p1.

liggitt avatar Jan 23 '17 03:01 liggitt

Could you provide steps to reproduce the situation which currently leads to failure?

adelton avatar Oct 16 '17 13:10 adelton

With the controllers running, remove the service account token secrets in the openshift-infra or kube-system namespaces. The controllers start encountering 401 errors but do not exit; exiting and restarting would allow them to pick up new valid credentials.

liggitt avatar Oct 16 '17 13:10 liggitt

I have OpenShift 3.7 all-on-one setup.

I've created deployment via

oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

and got

# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest 
    deployment #1 deployed 20 hours ago - 1 pod

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

There are no secrets in

# oc get secrets -n kuby-system
No resources found.

but there's a bunch of them in openshift-infra:

# oc get secrets -n openshift-infra
NAME                                                      TYPE                                  DATA      AGE
[...]
service-serving-cert-controller-dockercfg-3jls5           kubernetes.io/dockercfg               1         20h
service-serving-cert-controller-token-0vlf6               kubernetes.io/service-account-token   4         20h
service-serving-cert-controller-token-p6672               kubernetes.io/service-account-token   4         20h
serviceaccount-controller-dockercfg-xt97c                 kubernetes.io/dockercfg               1         20h
serviceaccount-controller-token-5tp8n                     kubernetes.io/service-account-token   4         20h
serviceaccount-controller-token-zmnf9                     kubernetes.io/service-account-token   4         20h
serviceaccount-pull-secrets-controller-dockercfg-1clmr    kubernetes.io/dockercfg               1         20h
serviceaccount-pull-secrets-controller-token-41gmz        kubernetes.io/service-account-token   4         20h
serviceaccount-pull-secrets-controller-token-7p29z        kubernetes.io/service-account-token   4         20h
[...]

I've deleted the two serviceaccount-controller-tokens:

# oc delete secret/serviceaccount-controller-token-5tp8n -n openshift-infra
secret "serviceaccount-controller-token-5tp8n" deleted
# oc delete secret/serviceaccount-controller-token-zmnf9 -n openshift-infra
secret "serviceaccount-controller-token-zmnf9" deleted

I have redeployed with

# oc rollout latest dc/ruby-ex
deploymentconfig "ruby-ex" rolled out

but things are still passing:

# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest 
    deployment #2 deployed 4 minutes ago - 1 pod
    deployment #1 deployed 20 hours ago

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.

with no (relevant) errors in the node log.

What should I have done differently to reproduce the issue?

adelton avatar Oct 17 '17 06:10 adelton

There are no secrets in

# oc get secrets -n kuby-system

kube-system :)

The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.

liggitt avatar Oct 17 '17 12:10 liggitt

kube-system :)

Auch. Sorry about that.

The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.

I've deleted

# oc delete secret/replication-controller-token-8zfdh secret/replication-controller-token-v41d4 -n kube-system

and that seems to have disrupted the rc/ruby-ex:

ruby-ex-5-deploy   0/1       Error       0          1h

Where do I find the controller logs? In the -deploy pod log, there does not seem to be anything specific about the 401 errors:

# oc logs ruby-ex-5-deploy
--> Scaling up ruby-ex-5 from 0 to 1, scaling down ruby-ex-4 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
    Scaling ruby-ex-5 up to 1
error: timed out waiting for "ruby-ex-5" to be synced

adelton avatar Oct 17 '17 15:10 adelton

in whatever process is running the openshift controllers, which depends on your setup method:

  • stdout of openshift start master
  • output of the apiserver container with oc cluster up
  • journalctl -u atomic-openshift-controllers.service on an installed system
  • etc

liggitt avatar Oct 17 '17 15:10 liggitt

Thanks, I've got the reproducer and the error messages now.

As for

Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.

is the re-retrieval of the token already done somewhere in the code base so that we could reuse the same code, or is this the first time something like this is implemented in Kubernetes/OpenShift?

adelton avatar Oct 20 '17 12:10 adelton

Observation on Kubernetes master: when secret/replication-controller-token-* is deleted, it gets automatically recreated:

$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-czbln       kubernetes.io/service-account-token   3         3m
$ cluster/kubectl.sh delete secret replication-controller-token-czbln -n kube-system
secret "replication-controller-token-czbln" deleted
$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-fkfbn       kubernetes.io/service-account-token   3         2s
$ 

If I keep creating pods from

apiVersion: v1
kind: Pod
metadata:
  generateName: test-security-context-
spec:
  restartPolicy: Never
  containers:
  - name: test-security-context
    image: centos:latest
    command:
    - "sleep"
    - "infinity"

via cluster/kubectl.sh create -f pod.yaml in a loop, and also keep deleting secret/replication-controller-token-*, the containers still get created and keep running.

Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8? Or am I way off in trying to reproduce and observe the behaviour?

adelton avatar Oct 24 '17 13:10 adelton

Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8?

No. The tokens get recreated, but a running controller manager keeps trying to use the old tokens until restarted.

liggitt avatar Oct 24 '17 14:10 liggitt

Ah, mea culpa, I was using a pod file when I wanted to work with a replication controller file.

With

apiVersion: v1
kind: ReplicationController
metadata:
  name: test-token-removal-rc-1
spec:
  replicas: 1
  selector:
    app: test-token-removal
  template:
    metadata:
      name: test-token-removal
      labels:
        app: test-token-removal
    spec:
      containers:
      - name: test-token-removal-pod-1
        image: centos:latest
        command:
        - "sleep"
        - "infinity"

After the token was rotated, I see replication_controller.go:422] Unauthorized in the controller log and the pod does not get created.

adelton avatar Oct 24 '17 14:10 adelton

What's the preferred method to go about it?

Should the individual controllers exit when they start getting errors.IsUnauthorized(err) and the controller manager notice that they are gone and restart them?

Or should we add a way for the controller manager to call into the controllers and pass the new tokens to the client objects?

adelton avatar Oct 26 '17 08:10 adelton

not sure... if it were easy we would have done it earlier :)

options that immediately come to mind:

  • modify the controllers to exit on persistent 401s, and the controller manager to restart them (large, lots of upstream changes)
  • modify the clientbuilder to create clients for the controllers that embed the ability to fetch a new credential when a persistent 401 is encountered (would likely involve a custom WrapTransport in the config, rather than populating the BearerToken)

liggitt avatar Nov 02 '17 18:11 liggitt

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Feb 24 '18 11:02 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot avatar Mar 26 '18 11:03 openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot avatar Apr 25 '18 11:04 openshift-bot

/unassign

liggitt avatar Aug 30 '18 16:08 liggitt

/unassign

@stlaz @sttts @mfojtik

enj avatar Oct 16 '19 15:10 enj