Make controller clients more resilient to token changes
Controllers are using service account tokens to make API calls for builds/deployments/replication controllers.
There are circumstances where their API tokens are no longer valid:
- Token is deleted/rotated
- Signing key is changed
Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.
I question that we view this as p2 1.5 years later :)
more things are using controllers now. I'd call it p1
Could you provide steps to reproduce the situation which currently leads to failure?
With the controllers running, remove the service account tokens in the openshift-infra or kube-system namespaces. The controllers start encountering 401 errors but do not exit; exiting would allow them to pick up new valid credentials.
I have an OpenShift 3.7 all-in-one setup.
I've created a deployment via
oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
and got
# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest
  deployment #1 deployed 20 hours ago - 1 pod

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.
There are no secrets in
# oc get secrets -n kuby-system
No resources found.
but there's a bunch of them in openshift-infra:
# oc get secrets -n openshift-infra
NAME                                                      TYPE                                   DATA      AGE
[...]
service-serving-cert-controller-dockercfg-3jls5           kubernetes.io/dockercfg                1         20h
service-serving-cert-controller-token-0vlf6               kubernetes.io/service-account-token    4         20h
service-serving-cert-controller-token-p6672               kubernetes.io/service-account-token    4         20h
serviceaccount-controller-dockercfg-xt97c                 kubernetes.io/dockercfg                1         20h
serviceaccount-controller-token-5tp8n                     kubernetes.io/service-account-token    4         20h
serviceaccount-controller-token-zmnf9                     kubernetes.io/service-account-token    4         20h
serviceaccount-pull-secrets-controller-dockercfg-1clmr    kubernetes.io/dockercfg                1         20h
serviceaccount-pull-secrets-controller-token-41gmz        kubernetes.io/service-account-token    4         20h
serviceaccount-pull-secrets-controller-token-7p29z        kubernetes.io/service-account-token    4         20h
[...]
I've deleted the two serviceaccount-controller-token secrets:
# oc delete secret/serviceaccount-controller-token-5tp8n -n openshift-infra
secret "serviceaccount-controller-token-5tp8n" deleted
# oc delete secret/serviceaccount-controller-token-zmnf9 -n openshift-infra
secret "serviceaccount-controller-token-zmnf9" deleted
I have redeployed with
# oc rollout latest dc/ruby-ex
deploymentconfig "ruby-ex" rolled out
but things are still passing:
# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest
  deployment #2 deployed 4 minutes ago - 1 pod
  deployment #1 deployed 20 hours ago

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.
with no (relevant) errors in the node log.
What should I have done differently to reproduce the issue?
> There are no secrets in
> # oc get secrets -n kuby-system
kube-system :)
The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.
> kube-system :)
Ouch. Sorry about that.
> The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.
I've deleted
# oc delete secret/replication-controller-token-8zfdh secret/replication-controller-token-v41d4 -n kube-system
and that seems to have disrupted the rc/ruby-ex:
ruby-ex-5-deploy 0/1 Error 0 1h
Where do I find the controller logs? In the -deploy pod log, there does not seem to be anything specific about the 401 errors:
# oc logs ruby-ex-5-deploy
--> Scaling up ruby-ex-5 from 0 to 1, scaling down ruby-ex-4 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling ruby-ex-5 up to 1
error: timed out waiting for "ruby-ex-5" to be synced
In whatever process is running the openshift controllers, which depends on your setup method:
- stdout of openshift start master
- output of the apiserver container with oc cluster up
- journalctl -u atomic-openshift-controllers.service on an installed system
- etc.
Thanks, I've got the reproducer and the error messages now.
As for
> Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.
is the re-retrieval of the token already done somewhere in the code base so that we could reuse the same code, or is this the first time something like this is implemented in Kubernetes/OpenShift?
Observation on Kubernetes master: when secret/replication-controller-token-* is deleted, it gets automatically recreated:
$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-czbln kubernetes.io/service-account-token 3 3m
$ cluster/kubectl.sh delete secret replication-controller-token-czbln -n kube-system
secret "replication-controller-token-czbln" deleted
$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-fkfbn kubernetes.io/service-account-token 3 2s
$
If I keep creating pods from
apiVersion: v1
kind: Pod
metadata:
  generateName: test-security-context-
spec:
  restartPolicy: Never
  containers:
  - name: test-security-context
    image: centos:latest
    command:
    - "sleep"
    - "infinity"
via cluster/kubectl.sh create -f pod.yaml in a loop, and I also keep deleting secret/replication-controller-token-*, the containers still get started and keep running.
Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8? Or am I way off in trying to reproduce and observe the behaviour?
> Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8?
No. The tokens get recreated, but a running controller manager keeps trying to use the old tokens until restarted.
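To make the "until restarted" part concrete, here is a minimal sketch of how a controller client typically gets its credential, assuming client-go's rest.Config. The host, token path, and variable names are illustrative, not the actual controller-manager code; the point is that the token is read once and baked into the config, so a rotated secret is never picked up by the live client.

package main

import (
    "io/ioutil"
    "log"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // The token is read exactly once, at client construction time.
    tokenBytes, err := ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
    if err != nil {
        log.Fatal(err)
    }

    cfg := &rest.Config{
        Host:        "https://master.example.com:8443", // placeholder API server URL
        BearerToken: string(tokenBytes),                // fixed for the lifetime of the client
    }

    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // Every request made through this clientset keeps sending the original
    // token, even after the secret backing it has been deleted or rotated.
    _ = client
}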
Ah, mea culpa, I was using a pod file when I wanted to work with a replication controller file.
With
apiVersion: v1
kind: ReplicationController
metadata:
  name: test-token-removal-rc-1
spec:
  replicas: 1
  selector:
    app: test-token-removal
  template:
    metadata:
      name: test-token-removal
      labels:
        app: test-token-removal
    spec:
      containers:
      - name: test-token-removal-pod-1
        image: centos:latest
        command:
        - "sleep"
        - "infinity"
I see the replication_controller.go:422] Unauthorized error, and the pod does not get created after the token was rotated.
What's the preferred method to go about it?
Should the individual controllers exit when they start getting errors.IsUnauthorized(err), and should the controller manager notice that they are gone and restart them?
Or should we add a way for the controller manager to call into the controllers and pass the new tokens to the client objects?
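To make the first option concrete, a rough sketch (the helper name, counter, and threshold are made up for illustration; this is not existing code): the controller would treat a run of Unauthorized errors as fatal and exit, relying on whatever supervises it to restart it with a freshly issued token.

package main

import (
    "log"
    "os"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// handleSyncError counts consecutive 401s from the API server and exits the
// process once they look persistent, so the supervising process (controller
// manager, systemd, ...) can restart the controller with a fresh credential.
func handleSyncError(err error, consecutive401s *int, threshold int) {
    if err == nil || !apierrors.IsUnauthorized(err) {
        *consecutive401s = 0 // any non-401 outcome resets the counter
        return
    }
    *consecutive401s++
    if *consecutive401s >= threshold {
        log.Printf("persistent 401s from the API server, exiting: %v", err)
        os.Exit(1)
    }
}

func main() {
    // Simulate a sync loop that keeps hitting 401s after a token rotation.
    count := 0
    for i := 0; i < 5; i++ {
        handleSyncError(apierrors.NewUnauthorized("token no longer valid"), &count, 3)
    }
}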
not sure... if it were easy we would have done it earlier :)
options that immediately come to mind:
- modify the controllers to exit on persistent 401s, and the controller manager to restart them (large, lots of upstream changes)
- modify the clientbuilder to create clients for the controllers that embed the ability to fetch a new credential when a persistent 401 is encountered (would likely involve a custom WrapTransport in the config, rather than populating the BearerToken)
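A rough sketch of what the second option could look like, assuming client-go's rest.Config and a custom RoundTripper; the type and function names (refreshingTokenTripper, newConfig) and the token source are hypothetical, not the actual clientbuilder code:

package main

import (
    "io/ioutil"
    "net/http"
    "sync"

    "k8s.io/client-go/rest"
)

// refreshingTokenTripper injects a bearer token into each request and drops
// its cached token after a 401, so the next request re-reads a (possibly
// rotated) token instead of using one fixed at client construction time.
type refreshingTokenTripper struct {
    base      http.RoundTripper
    tokenFile string

    mu    sync.Mutex
    token string
}

func (t *refreshingTokenTripper) RoundTrip(req *http.Request) (*http.Response, error) {
    t.mu.Lock()
    if t.token == "" {
        if b, err := ioutil.ReadFile(t.tokenFile); err == nil {
            t.token = string(b)
        }
    }
    token := t.token
    t.mu.Unlock()

    // Clone the request before mutating headers, per the RoundTripper contract.
    req = req.Clone(req.Context())
    req.Header.Set("Authorization", "Bearer "+token)

    resp, err := t.base.RoundTrip(req)
    if err == nil && resp.StatusCode == http.StatusUnauthorized {
        t.mu.Lock()
        t.token = "" // force a re-read on the next request
        t.mu.Unlock()
    }
    return resp, err
}

// newConfig wires the wrapper in via WrapTransport instead of BearerToken.
func newConfig(host, tokenFile string) *rest.Config {
    return &rest.Config{
        Host: host,
        WrapTransport: func(rt http.RoundTripper) http.RoundTripper {
            return &refreshingTokenTripper{base: rt, tokenFile: tokenFile}
        },
    }
}

func main() {
    cfg := newConfig("https://master.example.com:8443",
        "/var/run/secrets/kubernetes.io/serviceaccount/token")
    _ = cfg // pass to kubernetes.NewForConfig(cfg) in a real controller
}

In a real controller the re-fetch would have to go back to the API (or the tokens controller) rather than a file, but the WrapTransport hook is what lets the credential change without rebuilding the client.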
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
/unassign
/unassign
@stlaz @sttts @mfojtik