Make controller clients more resilient to token changes
Controllers are using service account tokens to make API calls for builds/deployments/replication controllers.
There are circumstances where their API tokens are no longer valid:
- Token is deleted/rotated
- Signing key is changed
Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.
I question that we view this as p2 1.5 years later :)
more things are using controllers now. I'd call it p1
Could you provide steps to reproduce the situation which currently leads to failure?
With the controllers running, remove the service account tokens in the openshift-infra or kube-system namespaces. The controllers start encountering 401 errors but do not exit; exiting would allow them to pick up new valid credentials.
I have an OpenShift 3.7 all-in-one setup.
I've created a deployment via
oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
and got
# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest
  deployment #1 deployed 20 hours ago - 1 pod

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.
There are no secrets in
# oc get secrets -n kuby-system
No resources found.
but there's a bunch of them in openshift-infra:
# oc get secrets -n openshift-infra
NAME                                                      TYPE                                   DATA      AGE
[...]
service-serving-cert-controller-dockercfg-3jls5           kubernetes.io/dockercfg                1         20h
service-serving-cert-controller-token-0vlf6               kubernetes.io/service-account-token    4         20h
service-serving-cert-controller-token-p6672               kubernetes.io/service-account-token    4         20h
serviceaccount-controller-dockercfg-xt97c                 kubernetes.io/dockercfg                1         20h
serviceaccount-controller-token-5tp8n                     kubernetes.io/service-account-token    4         20h
serviceaccount-controller-token-zmnf9                     kubernetes.io/service-account-token    4         20h
serviceaccount-pull-secrets-controller-dockercfg-1clmr    kubernetes.io/dockercfg                1         20h
serviceaccount-pull-secrets-controller-token-41gmz        kubernetes.io/service-account-token    4         20h
serviceaccount-pull-secrets-controller-token-7p29z        kubernetes.io/service-account-token    4         20h
[...]
I've deleted the two serviceaccount-controller-token secrets:
# oc delete secret/serviceaccount-controller-token-5tp8n -n openshift-infra
secret "serviceaccount-controller-token-5tp8n" deleted
# oc delete secret/serviceaccount-controller-token-zmnf9 -n openshift-infra
secret "serviceaccount-controller-token-zmnf9" deleted
I have redeployed with
# oc rollout latest dc/ruby-ex
deploymentconfig "ruby-ex" rolled out
but things are still passing:
# oc status
In project innovation-2017 on server https://qe-blade-11.idmqe.lab.eng.bos.redhat.com:8443

http://ruby-ex.example.test to pod port 8080-tcp (svc/ruby-ex)
  dc/ruby-ex deploys istag/ruby-ex:latest <-
    bc/ruby-ex source builds https://github.com/openshift/ruby-ex.git on istag/ruby-22-centos7:latest
  deployment #2 deployed 4 minutes ago - 1 pod
  deployment #1 deployed 20 hours ago

View details with 'oc describe <resource>/<name>' or list everything with 'oc get all'.
with no (relevant) errors in the node log.
What should I have done differently to reproduce the issue?
> There are no secrets in
> # oc get secrets -n kuby-system
kube-system :)
The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.
> kube-system :)
Ouch. Sorry about that.
> The errors would appear in the controller logs. For example, deleting the build-controller-* secrets would disrupt the build controller.
I've deleted
# oc delete secret/replication-controller-token-8zfdh secret/replication-controller-token-v41d4 -n kube-system
and that seems to have disrupted the rc/ruby-ex:
ruby-ex-5-deploy 0/1 Error 0 1h
Where do I find the controller logs? In the -deploy pod log, there does not seem to be anything specific about the 401 errors:
# oc logs ruby-ex-5-deploy
--> Scaling up ruby-ex-5 from 0 to 1, scaling down ruby-ex-4 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling ruby-ex-5 up to 1
error: timed out waiting for "ruby-ex-5" to be synced
In whatever process is running the openshift controllers, which depends on your setup method:
- stdout of openshift start master
- output of the apiserver container with oc cluster up
- journalctl -u atomic-openshift-controllers.service on an installed system
- etc.
Thanks, I've got the reproducer and the error messages now.
As for
> Under those circumstances, the infrastructure components should be able to obtain an updated token without requiring the admin to manually delete/recreate tokens or restart the server.
is the re-retrieval of the token already done somewhere in the code base so that we could reuse the same code, or is this the first time something like this is implemented in Kubernetes/OpenShift?
Observation on Kubernetes master: when secret/replication-controller-token-* is deleted, it gets automatically recreated:
$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-czbln kubernetes.io/service-account-token 3 3m
$ cluster/kubectl.sh delete secret replication-controller-token-czbln -n kube-system
secret "replication-controller-token-czbln" deleted
$ cluster/kubectl.sh get secrets -n kube-system | grep replication-controller-token
replication-controller-token-fkfbn kubernetes.io/service-account-token 3 2s
$
If I keep creating pods from
apiVersion: v1
kind: Pod
metadata:
  generateName: test-security-context-
spec:
  restartPolicy: Never
  containers:
  - name: test-security-context
    image: centos:latest
    command:
    - "sleep"
    - "infinity"
via cluster/kubectl.sh create -f pod.yaml in a loop, and I also keep deleting secret/replication-controller-token-*, the containers still get started and keep running.
Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8? Or am I way off in trying to reproduce and observe the behaviour?
> Does this mean that Kubernetes addressed (at least some of) the issue in the latest version and OpenShift will get it in 3.8?
No. The tokens get recreated, but a running controller manager keeps trying to use the old tokens until restarted.
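To make the "until restarted" part concrete, here is a minimal sketch of how a controller client typically gets its credential, assuming client-go's rest.Config. The host, token path, and variable names are illustrative, not the actual controller-manager code; the point is that the token is read once and baked into the config, so a rotated secret is never picked up by the live client.

package main

import (
    "io/ioutil"
    "log"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // The token is read exactly once, at client construction time.
    tokenBytes, err := ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
    if err != nil {
        log.Fatal(err)
    }

    cfg := &rest.Config{
        Host:        "https://master.example.com:8443", // placeholder API server URL
        BearerToken: string(tokenBytes),                // fixed for the lifetime of the client
    }

    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // Every request made through this clientset keeps sending the original
    // token, even after the secret backing it has been deleted or rotated.
    _ = client
}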
Ah, mea culpa, I was using a pod file when I wanted to work with a replication controller file.
With
apiVersion: v1
kind: ReplicationController
metadata:
  name: test-token-removal-rc-1
spec:
  replicas: 1
  selector:
    app: test-token-removal
  template:
    metadata:
      name: test-token-removal
      labels:
        app: test-token-removal
    spec:
      containers:
      - name: test-token-removal-pod-1
        image: centos:latest
        command:
        - "sleep"
        - "infinity"
I see the replication_controller.go:422] Unauthorized error, and the pod does not get created after the token was rotated.
What's the preferred method to go about it?
Should the individual controllers exit when they start getting errors.IsUnauthorized(err), and should the controller manager notice that they are gone and restart them?
Or should we add a way for the controller manager to call into the controllers and pass the new tokens to the client objects?
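To make the first option concrete, a rough sketch (the helper name, counter, and threshold are made up for illustration; this is not existing code): the controller would treat a run of Unauthorized errors as fatal and exit, relying on whatever supervises it to restart it with a freshly issued token.

package main

import (
    "log"
    "os"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// handleSyncError counts consecutive 401s from the API server and exits the
// process once they look persistent, so the supervising process (controller
// manager, systemd, ...) can restart the controller with a fresh credential.
func handleSyncError(err error, consecutive401s *int, threshold int) {
    if err == nil || !apierrors.IsUnauthorized(err) {
        *consecutive401s = 0 // any non-401 outcome resets the counter
        return
    }
    *consecutive401s++
    if *consecutive401s >= threshold {
        log.Printf("persistent 401s from the API server, exiting: %v", err)
        os.Exit(1)
    }
}

func main() {
    // Simulate a sync loop that keeps hitting 401s after a token rotation.
    count := 0
    for i := 0; i < 5; i++ {
        handleSyncError(apierrors.NewUnauthorized("token no longer valid"), &count, 3)
    }
}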
not sure... if it were easy we would have done it earlier :)
options that immediately come to mind:
- modify the controllers to exit on persistent 401s, and the controller manager to restart them (large, lots of upstream changes)
- modify the clientbuilder to create clients for the controllers that embed the ability to fetch a new credential when a persistent 401 is encountered (would likely involve a custom WrapTransport in the config, rather than populating the BearerToken)
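A rough sketch of what the second option could look like, assuming client-go's rest.Config and a custom RoundTripper; the type and function names (refreshingTokenTripper, newConfig) and the token source are hypothetical, not the actual clientbuilder code:

package main

import (
    "io/ioutil"
    "net/http"
    "sync"

    "k8s.io/client-go/rest"
)

// refreshingTokenTripper injects a bearer token into each request and drops
// its cached token after a 401, so the next request re-reads a (possibly
// rotated) token instead of using one fixed at client construction time.
type refreshingTokenTripper struct {
    base      http.RoundTripper
    tokenFile string

    mu    sync.Mutex
    token string
}

func (t *refreshingTokenTripper) RoundTrip(req *http.Request) (*http.Response, error) {
    t.mu.Lock()
    if t.token == "" {
        if b, err := ioutil.ReadFile(t.tokenFile); err == nil {
            t.token = string(b)
        }
    }
    token := t.token
    t.mu.Unlock()

    // Clone the request before mutating headers, per the RoundTripper contract.
    req = req.Clone(req.Context())
    req.Header.Set("Authorization", "Bearer "+token)

    resp, err := t.base.RoundTrip(req)
    if err == nil && resp.StatusCode == http.StatusUnauthorized {
        t.mu.Lock()
        t.token = "" // force a re-read on the next request
        t.mu.Unlock()
    }
    return resp, err
}

// newConfig wires the wrapper in via WrapTransport instead of BearerToken.
func newConfig(host, tokenFile string) *rest.Config {
    return &rest.Config{
        Host: host,
        WrapTransport: func(rt http.RoundTripper) http.RoundTripper {
            return &refreshingTokenTripper{base: rt, tokenFile: tokenFile}
        },
    }
}

func main() {
    cfg := newConfig("https://master.example.com:8443",
        "/var/run/secrets/kubernetes.io/serviceaccount/token")
    _ = cfg // pass to kubernetes.NewForConfig(cfg) in a real controller
}

In a real controller the re-fetch would have to go back to the API (or the tokens controller) rather than a file, but the WrapTransport hook is what lets the credential change without rebuilding the client.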
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
/unassign
/unassign
@stlaz @sttts @mfojtik