Wait until ready returns success immediately
Is this a BUG REPORT or FEATURE REQUEST?:
- bug
What happened: The put step with a wait_until_ready_selector is returning success immediately. It's almost too fast for its own good!
What you expected to happen: I expect the wait step to wait until the deployment update is complete.
How to reproduce it (as minimally and precisely as possible): Have a Kubernetes deployment with a normal RollingUpdate strategy. Use this kubernetes-resource to put changes to the deployment like so:
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
    wait_until_ready: 60
    wait_until_ready_selector: app=my-app
```
When this step runs I see this in the output:
```
+ kubectl apply -f ymls/my-app-deployment.yml
deployment.extensions/my-app configured
Waiting for pods to be ready for 60s (interval: 3s, selector: app=my-app)
Waiting for pods to be ready... (0/0)
```
This returns success immediately despite the fact that the new pod/replicaset hasn't actually spun up yet. It seems like the resource is checking the ready status before the new pod is even created. Likely some kind of race condition with Kubernetes.
Environment:
- Concourse CI version: 5.7.0
- kubernetes-resource image version: 1.15
@mattdodge Hey! Thanks a lot for your detailed report.
This does seem like a race issue.
We currently rely solely on `wait_until_ready_selector` to select which pods are examined for readiness. I believe this can result in including pods from the old generation of the deployment (or, right after the apply, matching no pods at all, as in your 0/0 output).
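Roughly speaking, the failure mode can be illustrated with a selector-only readiness loop like the sketch below. This is only an illustration, not the resource's actual script; the selector, timeout, and interval values are taken from your output above:

```sh
# Simplified illustration of a selector-only readiness check (not the actual
# resource code). Right after a fresh rollout there may be zero matching pods,
# so ready == total == 0 and the loop exits successfully right away. Likewise,
# old-generation pods that still match the selector would be counted here.
SELECTOR="app=my-app"
TIMEOUT=60
INTERVAL=3

elapsed=0
while [ "$elapsed" -lt "$TIMEOUT" ]; do
  total=$(kubectl get pods -l "$SELECTOR" -o name | wc -l)
  ready=$(kubectl get pods -l "$SELECTOR" \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | grep -c True)
  echo "Waiting for pods to be ready... ($ready/$total)"
  [ "$ready" -eq "$total" ] && exit 0   # 0/0 counts as "all ready"
  sleep "$INTERVAL"
  elapsed=$((elapsed + INTERVAL))
done
exit 1
```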
A possible fix here would be to leverage the "revision" number of the deployment, which is inherited down to the replicaset and pods (a rough kubectl sketch follows the list below):
- Look for the new "revision" of the deployment:

  ```yaml
  apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "3"
  # snip
  ```
- Look for the replicaset generated for that revision. You could use the `deployment.kubernetes.io/revision: "3"` annotation in combination with the owner reference to select the replicaset:

  ```yaml
  apiVersion: extensions/v1beta1
  kind: ReplicaSet
  metadata:
    annotations:
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "3"
    creationTimestamp: "2019-11-05T09:08:38Z"
    generation: 1
    labels:
      app: envoy
      component: controller
      pod-template-hash: 67dc7dd6c5
    name: envoy-67dc7dd6c5
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: Deployment
      name: envoy
  # snip
  ```
- Use `wait_until_ready_selector` along with `pod-template-hash` or the owner reference so that only pods from the latest revision are counted:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
    creationTimestamp: "2019-11-05T09:08:38Z"
    generateName: envoy-67dc7dd6c5-
    labels:
      app: envoy
      component: controller
      pod-template-hash: 67dc7dd6c5
    name: envoy-67dc7dd6c5-8m6hc
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: envoy-67dc7dd6c5
      uid: dab0f913-ffab-11e9-92cd-025000000001
  # snip
  ```
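To make this concrete, here is a rough shell sketch of those three lookups chained together with kubectl. It is only an illustration of the idea, not code from this resource; `DEPLOYMENT` and `NAMESPACE` are placeholders, and the `app=$DEPLOYMENT` label is simply assumed to match the deployment's replicasets as in the example above.

```sh
# Illustrative only: chain deployment revision -> replicaset -> pods.
DEPLOYMENT=envoy      # placeholder deployment name
NAMESPACE=default     # placeholder namespace

# 1. Read the current "revision" annotation from the deployment.
revision=$(kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
  -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')

# 2. Find the replicaset that carries the same revision annotation
#    (simple loop instead of a complex jsonpath filter).
rs=""
for name in $(kubectl -n "$NAMESPACE" get rs -l "app=$DEPLOYMENT" -o name); do
  r=$(kubectl -n "$NAMESPACE" get "$name" \
    -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
  [ "$r" = "$revision" ] && rs="$name"
done

# 3. Use that replicaset's pod-template-hash label to select only the pods
#    belonging to the new revision.
hash=$(kubectl -n "$NAMESPACE" get "$rs" \
  -o jsonpath='{.metadata.labels.pod-template-hash}')
kubectl -n "$NAMESPACE" get pods -l "app=$DEPLOYMENT,pod-template-hash=$hash"
```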
But implementing this straightforwardly requires us to change the configuration syntax, maybe adding `wait_until_ready_deployment: DEPLOYMENT_NAME` and deprecating and removing the existing `wait_until_ready_selector`. I'd hope we can avoid that if possible.
@superbrothers WDYT? Did you have any reason to avoid relying on the "revision" number?
I would be in favor of a wait_until_ready_deployment (or something similar) option. It's not quite as flexible as the label selector but it probably covers most of the common use cases.
If we're going that route we could probably make use of the kubectl rollout status command too, rather than digging through revision numbers and all that. This command will wait for the deployment to successfully roll out or will return an error if the timeout is hit. This is actually how I'm getting around this race condition for now. I have a pipeline YML that looks like this:
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
- put: prod-kube
  params:
    kubectl: rollout status deployment/my-app --timeout 60s
```
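For what it's worth, if a `wait_until_ready_deployment` option were added, it could presumably boil down to something like the sketch below inside the resource. This is purely a sketch; the option names and variables here are assumptions from this thread, not existing parameters of this resource:

```sh
# Hypothetical sketch: wait_until_ready_deployment / wait_until_ready are
# assumed option names from this discussion, not real resource parameters.
deployment="$wait_until_ready_deployment"   # e.g. my-app
timeout="${wait_until_ready:-60}"           # seconds, default assumed

kubectl rollout status "deployment/$deployment" --timeout "${timeout}s"
```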
I'm sorry for the late reply.
> But implementing this straightforwardly requires us to change the configuration syntax, maybe adding `wait_until_ready_deployment: DEPLOYMENT_NAME` and deprecating and removing the existing `wait_until_ready_selector`. I'd hope we can avoid that if possible.
Adding `wait_until_ready_<resource>` options is not flexible, so I'd like to avoid that as much as possible.
> If we're going that route we could probably make use of the `kubectl rollout status` command too, rather than digging through revision numbers and all that.
Yes, I think that is the right way. However, `wait_until_ready` is a required parameter, so changing it to an optional parameter would be a breaking change. For now, I recommend setting `wait_until_ready` to 0, which means don't wait. (I know this is too verbose...)
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
    wait_until_ready: 0
- put: prod-kube
  params:
    kubectl: rollout status deployment/my-app --timeout 60s
    wait_until_ready: 0
```
I will consider deleting the `wait_until_ready` param and bumping up the major version.