Wait until ready returns success immediately
Is this a BUG REPORT or FEATURE REQUEST?:
- bug
What happened: The put step with a wait_until_ready_selector is returning success immediately. It's almost too fast for its own good!
What you expected to happen: I expect the wait step to wait until the deployment update is complete.
How to reproduce it (as minimally and precisely as possible): Have a Kubernetes deployment with a normal RollingUpdate strategy. Use this kubernetes-resource to put changes to the deployment like so:
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
    wait_until_ready: 60
    wait_until_ready_selector: app=my-app
```
When this step runs I see this in the output:
```
+ kubectl apply -f ymls/my-app-deployment.yml
deployment.extensions/my-app configured
Waiting for pods to be ready for 60s (interval: 3s, selector: app=my-app)
Waiting for pods to be ready... (0/0)
```
This returns success immediately despite the fact that the new pod/replicaset hasn't actually spun up yet. It seems like the resource is checking the ready status before the new pod is even created. Likely some kind of race condition with Kubernetes.
Environment:
- Concourse CI version: 5.7.0
- kubernetes-resource image version: 1.15
@mattdodge Hey! Thanks a lot for your detailed report.
This does seem like a race issue.
We currently rely solely on `wait_until_ready_selector` to select which pods are examined for readiness. I believe this can result in including pods from the old generation of the deployment (or, right after the apply, matching no pods at all, as in your 0/0 output).
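Roughly speaking, the failure mode can be illustrated with a selector-only readiness loop like the sketch below. This is only an illustration, not the resource's actual script; the selector, timeout, and interval values are taken from your output above:

```sh
# Simplified illustration of a selector-only readiness check (not the actual
# resource code). Right after a fresh rollout there may be zero matching pods,
# so ready == total == 0 and the loop exits successfully right away. Likewise,
# old-generation pods that still match the selector would be counted here.
SELECTOR="app=my-app"
TIMEOUT=60
INTERVAL=3

elapsed=0
while [ "$elapsed" -lt "$TIMEOUT" ]; do
  total=$(kubectl get pods -l "$SELECTOR" -o name | wc -l)
  ready=$(kubectl get pods -l "$SELECTOR" \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | grep -c True)
  echo "Waiting for pods to be ready... ($ready/$total)"
  [ "$ready" -eq "$total" ] && exit 0   # 0/0 counts as "all ready"
  sleep "$INTERVAL"
  elapsed=$((elapsed + INTERVAL))
done
exit 1
```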
A possible fix here would be to leverage the "revision" number of the deployment, which is inherited down to the replicaset and pods (a rough kubectl sketch follows the list below):
- Look for the new "revision" of the deployment:

  ```yaml
  apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "3"
  # snip
  ```
- Look for the replicaset generated for that revision. You could use the `deployment.kubernetes.io/revision: "3"` annotation in combination with the owner reference to select the replicaset:

  ```yaml
  apiVersion: extensions/v1beta1
  kind: ReplicaSet
  metadata:
    annotations:
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "3"
    creationTimestamp: "2019-11-05T09:08:38Z"
    generation: 1
    labels:
      app: envoy
      component: controller
      pod-template-hash: 67dc7dd6c5
    name: envoy-67dc7dd6c5
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: Deployment
      name: envoy
  # snip
  ```
- Use `wait_until_ready_selector` along with `pod-template-hash` or the owner reference so that only pods from the latest revision are counted:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
    creationTimestamp: "2019-11-05T09:08:38Z"
    generateName: envoy-67dc7dd6c5-
    labels:
      app: envoy
      component: controller
      pod-template-hash: 67dc7dd6c5
    name: envoy-67dc7dd6c5-8m6hc
    namespace: default
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: envoy-67dc7dd6c5
      uid: dab0f913-ffab-11e9-92cd-025000000001
  # snip
  ```
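To make this concrete, here is a rough shell sketch of those three lookups chained together with kubectl. It is only an illustration of the idea, not code from this resource; `DEPLOYMENT` and `NAMESPACE` are placeholders, and the `app=$DEPLOYMENT` label is simply assumed to match the deployment's replicasets as in the example above.

```sh
# Illustrative only: chain deployment revision -> replicaset -> pods.
DEPLOYMENT=envoy      # placeholder deployment name
NAMESPACE=default     # placeholder namespace

# 1. Read the current "revision" annotation from the deployment.
revision=$(kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
  -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')

# 2. Find the replicaset that carries the same revision annotation
#    (simple loop instead of a complex jsonpath filter).
rs=""
for name in $(kubectl -n "$NAMESPACE" get rs -l "app=$DEPLOYMENT" -o name); do
  r=$(kubectl -n "$NAMESPACE" get "$name" \
    -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
  [ "$r" = "$revision" ] && rs="$name"
done

# 3. Use that replicaset's pod-template-hash label to select only the pods
#    belonging to the new revision.
hash=$(kubectl -n "$NAMESPACE" get "$rs" \
  -o jsonpath='{.metadata.labels.pod-template-hash}')
kubectl -n "$NAMESPACE" get pods -l "app=$DEPLOYMENT,pod-template-hash=$hash"
```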
But implementing this straightforwardly requires us to change the configuration syntax, maybe adding `wait_until_ready_deployment: DEPLOYMENT_NAME` and deprecating and removing the existing `wait_until_ready_selector`. I'd hope we can avoid that if possible.
@superbrothers WDYT? Did you have any reason to avoid relying on the "revision" number?
I would be in favor of a wait_until_ready_deployment (or something similar) option. It's not quite as flexible as the label selector but it probably covers most of the common use cases.
If we're going that route we could probably make use of the kubectl rollout status command too, rather than digging through revision numbers and all that. This command will wait for the deployment to successfully roll out or will return an error if the timeout is hit. This is actually how I'm getting around this race condition for now. I have a pipeline YML that looks like this:
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
- put: prod-kube
  params:
    kubectl: rollout status deployment/my-app --timeout 60s
```
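For what it's worth, if a `wait_until_ready_deployment` option were added, it could presumably boil down to something like the sketch below inside the resource. This is purely a sketch; the option names and variables here are assumptions from this thread, not existing parameters of this resource:

```sh
# Hypothetical sketch: wait_until_ready_deployment / wait_until_ready are
# assumed option names from this discussion, not real resource parameters.
deployment="$wait_until_ready_deployment"   # e.g. my-app
timeout="${wait_until_ready:-60}"           # seconds, default assumed

kubectl rollout status "deployment/$deployment" --timeout "${timeout}s"
```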
I'm sorry for the late reply.
> But implementing this straightforwardly requires us to change the configuration syntax, maybe adding `wait_until_ready_deployment: DEPLOYMENT_NAME` and deprecating and removing the existing `wait_until_ready_selector`. I'd hope we can avoid that if possible.
Adding `wait_until_ready_<resource>` options is not flexible, so I'd like to avoid that as much as possible.
> If we're going that route we could probably make use of the `kubectl rollout status` command too, rather than digging through revision numbers and all that.
Yes, I think that is the right way. However, `wait_until_ready` is a required parameter, so changing it to an optional parameter would be a breaking change. For now, I recommend setting `wait_until_ready` to 0, which means don't wait. (I know this is too verbose...)
```yaml
- put: prod-kube
  params:
    kubectl: apply -f ymls/my-app-deployment.yml
    wait_until_ready: 0
- put: prod-kube
  params:
    kubectl: rollout status deployment/my-app --timeout 60s
    wait_until_ready: 0
```
I will consider deleting the `wait_until_ready` param and bumping up the major version.