
Build timeout may have a problem

Open zrss opened this issue 6 years ago • 14 comments

/area test-and-release /kind bug /kind dev

Expected Behavior

The Job resource in Kubernetes has an activeDeadlineSeconds (ADS) field; the build controller's timeout is very similar to it.

Currently, the build controller handles a build item from the workqueue, and the workqueue is filled by the build informer, i.e. when:

  • a user creates or updates a Build resource
  • a pod changes
  • an informer resync happens (the resync period is not 0)

The resync period is set to 30s:

https://github.com/knative/build/blob/4dfa5dabd59de26ef5607b02ac7a536cfb05ecea/cmd/controller/main.go#L91

So if the build timeout is less than 30s, the build may live longer than its specified timeout.

Additional Info

Related Kubernetes Job controller issue: Too long time needed for a job to be killed after exceeding relatively short ActiveDeadlineSeconds

zrss avatar Sep 05 '18 13:09 zrss

@zrss: GitHub didn't allow me to assign the following users: user.

Note that only knative members and repo collaborators can be assigned. For more information please see the contributor guide

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot avatar Sep 05 '18 13:09 knative-prow-robot

cc @shashwathi, could you take a look? It seems like a fresh feature ^_^

zrss avatar Sep 05 '18 14:09 zrss

To clarify, this is not about re-requesting the build to start again after it exceeds its timeout.

It's about enqueueing a work item to check for timeout, at the build's specified timeout time, instead of relying on our periodic checks (every 30s) to detect timeout. This would be a good improvement since we could get <30s granularity in our timeout checks.

Is it possible to remove work items when we see the build has completed? Is there a significant cost to keeping ignorable work items in the queue?

imjasonh avatar Sep 05 '18 14:09 imjasonh

Is it possible to remove work items when we see the build has completed?

Deleting the build will make the resource completely ghosted right? How will the user check for build status later?

Is there a significant cost to keeping ignorable work items in the queue?

Seems like it should not be an issue because once the build is marked as Done, it will not be enqueued again. Ref

I think @ImJasonH mentioned the idea of using the Watch API to get more granularity in our timeout checks. I did a POC, but the problem I ran into was that no event is generated when the build never gets any resources allocated (e.g. a pod scheduling timeout). There was not much value in moving away from the pod informer model.

Having more granularity in timeouts would definitely be a good improvement.

shashwathi avatar Sep 05 '18 15:09 shashwathi

Deleting the build will make the resource completely ghosted right? How will the user check for build status later?

We shouldn't delete the Build itself, but maybe we can delete the work item to check on the build's status later, once we've seen the build complete.

imjasonh avatar Sep 05 '18 15:09 imjasonh

Thanks for the reply; sorry for my poor English (er, I am not a native English speaker).

Some points from my view:

Is it possible to remove work items when we see the build has completed?

Add a goroutine to walk through the workqueue? But that seems almost the same as doing it in syncHandler.

zrss avatar Sep 05 '18 15:09 zrss

once we've seen the build complete

  • a common case: the pod informer can see the build complete and update the build status; that's when the build controller's syncHandler handles the build update

At that point we can safely remove the work item, but doing that in a queue is maybe not an easy job?

zrss avatar Sep 05 '18 15:09 zrss

Is it possible to remove work items when we see the build has completed?

We already do that in the build controller: syncHandler does not return an error for successful/completed builds, and the item is removed from the queue.

Ref line

shashwathi avatar Sep 05 '18 15:09 shashwathi

@shashwathi I was looking for exactly that line, thanks for finding it. I think if it's not too ugly in the code, we should try to Forget() the work item we enqueue to check the timeout, since I suspect the vast majority of those work items will be no-ops: most builds won't time out.

imjasonh avatar Sep 05 '18 16:09 imjasonh

Instead of trying to Forget() the work item in the code, how about not returning an error for timed-out builds? As of now, once a build times out, we delete all corresponding resources and update the build status, so there is no value in returning a timeout error to the caller (which would put the item back in the queue again).

The contract would then become: for successful/completed/timed-out builds, syncHandler removes the item from the queue.

We need to remove this line to accomplish this.
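The proposed contract could be sketched as follows; the build type and fields here are invented for illustration, while the real controller works on the Build API object:

```go
package main

import (
	"errors"
	"fmt"
)

// build is a stand-in for the real Build object, for illustration only.
type build struct {
	done     bool
	timedOut bool
}

// syncHandler models the proposed contract: return nil for successful,
// completed, AND timed-out builds, so the workqueue caller Forget()s the
// key instead of re-adding it with AddRateLimited().
func syncHandler(b *build) error {
	switch {
	case b.timedOut:
		// Resources are already cleaned up and the status updated;
		// there is nothing to retry, so don't surface an error.
		return nil
	case b.done:
		return nil
	default:
		return errors.New("still running, requeue")
	}
}

func main() {
	fmt.Println(syncHandler(&build{timedOut: true})) // <nil>
	fmt.Println(syncHandler(&build{}))               // still running, requeue
}
```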

@ImJasonH @zrss : WDYT?

shashwathi avatar Sep 05 '18 17:09 shashwathi

Thanks for the reply @shashwathi. Sure, we should remove this line:

https://github.com/knative/build/blob/2d57fac6d8d294b5c82bcb70df32ef16d817d5c3/pkg/controller/build/controller.go#L306

as it won't call Forget() when the build ends with a timeout.

I wondered about the retry mechanism, so I checked the code, and I found that:

  • Forget() just clears the failure statistics for the key (info we can use for retry-count checks or backoff-time calculation)
  • Done() will remove the key from the queue

And if the same key is later added via AddRateLimited

https://github.com/knative/build/blob/2d57fac6d8d294b5c82bcb70df32ef16d817d5c3/pkg/controller/build/controller.go#L235-L243

it will be delayed by a duration computed from the failure statistics.

In a word: it seems that returning an err without a Forget() call won't automatically trigger a re-enqueue.

zrss avatar Sep 05 '18 21:09 zrss

I found something new about this topic after some digging:

  • the rate-limiting queue has these APIs (there is no remove or delete; only shutdown besides them):
    • Add
    • Get
    • Forget
    • Done
  • Forget() deletes the failure data of a work item, and Done() deletes a work item from the processing set. Sorry for my earlier mistake: neither method changes the actual queue items.
  • we can't remove an item from the workqueue, as it doesn't have a remove or delete API. The only way to take an item off the queue is the Get method, but we can't just Get all items to find a useless timeout-check work item and put all the other normal work items back; that's not a good thing to do. WDYT?
  • I think one extra item in the workqueue for timeout checking is not a big deal for the normal operation of the controller. A build timeout may be a rare case in a production env, but how about a dev env? We only add one more chance to quickly check whether the build has timed out; it's worth doing, and it's harmless.
  • there is also the case where a user updates the build's timeout; we should handle that too
  • currently, the build controller doesn't handle delete events from the informer? Why? Maybe I could open a new issue to discuss that; I suspect there is some cleanup we should do when a build is deleted (another story)
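The Forget()/Done() semantics above can be illustrated with a toy model of the queue's internal sets (note: the real workqueue also dedupes pending keys, which this sketch omits):

```go
package main

import "fmt"

// queue is a toy model of the three sets the rate-limited workqueue
// tracks: Done() only removes a key from the "processing" set,
// Forget() only clears its failure stats, and neither one deletes a
// pending item from the queue itself.
type queue struct {
	pending    []string
	processing map[string]bool
	failures   map[string]int
}

func newQueue() *queue {
	return &queue{processing: map[string]bool{}, failures: map[string]int{}}
}

func (q *queue) Add(key string) { q.pending = append(q.pending, key) }

// Get pops the next pending key and marks it as processing.
func (q *queue) Get() string {
	key := q.pending[0]
	q.pending = q.pending[1:]
	q.processing[key] = true
	return key
}

func (q *queue) Done(key string)   { delete(q.processing, key) }
func (q *queue) Forget(key string) { delete(q.failures, key) }

func main() {
	q := newQueue()
	q.Add("build-a")
	q.Add("build-b")

	key := q.Get() // pops "build-a"
	q.Forget(key)  // clears retry stats only
	q.Done(key)    // removes it from the processing set only

	fmt.Println(len(q.pending)) // 1: "build-b" is still queued untouched
}
```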

So I propose these changes:

  • re-enqueue the key of the Build object (er, with a build.Spec.Timeout delay) after the build starts executing
  • handle the case where a build's timeout is updated (in a new updateBuild method)

You can see these changes in https://github.com/knative/build/pull/333

Still open for discussion ~

kindly ping @ImJasonH , @shashwathi , @mattmoor

zrss avatar Sep 08 '18 13:09 zrss

A simpler solution might be to do the following:

  • after adding the Build to the workqueue, also use AddAfter() to enqueue it again after the timeout duration
  • if the build is successful, calling Done() will remove both items from the workqueue
  • if the build has an error, it will be retried, but only until it reaches the timeout
  • if the build times out, it will be removed immediately

@shashwathi ?

nader-ziada avatar Sep 10 '18 18:09 nader-ziada

Thanks @zrss for the detailed comment about the workqueue. That was very helpful for understanding its underlying operations.

In the spirit of keeping this discussion focused on reducing the build timeout granularity from 30s to something less, I am not in favor of the 2nd suggestion @zrss has proposed. I would recommend a separate issue for updating the build timeout.

I like the idea of enqueuing the build with timeout delay.

Another solution: the current 30s sync period (syncHandler) is hard-coded in the build controller. If an operator wants more granularity in syncing builds, we could make this sync period configurable in the build controller. I am proposing this because it would have a similar end result to enqueuing builds with a timeout delay, i.e. forcing build objects to sync more frequently.

shashwathi avatar Sep 10 '18 19:09 shashwathi