Build timeout may not be enforced on time
/area test-and-release /kind bug /kind dev
Expected Behavior
The Job resource in Kubernetes has an activeDeadlineSeconds (ADS) field; the build controller's timeout is very similar to it.
Currently, the build controller handles a build from the workqueue, and the workqueue is filled by the build informer, i.e. when:
- a user creates / updates a Build resource
- a pod changes
- an informer resync happens (the resync period is not 0)
The resync period is set to 30s:
https://github.com/knative/build/blob/4dfa5dabd59de26ef5607b02ac7a536cfb05ecea/cmd/controller/main.go#L91
So if the build timeout is less than 30s, the build may live longer than its actual timeout.
Additional Info
Related Kubernetes Job controller issue: Too long time needed for a job to be killed after exceeding relatively short ActiveDeadlineSeconds
@zrss: GitHub didn't allow me to assign the following users: user.
Note that only knative members and repo collaborators can be assigned. For more information please see the contributor guide
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @shashwathi, could you take a look? It seems like a fresh feature ^_^
To clarify, this is not about re-requesting the build to start again after it exceeds its timeout.
It's about enqueueing a work item to check for timeout, at the build's specified timeout time, instead of relying on our periodic checks (every 30s) to detect timeout. This would be a good improvement since we could get <30s granularity in our timeout checks.
Is it possible to remove work items when we see the build has completed? Is there a significant cost to keeping ignorable work items in the queue?
> Is it possible to remove work items when we see the build has completed?

Deleting the build will make the resource completely ghosted, right? How will the user check the build status later?

> Is there a significant cost to keeping ignorable work items in the queue?

It seems like it should not be an issue, because once the build is marked as Done, it will not be enqueued again. Ref
I think @ImJasonH mentioned the idea of using the Watch API to get more granularity in our timeout checks. I did a POC, but the problem I ran into was that no event is generated when the build never gets any resources allocated (e.g. a pod timeout). There was not much value in moving away from the pod-informer model.
Having more granularity in timeouts would definitely be a good improvement.
> Deleting the build will make the resource completely ghosted right? How will the user check for build status later?

We shouldn't delete the Build itself, but maybe we can delete the work item that checks on the build's status later, once we've seen the build complete.
Thanks for the reply; sorry for my poor English (I am not a native English speaker). Some points from my view:

> Is it possible to remove work items when we see the build has completed?

Add a goroutine to walk through the workqueue? But that seems almost the same as doing it in syncHandler.

> once we've seen the build complete

A common case: the pod informer can see the build complete and update the build status; that's the point at which the build controller's syncHandler can handle the build status update. At that time we could safely remove the work item, but doing that in a queue is maybe not an easy job?
> Is it possible to remove work items when we see the build has completed?

We already do that in the build controller: syncHandler does not return an error for successful/completed builds, and the item is removed from the queue.
@shashwathi I was looking for exactly that line, thanks for finding it. I think if it's not too ugly in the code, we should try to Forget() the work item we enqueue to check the timeout, since I suspect the vast majority of those work items will be no-ops, because most builds won't time out.
Instead of trying to Forget() the work item in the code, how about not returning an error for timed-out builds? As of now, once a build times out, we delete all corresponding resources and update the build status, so there is no value in returning a timeout error to the caller (which would put the item back in the queue again).
The contract would then become: for successful/completed/timed-out builds, syncHandler removes the item from the queue.
We need to remove this line to accomplish this.
@ImJasonH @zrss : WDYT?
Thanks for the reply @shashwathi. Sure, we should remove this line:
https://github.com/knative/build/blob/2d57fac6d8d294b5c82bcb70df32ef16d817d5c3/pkg/controller/build/controller.go#L306
as it won't call Forget() when the build ends with a timeout.
I wondered about the retry mechanism, so I checked the code, and I found that:
- Forget() just clears the failure statistics of the key (info we can use for retry-count checks or backoff-time calculation)
- Done() removes the key from the queue
And if the same key is later added via AddRateLimited
https://github.com/knative/build/blob/2d57fac6d8d294b5c82bcb70df32ef16d817d5c3/pkg/controller/build/controller.go#L235-L243
it will be delayed by a time computed from the failure statistics.
In a word, it seems that returning an error without calling Forget() won't automatically trigger a re-enqueue.
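A toy model of that bookkeeping (this `miniQueue` is a deliberately simplified sketch of the described semantics, not the real client-go implementation, which also handles de-duplication, backoff timing, and concurrency):

```go
package main

import "fmt"

// miniQueue models just the bookkeeping described above: Forget() clears
// the per-key failure count, Done() removes the key from the processing
// set, and neither touches the queue itself.
type miniQueue struct {
	queue      []string
	processing map[string]bool
	failures   map[string]int
}

func newMiniQueue() *miniQueue {
	return &miniQueue{processing: map[string]bool{}, failures: map[string]int{}}
}

// Get pops the next key and marks it as being processed.
func (q *miniQueue) Get() string {
	key := q.queue[0]
	q.queue = q.queue[1:]
	q.processing[key] = true
	return key
}

func (q *miniQueue) Add(key string)    { q.queue = append(q.queue, key) }
func (q *miniQueue) Done(key string)   { delete(q.processing, key) }
func (q *miniQueue) Forget(key string) { delete(q.failures, key) }

// AddRateLimited bumps the failure count used to compute backoff before
// re-adding the key.
func (q *miniQueue) AddRateLimited(key string) {
	q.failures[key]++
	q.Add(key)
}

func main() {
	q := newMiniQueue()
	q.Add("default/b")
	key := q.Get()
	q.AddRateLimited(key) // sync returned an error: backoff state grows
	q.Forget(key)         // failure stats cleared, queue untouched
	q.Done(key)           // removed from the processing set only
	fmt.Println(len(q.queue), len(q.processing), len(q.failures)) // prints 1 0 0
}
```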
I found something new about this topic after some effort:
- The rate-limiting queue has these APIs (and no remove or delete, apart from shutdown):
  - Add
  - Get
  - Forget
  - Done
- Forget() deletes the `failures` data of a work item, and Done() deletes a work item from the `processing` set. Sorry for my earlier mistake: neither method changes the actual queue contents.
- We can't remove an item from the workqueue, as it doesn't have a remove or delete API. The only method that removes an item from the queue is Get, but we can't just Get all items in the workqueue to find a useless timeout-check work item and put all the other normal work items back; that's not a good thing to do, WDYT?
- I think one more item in the workqueue for timeout checking is not a big deal for the normal operation of the controller. Although a build timeout is a rare case in a production env, what about a dev env? We only add one more chance to quickly check whether the build has timed out; it's worth doing and it's harmless.
- There is also the case where a user updates the timeout of a build; we should handle that too.
- Currently, the build controller doesn't handle delete events from the informer? Why? Maybe I could open a new issue to discuss that; I suspect there is some cleanup we should do when a build is deleted (another story).
So I propose these changes:
- re-enqueue the key of the build object (with a build.Spec.Timeout delay) after the build executes
- handle the case where the timeout of a build is updated (in a new updateBuild method)
You can see these changes in https://github.com/knative/build/pull/333
Still open for discussion ~
kindly ping @ImJasonH , @shashwathi , @mattmoor
A simpler solution might be to do the following:
- after adding the Build to the workqueue, also use AddAfter() to enqueue it again after the timeout duration
- if the build is successful, calling Done() will remove both items from the workqueue
- if the build has an error, it will be retried, but only until it reaches the timeout
- if the build times out, it will be immediately removed
@shashwathi ?
Thanks @zrss for detailed comment about workqueue. That was very helpful to understand underlying operations of Queue.
In the spirit of containing the discussion to reducing the build timeout granularity from 30s to something less, I am not in favor of the 2nd suggestion @zrss has proposed. I would recommend a separate issue for updating the build timeout.
I like the idea of enqueuing the build with timeout delay.
Another solution: the current 30s sync period (syncHandler) is hard-coded in the build controller. If an operator wants more granularity when syncing builds, we could make this sync time configurable in the build controller. I am proposing this because it would produce a similar end result to adding the build with a timeout delay, i.e., forcing build objects to sync more frequently.