clusterfuzz
"Redo task" not returning result after 4 days
Steps:
- Click on Redo Task(s)
- Check "Check if bug still reproduces"
- Submit form
- Notice that log says "Redo task(s) progression" (this is a confusing message for me as a user, because I don't know what the progression task is.)
- Wait for 4 days
What happens: Nothing
Expected: Bug is updated with results within a few hours
Examples: https://clusterfuzz.com/testcase-detail/4915529786392576 https://clusterfuzz.com/testcase-detail/5071253153841152 https://clusterfuzz.com/testcase-detail/4536267162058752 https://clusterfuzz.com/testcase-detail/4587131707392000
Maybe the system could output an expected wait time? Even so, I think 4 days is too slow... I'm guessing it has a priority queue and a human asking to "Redo task" doesn't rise high enough in the queue.
Or maybe an updating counter like "there are 357 jobs in front of this one", so that it's clear something is happening.
@oliverchang - can you please take a look and check whether this is related to the recent bump in bot quota?
Bumping up the quota looks to have actually fixed things somewhat; I see tasks getting picked up now.
Looks like this is because many existing instances are hanging again. I'll restart them.
We need to auto-restart non-preemptible instances after some time, or add some kind of health check to do this.
Sorry for the trouble @aleventhal, we'll make sure to fix this going forward.
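The health check idea above could look roughly like the following. This is only a minimal sketch under stated assumptions, not how ClusterFuzz does it: the `Instance` record, the `find_stale_instances` helper, and the 6-hour threshold are all hypothetical names and values made up for illustration. It flags non-preemptible instances that have not reported a heartbeat recently, so that something else can restart them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Instance:
    """Hypothetical bot record; not a real ClusterFuzz type."""
    name: str
    preemptible: bool
    last_heartbeat: datetime


def find_stale_instances(instances, now, max_silence=timedelta(hours=6)):
    """Return names of non-preemptible instances silent for longer than max_silence.

    Preemptible instances are skipped because they get recycled anyway.
    The 6-hour default is an arbitrary assumption.
    """
    return [
        inst.name
        for inst in instances
        if not inst.preemptible and now - inst.last_heartbeat > max_silence
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    bots = [
        Instance("bot-a", False, now - timedelta(hours=7)),
        Instance("bot-b", False, now - timedelta(hours=1)),
        Instance("bot-c", True, now - timedelta(hours=48)),
    ]
    # Only the hung non-preemptible bot should be flagged for restart.
    print(find_stale_instances(bots, now))
```

A periodic job could call this and restart whatever it returns; how the heartbeat gets recorded and how the restart is issued are left out on purpose.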
Thanks, no problem!
What did you think of my other suggestions:
- A more meaningful message for "Progression task started"
- Some kind of status where it is in the queue
Three of them still seem stuck: https://clusterfuzz.com/testcase-detail/4915529786392576 https://clusterfuzz.com/testcase-detail/5844125396828160 https://clusterfuzz.com/testcase-detail/4587131707392000
All of these are actually stuck because the testcases are no longer considered valid (fixed == NA) due to some issue with reproduction. Our testcase messages don't do a good job of indicating that we skip these.
Note that in such cases, we should also auto-close the associated monorail issue after 14 days (if we don't see other similar crashes happening in that time).
We need to fix this (adding a message in this case) in combination with https://github.com/google/clusterfuzz/issues/2360 to make this clearer.
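To make the 14-day rule above concrete, here is a rough sketch of the decision as described in this comment. It is an illustration only: `should_auto_close`, its parameters, and the idea of tracking a "marked invalid" timestamp are assumptions for the example and do not reflect the actual ClusterFuzz cleanup code.

```python
from datetime import datetime, timedelta


def should_auto_close(marked_invalid_at, last_similar_crash_at, now,
                      grace=timedelta(days=14)):
    """Decide whether an issue for an invalid testcase can be closed.

    Hypothetical helper. Closes only when the grace period has fully
    elapsed and no similar crash was seen since the testcase was
    marked invalid.
    """
    if now - marked_invalid_at < grace:
        # Still inside the waiting window; keep the issue open.
        return False
    # A similar crash at or after the invalid mark blocks closing.
    return last_similar_crash_at is None or last_similar_crash_at < marked_invalid_at


if __name__ == "__main__":
    t0 = datetime(2021, 6, 1)
    print(should_auto_close(t0, None, t0 + timedelta(days=15)))
    print(should_auto_close(t0, t0 + timedelta(days=5), t0 + timedelta(days=15)))
```

Whatever the real rule looks like, a user-facing message at the point where the redo task is skipped would make the wait understandable.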
What does it mean for a testcase to be invalid? Can't we still run the testcase and see if it crashes?