clusterfuzz "Redo task" not returning result after 4 days

Steps:

Click on Redo Task(s)
Check "Check if bug still reproduces"
Submit form
Notice that log says "Redo task(s) progression" (this is a confusing message for me as a user, because I don't know what the progression task is.)
Wait for 4 days

What happens: Nothing

Expected: Bug is updated with results within a few hours

Examples: https://clusterfuzz.com/testcase-detail/4915529786392576 https://clusterfuzz.com/testcase-detail/5071253153841152 https://clusterfuzz.com/testcase-detail/4536267162058752 https://clusterfuzz.com/testcase-detail/4587131707392000

Jul 12 '21 12:07 aleventhal

Maybe the system could output an expected wait time? Even so, I think 4 days is too slow... I'm guessing it has a priority queue and a human asking to "Redo task" doesn't rise high enough in the queue.

Jul 12 '21 12:07 aleventhal

Or maybe an updating counter like, "there are 357 jobs in front of this one", so that it's clear something is happening.

Jul 12 '21 12:07 aleventhal

@oliverchang - can you please take a look and check whether this is related to the recent bump in bots quota?

Jul 12 '21 14:07 inferno-chromium

The bumping up of quota looks to have actually fixed things somewhat -- I see tasks getting picked up now.

Jul 13 '21 00:07 oliverchang

Looks like this is because many existing instances are hanging again. I'll restart them.

We need to auto-restart non-preemptible instances after some time, or add some kind of health check to do this.

Sorry for the trouble @aleventhal, we'll make sure to fix this going forward.

Jul 13 '21 00:07 oliverchang
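
For illustration only, the health check idea above could be sketched as a staleness test on per-instance heartbeats. This is a minimal sketch, not ClusterFuzz code: the function name `find_hung_instances`, the 30-minute timeout, and the idea that each bot records a last-heartbeat timestamp are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: treat an instance as hung if it has been silent this long.
HEARTBEAT_TIMEOUT = timedelta(minutes=30)


def find_hung_instances(last_heartbeats, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the sorted names of instances whose last heartbeat is older than timeout.

    last_heartbeats maps an instance name to the datetime of its last heartbeat
    (an assumed data shape for this sketch).
    """
    return sorted(
        name for name, seen in last_heartbeats.items()
        if now - seen > timeout
    )


# Example data: one healthy bot, one that has been silent for six hours.
now = datetime(2021, 7, 13, 0, 0)
heartbeats = {
    "bot-a": now - timedelta(minutes=5),
    "bot-b": now - timedelta(hours=6),
}
hung = find_hung_instances(heartbeats, now)
print(hung)
```

A periodic job could restart whatever this returns, which would cover non-preemptible instances that never get recycled on their own.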

Thanks, no problem!

What did you think of my other suggestions:

A more meaningful message for "Progression task started"
Some kind of status where it is in the queue

Jul 13 '21 12:07 aleventhal

Three of them still seem stuck: https://clusterfuzz.com/testcase-detail/4915529786392576 https://clusterfuzz.com/testcase-detail/5844125396828160 https://clusterfuzz.com/testcase-detail/4587131707392000

Jul 13 '21 15:07 aleventhal

These are all stuck because the testcases are no longer considered valid (fixed == NA) due to some issue with reproduction. Our testcase messages don't do a good job of indicating that we skip these.

Note that in such cases, we should also auto-close the associated monorail issue after 14 days (if we don't see other similar crashes happening in that time).

We need to fix this (adding a message in this case) in combination with https://github.com/google/clusterfuzz/issues/2360 to make this clearer.

Jul 14 '21 07:07 oliverchang
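
As a rough sketch of the behaviour described above (skip testcases with fixed == NA, tell the user so, and auto-close after 14 quiet days), something like the following could work. The function names, argument shapes, and message wording are hypothetical and not taken from the ClusterFuzz codebase; only the "NA" value and the 14-day window come from the comment above.

```python
from datetime import datetime, timedelta

# The 14-day window is from the thread; everything else here is assumed.
AUTO_CLOSE_AFTER = timedelta(days=14)


def skip_message(fixed):
    """Return a user-facing explanation when a redo request is skipped, else None."""
    if fixed == "NA":
        return ("Progression task skipped: testcase is no longer "
                "considered valid (fixed == NA).")
    return None


def should_auto_close(fixed, invalid_since, similar_crash_times, now):
    """Decide whether the associated issue is eligible for auto-close.

    Eligible only if the testcase is invalid, at least 14 days have passed
    since it became invalid, and no similar crash was seen after that point.
    """
    if fixed != "NA":
        return False
    if now - invalid_since < AUTO_CLOSE_AFTER:
        return False
    return not any(t > invalid_since for t in similar_crash_times)


invalid_since = datetime(2021, 7, 1)
print(skip_message("NA"))
print(should_auto_close("NA", invalid_since, [], datetime(2021, 7, 16)))
```

Surfacing the `skip_message` text in the testcase log would make it clear that nothing is queued, rather than leaving the request looking stuck.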

What does it mean for a testcase to be invalid? Can't we still run the testcase and see if it crashes?

Jul 14 '21 14:07 aleventhal
