
"Redo task" not returning result after 4 days

Open aleventhal opened this issue 3 years ago • 9 comments

Steps:

  1. Click on Redo Task(s)
  2. Check "Check if bug still reproduces"
  3. Submit form
  4. Notice that log says "Redo task(s) progression" (this is a confusing message for me as a user, because I don't know what the progression task is.)
  5. Wait for 4 days

What happens: Nothing

Expected: Bug is updated with results within a few hours

Examples: https://clusterfuzz.com/testcase-detail/4915529786392576 https://clusterfuzz.com/testcase-detail/5071253153841152 https://clusterfuzz.com/testcase-detail/4536267162058752 https://clusterfuzz.com/testcase-detail/4587131707392000

aleventhal avatar Jul 12 '21 12:07 aleventhal

Maybe the system could output an expected wait time? Even so, I think 4 days is too slow... I'm guessing it has a priority queue and a human asking to "Redo task" doesn't rise high enough in the queue.

aleventhal avatar Jul 12 '21 12:07 aleventhal

Or maybe an updating counter like "there are 357 jobs in front of this one", so that it's clear something is happening.

aleventhal avatar Jul 12 '21 12:07 aleventhal

@oliverchang - can you please take a look and check whether this is related to the recent bump in bot quota?

inferno-chromium avatar Jul 12 '21 14:07 inferno-chromium

The quota bump looks to have actually fixed things somewhat -- I see tasks getting picked up now.

oliverchang avatar Jul 13 '21 00:07 oliverchang

Looks like this is because many existing instances are hanging again. I'll restart them.

We need to auto-restart non-preemptible instances after some time, or add some kind of health check to do this.
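
One way such a check could look, as a rough sketch only: the `Instance` record, the two thresholds, and the function names below are hypothetical and are not taken from the ClusterFuzz codebase. It flags any instance whose heartbeat has gone stale, and any non-preemptible instance that has been up longer than a fixed limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; neither value comes from this thread.
HEARTBEAT_TIMEOUT = timedelta(minutes=30)
MAX_UPTIME = timedelta(days=1)


@dataclass
class Instance:
    """Hypothetical record describing one bot instance."""
    name: str
    preemptible: bool
    started_at: datetime
    last_heartbeat: datetime


def needs_restart(instance: Instance, now: datetime) -> bool:
    """Return True if the instance looks hung or has been up too long."""
    # Health check: no heartbeat within the timeout means it is probably hung.
    if now - instance.last_heartbeat > HEARTBEAT_TIMEOUT:
        return True
    # Uptime limit applies only to non-preemptible instances, on the
    # assumption that preemptible ones get recycled on their own.
    if not instance.preemptible and now - instance.started_at > MAX_UPTIME:
        return True
    return False


def instances_to_restart(instances, now: datetime):
    """Return the names of instances that should be restarted, in order."""
    return [i.name for i in instances if needs_restart(i, now)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    fleet = [
        Instance("bot-a", False, now - timedelta(days=3), now),
        Instance("bot-b", True, now - timedelta(hours=1), now),
    ]
    print(instances_to_restart(fleet, now))
```

Keeping the decision in a pure function of the instance state and the current time makes it easy to test separately from whatever API actually performs the restart.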

Sorry for the trouble @aleventhal, we'll make sure to fix this going forward.

oliverchang avatar Jul 13 '21 00:07 oliverchang

Thanks, no problem!

What did you think of my other suggestions:

  • A more meaningful message for "Progression task started"
  • Some kind of status showing where it is in the queue

aleventhal avatar Jul 13 '21 12:07 aleventhal

Three of them still seem stuck: https://clusterfuzz.com/testcase-detail/4915529786392576 https://clusterfuzz.com/testcase-detail/5844125396828160 https://clusterfuzz.com/testcase-detail/4587131707392000

aleventhal avatar Jul 13 '21 15:07 aleventhal

All of these are actually stuck because the testcases are no longer considered valid (fixed == NA) due to some issue with reproduction. Our testcase messages don't do a good job of indicating that we skip these.

Note that in such cases, we should also auto-close the associated monorail issue after 14 days (if we don't see other similar crashes happening in that time).
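
The 14-day rule described above can be illustrated with a small sketch. The function and parameter names are hypothetical, and treating "that time" as the window starting when the testcase was marked invalid is an assumption, not confirmed behavior.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 14-day window is the figure mentioned in this comment.
AUTO_CLOSE_AFTER = timedelta(days=14)


def should_auto_close(
    marked_invalid_at: datetime,
    last_similar_crash_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Hypothetical check: close the issue once 14 quiet days have passed."""
    # Too early: the waiting window has not elapsed yet.
    if now - marked_invalid_at < AUTO_CLOSE_AFTER:
        return False
    # A similar crash seen since the testcase became invalid keeps it open.
    if last_similar_crash_at is not None and last_similar_crash_at >= marked_invalid_at:
        return False
    return True


if __name__ == "__main__":
    invalid_since = datetime(2021, 7, 1)
    print(should_auto_close(invalid_since, None, datetime(2021, 7, 16)))
```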

We need to fix this (adding a message in this case) in combination with https://github.com/google/clusterfuzz/issues/2360 to make this clearer.

oliverchang avatar Jul 14 '21 07:07 oliverchang

What does it mean for a testcase to be invalid? Can't we still run the testcase and see if it crashes?

aleventhal avatar Jul 14 '21 14:07 aleventhal