
indexserver: backoff repo if failing repeatedly

gl-srgr opened this issue 3 years ago · 2 comments

GH issue: https://github.com/sourcegraph/sourcegraph/issues/42011

Use case: if a repository's indexing attempts fail repeatedly, then we should treat it as if subsequent indexing attempts will also fail.

This is a WIP since I'd like to open discussion with the team before committing to a solution. The current solution (as of 221012) is to prevent pushing an item onto the priority queue's heap if the repository has failed one or more of its preceding indexing attempts. The duration for which indexing attempts are ignored for a given repository increases linearly (up to a configurable ceiling) as consecutive failures increase. We track additional information in the queue and queueItem to manage the backoff.
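For illustration only, here is a minimal Go sketch of the linear-backoff-with-ceiling idea described above. The identifiers (`backoffState`, `backoffIncrement`, `backoffCeiling`, `fail`, `ok`, `succeed`) are hypothetical and do not come from this PR; the actual state lives on the queue and queueItem.

```go
// Package backoff is an illustrative, stand-alone sketch; it is not the
// actual zoekt indexserver code.
package backoff

import "time"

// backoffState is a hypothetical per-repository record the queue could keep
// to decide whether a repo may be pushed onto the heap.
type backoffState struct {
	consecutiveFailures int
	allowedAfter        time.Time // earliest time we will enqueue this repo again
}

// Hypothetical tuning knobs; in practice the ceiling would be configurable.
const (
	backoffIncrement = 10 * time.Minute
	backoffCeiling   = 4 * time.Hour
)

// fail records a failed indexing attempt and extends the backoff window
// linearly with the number of consecutive failures, capped at the ceiling.
func (b *backoffState) fail(now time.Time) {
	b.consecutiveFailures++
	d := time.Duration(b.consecutiveFailures) * backoffIncrement
	if d > backoffCeiling {
		d = backoffCeiling
	}
	b.allowedAfter = now.Add(d)
}

// ok reports whether the repository may be pushed onto the queue's heap.
func (b *backoffState) ok(now time.Time) bool {
	return !now.Before(b.allowedAfter)
}

// succeed clears the backoff state after a successful index.
func (b *backoffState) succeed() {
	b.consecutiveFailures = 0
	b.allowedAfter = time.Time{}
}
```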

This solution is meant to be something we could ship fairly quickly, but we can certainly expand it into something more sophisticated (see the list below). Enhancements to the design (ranging from quick fixes to long-term considerations) should be discussed in this PR, and we can determine what we'd like to implement now versus defer to new GH tickets or RFCs.

Some design considerations/questions:

  1. We never apply indefinite backoff. Should we ever apply an indefinite backoff, such that a repository will not be indexed until some intervention by someone/something external to the queue? If so, how do we identify when a misbehaving repository is back in an indexable state?

    a. Example of automatic intervention: we detect that the index options changed and are optimistic that indexing will now succeed.

    b. Example of manual intervention: an admin runs a debug command similar to force index, but it forces an enqueue (and clears the backoff state associated with the given repo's queueItem).

  2. Currently we will not attempt to index until the backoff duration has passed. Should we consider removing the backoff in response to some changes external to indexserver?

    a. Same as 1.a: we detect that the index options changed and are optimistic that indexing will now succeed.

    b. On one hand, this could improve the time to resolution, i.e. how quickly we get back to an indexed repo.

    c. On the other hand, we won't know whether a given change will resolve anything. Perhaps we maintain the consecutive-failures count but reset the duration.

  3. Should we move the handling of backoff out of the queue structure?

    a. This could mean tracking which repos to back off from in Server.

    b. Alternatively, we could use a second priority queue whose min-heap prioritizes the repo with the earliest reinstate time; every Sync-with-Sourcegraph cycle we just pop until peek gives a reinstate time in the future.

  4. Should we persist which repos we identified as misbehaving to disk, so that on restart we continue the backoff strategy where it left off? Currently the queue is wiped on shutdown, so we lose this insight on startup.

  5. Should we consider non-consecutive failures in our decision to back off? E.g. alternating between failure and success means wasted effort, and we'd still consider that success rate misbehavior.

  6. Allow admins to manually apply backoff via the site admin page or another user-friendly GUI?

    a. This is a long-term enhancement that we likely need to spin off into a separate ticket.

  7. Implement an interface for queueItem so that, at startup, the queue processes implementations that each differ in backoff strategy, selected by a flag or env var (see the sketch after this list).

    a. This is a long-term enhancement that we likely need to spin off into a separate ticket.

    b. I see value in this because each customer might have different preferences. Instead of trying to narrow down a best-effort estimate of a universal backoff strategy, we could have several options available to choose from.

    c. Some options we could provide implementations for:

       i. linear backoff
       ii. exponential backoff
       iii. no backoff
       iv. (i.) or (ii.) with indefinite backoff allowed

    d. If we implement (7.) we could even have each repository implement its own backoff strategy (again, this would be a long term enhancement to complete in a separate ticket).
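To make (7.) a bit more concrete, here is a rough Go sketch of what such a strategy interface could look like. The names (`Policy`, `Linear`, `Exponential`, `None`) are hypothetical and not taken from zoekt; this is only one way to slice the abstraction.

```go
package backoff

import "time"

// Policy is a hypothetical interface each backoff strategy would implement;
// the queue would ask it how long to ignore a repo after N consecutive failures.
type Policy interface {
	// NextDelay returns how long to wait after the given number of
	// consecutive failures. Zero means "retry immediately".
	NextDelay(consecutiveFailures int) time.Duration
}

// Linear grows the delay by Step per failure, capped at Ceiling.
type Linear struct{ Step, Ceiling time.Duration }

func (l Linear) NextDelay(failures int) time.Duration {
	d := time.Duration(failures) * l.Step
	if d > l.Ceiling {
		return l.Ceiling
	}
	return d
}

// Exponential doubles the delay for each additional failure, capped at Ceiling.
type Exponential struct{ Base, Ceiling time.Duration }

func (e Exponential) NextDelay(failures int) time.Duration {
	if failures <= 0 {
		return 0
	}
	d := e.Base
	for i := 1; i < failures; i++ {
		d *= 2
		if d >= e.Ceiling {
			return e.Ceiling
		}
	}
	return d
}

// None disables backoff entirely.
type None struct{}

func (None) NextDelay(int) time.Duration { return 0 }
```

Indefinite backoff (option iv) could then be layered on top of any policy, e.g. by returning a sentinel delay once the failure count crosses a threshold.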

gl-srgr avatar Oct 13 '22 04:10 gl-srgr

Is this ready for review? I just responded to your test change.

FYI it can be tricky to test time-based stuff through the queue interface. To get more test confidence you can try extracting the backoff stuff into its own struct, then unit testing that. Then you can treat it more like a black box inside the queue and keep the testing there more minimal (e.g. just test that when it says no, it is no; that it gets backoff updates; etc.).
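Purely as an illustration of this testing approach: if the extracted struct takes the current time as a parameter (like the hypothetical `backoffState` in the sketch further up), a unit test can drive it with explicit timestamps and never sleep.

```go
package backoff

import (
	"testing"
	"time"
)

// TestBackoffWindow drives the hypothetical backoffState from the earlier
// sketch with explicit timestamps, so no real time has to pass.
func TestBackoffWindow(t *testing.T) {
	start := time.Date(2022, 10, 26, 0, 0, 0, 0, time.UTC)
	var b backoffState

	b.fail(start)
	if b.ok(start) {
		t.Fatal("expected repo to be backed off right after a failure")
	}

	// Jump past the ceiling instead of sleeping.
	later := start.Add(backoffCeiling + time.Minute)
	if !b.ok(later) {
		t.Fatal("expected backoff window to have expired")
	}

	b.succeed()
	if !b.ok(start) {
		t.Fatal("expected no backoff after a successful index")
	}
}
```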

Thanks for the feedback @keegancsmith! I will add backoff unit tests. As of now the backoff stuff is extracted into its own struct, which I'll test, but if there is any suggestion re: further extraction, let me know. I agree that sleep is not ideal; it compromises predictability and confidence as we test across environments (the CI environment being the most important). If we decide to keep some tests and discard the flaky ones, I'd be okay with that. As you called out in the other comment, the tests vary quite a bit from reliable to unpredictable/flaky, so ideally we keep some just for the extra code coverage.

gl-srgr avatar Oct 26 '22 02:10 gl-srgr

@keegancsmith @stefanhengl I removed one of the flaky tests, and unit tests were added earlier. I'm open to removing other queue tests that might be flaky.

Btw, this ticket is not as high priority as the other work atm, so feel free to defer this to next week if y'all have fewer cycles this week and early next week.

gl-srgr avatar Oct 28 '22 03:10 gl-srgr