heroic
heroic copied to clipboard
Analyse Heroic and user's perspective when hitting a timeout. Then implement necessary changes.
Implement Findings
Use Case Resolved: unknown Bigtable timeout & retry behaviour
- I am a Heroic developer
- who wants to make sure that our timeout & retry changes will work as expected
- so that we don't have a disaster on our hands!
Design & Implementation Notes
- this ticket was spun up off of peterk & adam's chat w.r.t. upgrading the Bigtable java client lib
- this PR depends a lot on what’s discovered as part of “Discover how Heroic reacts to BigtableRetriesExhaustedException” (Issue #741), for which you should check out the Heroic API Request Timeout Settings document for the latest updates/findings
- currently it’s looking like we want a 6s timeout across the board
- but we don’t know why 65 data channels are created and why there are 72 timeouts logged before the 6s timeout occurs
Concrete changes to make
- delete maxElapsedBackoffMs - it’s irrelevant for our Use Case (double check with Adam about his as it’s still in his doc)
- upgrade the Bigtable java lib from 1.12.1 to 1.18.1 and check that we see 2x retries before the 6s timeout (see Slack thread with Adam / testing Doc). We know we need this as it resolves 2 or 3 retry-related bugs.
- fine-tune all our *TimeoutMs and *TimeoutMillis settings
we finally know how API clients will experience the BT Timeout :
- they’ll get a 200. Which is … interesting. And misleading IMO.
- they get no results. Well, at least for the query I was running.
- they get the following error message in the response body in errors[0].error : "Some fetches failed (100) or were cancelled (870), caused by Some fetches failed (100) or were cancelled (870)"
we (Prism) need to decide if the above is acceptable. @malish8632 's argument is that this is the current behaviour and always has been, hence no changes are necessary.
My argument is that many more users who've never received a timeout before will now receive one, in the shape of a 200, which is super misleading cos it actually failed, in effect.
Hence the agreed-upon action is to phrase the 1, 2, 3 above as a "reminder" of how heroic returns requests that time out.