fdb-record-layer icon indicating copy to clipboard operation
fdb-record-layer copied to clipboard

Allow cursors to stop rather than throw an error on transaction timeouts or retriable errors

Open alecgrieser opened this issue 3 years ago • 1 comments

For some use cases, it would be useful to have an ExecuteState option that would allow cursors to stop when they hit a transaction timeout (at least) or perhaps any retriable FDB exception rather than rethrowing the error. For example, in the case where one is doing an inconsistent read, it would make sense to do something where the transaction is allowed to just read until it hits the five second transaction time limit, then restart the transaction from there.

I would propose adding a new option to the ExecuteProperties à la failOnScanLimitReached. If that option is set, then the idea is that cursors should catch any transaction time outs (or perhaps any retriable errors in general) and return a RecordCursorResult with a new NoNextReason, something like TRANSACTION_TIMEOUT or RETRIABLE_ERROR, which indicates that the scan terminated early. (This NoNextReason would be an out-of-band exception just like, say, the TIME_LIMIT_REACHED exception.) The user could then resume execution from where the last cursor left off.

I think adding support for this new termination condition could be done by only editing a few different places, chiefly the KeyValueCursor and maybe some of the other cursors that talk directly to FDB rather than going through the KeyValueCursor. This is more-or-less the same set of places that might use the CursorScanLimiter anyway, I think.

There are some possible draw backs to this, mainly that if there's some kind of problem where FDB is down or has an incompatible version, it can result in retriable errors, which may just mean that the user would want to limit the number of retries that are done on a RETRIABLE_ERROR. However, plumbing through the error may be difficult with the current RecordCursorResult API (as the NoNextReason is just an enum), so the best we may be able to get is logging the error.

alecgrieser avatar Mar 13 '21 00:03 alecgrieser

A few things came up during a discussion about this with some members of the team:

  1. There are probably a few other places that we'd need to consider adding error checking. For example, the MapPipelinedCursor could hit a transaction_too_old error when resolving a record, so it's possible that any cursor that would ever do database work in any of its callbacks could hit this
  2. It's possible that much the same thing could be achieved with an an "error translating cursor" that took a base cursor and had an onNext method that caught a transaction_too_old (or whatever) and translated the cursor element into a RecordCursorResult with the proper NoNextReason. This could also limit the surface area of this change by limiting the error catching to a single place.
  3. Another place where this would be useful would be the AutoContinuingCursor, which we may want to let catch these errors and restart a cursor if it sees them. This could also be combined with an error catching cursor, with the AutoContinuingCursor wrapping its produced cursors in the error-catching cursor

alecgrieser avatar Mar 18 '21 21:03 alecgrieser