fdb-record-layer
fdb-record-layer copied to clipboard
Allow cursors to stop rather than throw an error on transaction timeouts or retriable errors
For some use cases, it would be useful to have an ExecuteState option that would allow cursors to stop when they hit a transaction timeout (at least) or perhaps any retriable FDB exception rather than rethrowing the error. For example, in the case where one is doing an inconsistent read, it would make sense to do something where the transaction is allowed to just read until it hits the five second transaction time limit, then restart the transaction from there.
I would propose adding a new option to the ExecuteProperties à la failOnScanLimitReached. If that option is set, then the idea is that cursors should catch any transaction time outs (or perhaps any retriable errors in general) and return a RecordCursorResult with a new NoNextReason, something like TRANSACTION_TIMEOUT or RETRIABLE_ERROR, which indicates that the scan terminated early. (This NoNextReason would be an out-of-band exception just like, say, the TIME_LIMIT_REACHED exception.) The user could then resume execution from where the last cursor left off.
I think adding support for this new termination condition could be done by only editing a few different places, chiefly the KeyValueCursor and maybe some of the other cursors that talk directly to FDB rather than going through the KeyValueCursor. This is more-or-less the same set of places that might use the CursorScanLimiter anyway, I think.
There are some possible draw backs to this, mainly that if there's some kind of problem where FDB is down or has an incompatible version, it can result in retriable errors, which may just mean that the user would want to limit the number of retries that are done on a RETRIABLE_ERROR. However, plumbing through the error may be difficult with the current RecordCursorResult API (as the NoNextReason is just an enum), so the best we may be able to get is logging the error.
A few things came up during a discussion about this with some members of the team:
- There are probably a few other places that we'd need to consider adding error checking. For example, the
MapPipelinedCursorcould hit atransaction_too_olderror when resolving a record, so it's possible that any cursor that would ever do database work in any of its callbacks could hit this - It's possible that much the same thing could be achieved with an an "error translating cursor" that took a base cursor and had an
onNextmethod that caught atransaction_too_old(or whatever) and translated the cursor element into aRecordCursorResultwith the properNoNextReason. This could also limit the surface area of this change by limiting the error catching to a single place. - Another place where this would be useful would be the
AutoContinuingCursor, which we may want to let catch these errors and restart a cursor if it sees them. This could also be combined with an error catching cursor, with theAutoContinuingCursorwrapping its produced cursors in the error-catching cursor