Improve error handling during failed modification

Open jasonmp85 opened this issue 11 years ago • 0 comments

pg_shard's modification logic assumes that any total failure is due to something transient that a retry might overcome. In many cases, though, an INSERT or UPDATE fails because of a constraint check, which a simple retry cannot overcome unless something else changes.

See #31 for an example of what I mean. In its example, the client sees:

# WARNING:  Bad result from shard1.demo:5432
# DETAIL:  Remote message: duplicate key value violates unique constraint "members_id_key_10007"
# WARNING:  Bad result from shard8.demo:5432
# DETAIL:  Remote message: duplicate key value violates unique constraint "members_id_key_10007"
# ERROR:  could not modify any active placements

A well-written application might want to handle the uniqueness violation in a special fashion, but all pg_shard gives it is a generic error about not being able to modify any placements.

We probably want to try a modification on a placement, then:

If the error is in the class of things we think a user cares about (constraints, etc.), we fail fast and throw them the error.
If the error is network-related or otherwise "transient", we continue with the remaining placements. If any modification completes, we mark the placement that had the transient failure as bad.
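
The two rules above can be sketched roughly as follows. This is only an illustration in Python, not pg_shard's actual code: `Placement`, `ApplicationError`, `TransientError`, and `modify_placements` are hypothetical names made up for the example, and the `run` callback stands in for whatever sends the statement to one placement.

```python
from dataclasses import dataclass
from typing import Callable, List


class ApplicationError(Exception):
    """Hypothetical: the remote DB ran the statement and rejected it (constraint, etc.)."""


class TransientError(Exception):
    """Hypothetical: the statement never completed (network failure or similar)."""


@dataclass
class Placement:
    node: str
    healthy: bool = True


def modify_placements(placements: List[Placement],
                      run: Callable[[Placement], None]) -> int:
    """Try the modification on each placement and return how many succeeded."""
    transient_failures: List[Placement] = []
    succeeded = 0

    for placement in placements:
        try:
            run(placement)
        except ApplicationError:
            # Something the user cares about: fail fast with the original error.
            raise
        except TransientError:
            # Possibly recoverable: remember it and keep trying other placements.
            transient_failures.append(placement)
        else:
            succeeded += 1

    if succeeded == 0:
        # Nothing completed anywhere, so the generic error is appropriate.
        raise RuntimeError("could not modify any active placements")

    # At least one placement was modified, so the ones that failed
    # transiently are now out of date and get marked as bad.
    for placement in transient_failures:
        placement.healthy = False
    return succeeded
```

With this shape, the uniqueness violation from #31 would reach the client as the original error instead of being folded into the generic one.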

At a higher level, we need to handle modification outcomes in a ternary fashion:

Total Success: the modification completed successfully
Application Failure: the request reached the remote DB and a response came back, but the remote DB raised an error
Infrastructure Failure: the modification didn't even complete, or did so with a network error

Only the third case deserves a "could not modify placement" error. In the second we can fail fast and tell the user what happened.
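
One way to picture the three-way split is to classify each attempt by its SQLSTATE. The sketch below is an assumption about how that could look, not an existing pg_shard interface: `Outcome` and `classify_outcome` are hypothetical names. It relies only on standard PostgreSQL codes: `00000` is successful completion, class `08` covers connection exceptions, and class `23` (for example `23505`, unique_violation) covers integrity constraint violations.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    TOTAL_SUCCESS = "total success"
    APPLICATION_FAILURE = "application failure"
    INFRASTRUCTURE_FAILURE = "infrastructure failure"


def classify_outcome(completed: bool, sqlstate: Optional[str]) -> Outcome:
    """Map one modification attempt onto the three outcomes.

    completed: whether a response came back from the remote DB at all.
    sqlstate:  the five-character SQLSTATE from the response, if any.
    """
    if not completed:
        # No response at all: the modification didn't even complete.
        return Outcome.INFRASTRUCTURE_FAILURE
    if sqlstate is None or sqlstate == "00000":
        return Outcome.TOTAL_SUCCESS
    if sqlstate.startswith("08"):
        # Class 08 is "connection exception"; treated as infrastructure here.
        return Outcome.INFRASTRUCTURE_FAILURE
    # Everything else (constraint violations and so on) is reported to the user.
    return Outcome.APPLICATION_FAILURE
```

Which other SQLSTATE classes should count as transient is a judgment call; this sketch deliberately treats only class `08` that way.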

Dec 26 '14 18:12 jasonmp85