flink-connectors

Improve error handling for transactional writer commit

Open crazyzhou opened this issue 4 years ago • 0 comments

Problem description

We use a two-phase commit algorithm with Flink checkpoints and the Pravega transactional writer to implement the end-to-end exactly-once feature; see #5 for details. In the second phase, we call transaction.commit for the final checkpoint commit, but there is no error handling around that call, so we may lose data if the commit does not complete.

The commit call throws a TxnFailedException if something goes wrong. This can happen in one of two situations:

  1. Server accepted the request but there was some problem which caused the failure.
  2. Server failed to even accept the request.

The second case can be identified from the status of the transaction. In that case, the client should retry the commit to avoid data loss.
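The retry idea above could look roughly like the following. This is only a sketch under stated assumptions, not the connector's implementation: `Txn`, `Status`, `TxnFailedException`, and `FakeTxn` are local stand-ins that mirror the shape of the Pravega client types so the example compiles on its own, and `commitWithRetry` and `maxAttempts` are hypothetical names. It also assumes that a transaction still reporting `OPEN` after a failed commit means the server never accepted the request.

```java
// Sketch only: local stand-in types, not the real Pravega client classes.
class CommitRetrySketch {

    enum Status { OPEN, COMMITTING, COMMITTED, ABORTING, ABORTED }

    static class TxnFailedException extends Exception {
        TxnFailedException(String message) { super(message); }
    }

    interface Txn {
        void commit() throws TxnFailedException;
        Status checkStatus();
    }

    // Hypothetical helper: retries the commit only while the transaction is
    // still OPEN, which this sketch ASSUMES means the request was not accepted.
    static Status commitWithRetry(Txn txn, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TxnFailedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                txn.commit();
                return txn.checkStatus();
            } catch (TxnFailedException e) {
                last = e;
                Status status = txn.checkStatus();
                if (status == Status.COMMITTING || status == Status.COMMITTED) {
                    return status; // the server accepted the commit after all
                }
                if (status != Status.OPEN) {
                    // Case 1: accepted but failed (aborting/aborted); retrying cannot help.
                    throw new IllegalStateException("commit failed, status=" + status, e);
                }
                // Case 2 (assumed): still OPEN, so loop and try again.
            }
        }
        throw new IllegalStateException("commit not accepted after " + maxAttempts + " attempts", last);
    }

    // Test double: rejects the first `failures` commits, reporting
    // `statusWhileFailing` in the meantime, then commits successfully.
    static class FakeTxn implements Txn {
        int attempts = 0;
        private final int failures;
        private final Status statusWhileFailing;
        private boolean committed = false;

        FakeTxn(int failures, Status statusWhileFailing) {
            this.failures = failures;
            this.statusWhileFailing = statusWhileFailing;
        }

        @Override
        public void commit() throws TxnFailedException {
            attempts++;
            if (attempts <= failures) {
                throw new TxnFailedException("simulated failure #" + attempts);
            }
            committed = true;
        }

        @Override
        public Status checkStatus() {
            return committed ? Status.COMMITTED : statusWhileFailing;
        }
    }

    public static void main(String[] args) {
        FakeTxn txn = new FakeTxn(2, Status.OPEN);
        Status result = commitWithRetry(txn, 5);
        System.out.println("status=" + result + ", attempts=" + txn.attempts);
    }
}
```

A real implementation would likely add a backoff between attempts and log each retry; both are left out here to keep the control flow visible.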

With https://github.com/pravega/pravega/issues/4822 fixed, the Flink connector can handle such cases better.

Problem location

FlinkPravegaWriter

Suggestions for an improvement

Some debug logs can be added for better monitoring.

crazyzhou · May 22 '20 05:05