flink-connectors
Improve error handling for transactional writer commit
Problem description
We use a two-phase commit algorithm with Flink checkpoints and the Pravega transactional writer to implement the end-to-end exactly-once feature; see #5 for details. In the second phase, we call transaction.commit for the final checkpoint commit, but errors from this call are not handled, so we may lose data when the commit call does not complete.
The commit call throws a TxnFailedException if something goes wrong. This can happen in either of two situations:
- Server accepted the request but there was some problem which caused the failure.
- Server failed to even accept the request.
For the second case, which we can tell from the status of the transaction, the client should retry the commit to avoid data loss.
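The retry decision described above can be sketched as follows. This is only an illustration under stated assumptions, not the connector's implementation: the Status, TxnFailedException, and Txn types are hypothetical stand-ins modeled loosely on the Pravega client, and the mapping from status to action (OPEN means the request was never accepted, COMMITTED means the commit took effect) is an assumption that should be checked against the actual server behavior.

```java
class CommitRetry {
    /** Hypothetical stand-in for the transaction status reported by the server. */
    enum Status { OPEN, COMMITTING, COMMITTED, ABORTING, ABORTED }

    /** Hypothetical stand-in for the checked exception thrown by a failed commit. */
    static class TxnFailedException extends Exception {
        TxnFailedException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the part of the transaction API used here. */
    interface Txn {
        void commit() throws TxnFailedException;
        Status checkStatus();
    }

    /**
     * Commits the transaction, retrying only while its status suggests that
     * the server never accepted the request. Returns the number of commit
     * attempts made.
     */
    static int commitWithRetry(Txn txn, int maxAttempts) throws TxnFailedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                txn.commit();
                return attempt;
            } catch (TxnFailedException e) {
                Status status = txn.checkStatus();
                if (status == Status.COMMITTED) {
                    // Assumption: the commit took effect despite the exception.
                    return attempt;
                }
                if (status != Status.OPEN || attempt >= maxAttempts) {
                    // Either the server accepted the request and it failed,
                    // or the retry budget is exhausted: surface the error.
                    throw e;
                }
                // Assumption: OPEN means the request was never accepted,
                // so the loop tries the commit again.
            }
        }
    }
}
```

A bounded attempt count keeps the checkpoint completion path from blocking indefinitely; a real implementation would likely also add a backoff between attempts and log each failed attempt.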
With https://github.com/pravega/pravega/issues/4822 fixed, the Flink connector can handle such cases better.
Problem location
FlinkPravegaWriter
Suggestions for an improvement
Some debug logs can be added for better monitoring.