Hbase-28014: Transient underlying HDFS failure causes permanent replciation failure between 2 HBase Clusters
See HBase-28014 for details on the symptom and diagnostic.
The fix of HBase-28014 is implemented. In my local machine, it is able to pass all test cases.
I add some configurable retry mechanism to tolerate possible underlying HDFS failure.
Actually, I could add a unit test right now. However, I need to add a Fault Injector class to inject the IOException or override the replicationSourceManager and dynamically include the overriding one in the unit test. I am wondering if that would be a good convention.
Any comments and suggestions would be appreciated.
Thanks for opening a PR. In HBase, we usually first open a PR against the master branch, and then cherry-pick to other branches.
So please open a PR against master branch? Branch-2.0 has been EOL for a long time, I do not think the problem will only affect branch-2.0?