dm Support syncing when disk is very slow

Feature Request

Is your feature request related to a problem? Please describe:

A user deployed DM in an environment with a very slow disk. This helped reveal several DM bugs, such as:

[ ] https://github.com/pingcap/dm/issues/1377
[ ] the worker received a bound watch but failed to read the bound information from etcd, and it neither retried nor killed itself
[ ] query-status shows nothing, yet a task can't be added because it already exists (the log does not contain enough information)

Describe the feature you'd like:

[ ] expose more etcd errors and metrics (already in https://github.com/pingcap/dm/issues/1219, https://github.com/pingcap/dm/issues/1218), and warn when the disk is bad
[ ] test DM on a slow disk

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

Jan 18 '21 10:01 lance6716

I have used Chaos Mesh to try to imitate a bad-disk environment, but it was not effective at revealing bugs. We might try using failpoint to inject failures into the etcd API with some percent probability, and check whether that causes inconsistency in DM.

@zeminzhou

Feb 04 '21 11:02 lance6716

(removed the BUG label because we need to investigate further whether it has been fixed)

Apr 09 '21 01:04 lance6716