trxdb-loader should handle temporary TiKV failures
PD dynamically (re-)assigns the Raft leaders that handle certain regions. When this happens during writes, a write can fail with a region miss or leader update error. Such cases should be handled (not sure how yet; probably just retry the batch put?)
Jul 01 18:17:26 eos-trxdb11 dfuseeos[11633]: time="2020-07-01T18:17:26Z" level=info msg="drop regions that on the store 8(172.22.3.201:20160) due to send request fail, err: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jul 01 18:17:33 eos-trxdb11 dfuseeos[11633]: 2020-07-01T18:17:33.624Z (trxdb-loader) 5sec AVG INSERT RATE (trxdb-loader/loader.go:316){"block_num": 128000596, "last_tick_block": 128000547, "block_sec": 1.75}
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: time="2020-07-01T18:18:11Z" level=warning msg="backoffer.maxSleep 20000ms is exceeded, errors:\nnot leader: region_id:152233 , ctx: &{{152233 233 17} id:152233 start_key:\"eos-v1-trxdb;\\005\\324\\177a\\264hj
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: 2020-07-01T18:18:11.704Z (launcher)
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: ################################################################
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: Fatal error in app trxdb-loader:
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: process block failed: Leaving ProcessBlock on failed flushAllMutations: regionMiss
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: ################################################################
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: (launcher/launcher.go:174)
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: 2020-07-01T18:18:11.705Z (dfuse) Application trxdb-loader shutdown unexpectedly, quitting (cli/start.go:116)
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: Error: unable to launch: process block failed: Leaving ProcessBlock on failed flushAllMutations: regionMiss
Jul 01 18:18:11 eos-trxdb11 dfuseeos[11633]: 2020-07-01T18:18:11.705Z (cli) dfuse (derr/cli.go:25){"error": "unable to launch: process block failed: Leaving ProcessBlock on failed flushAllMutations: regionMiss"}
Jul 01 18:18:12 eos-trxdb11 systemd[1]: dfuse.service: Main process exited, code=exited, status=1/FAILURE
Jul 01 18:18:12 eos-trxdb11 systemd[1]: dfuse.service: Failed with result 'exit-code'.
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: time="2020-07-07T23:23:31Z" level=warning msg="backoffer.maxSleep 20000ms is exceeded, errors:\nregion not found for key \"eos-v1-trxdb;\\x05g\\x9fÑv\\xb5\\xe7H@\\xb0\\xb90L\\x0e\\xc3\\xfb\\xf1\\x17\\b`w\\xd
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: 2020-07-07T23:23:31.083Z (launcher)
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: ################################################################
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: Fatal error in app trxdb-loader:
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: process block failed: Leaving ProcessBlock on failed flushAllMutations: updateLeader
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: ################################################################
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: (launcher/launcher.go:174)
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: 2020-07-07T23:23:31.083Z (dfuse) Application trxdb-loader shutdown unexpectedly, quitting (cli/start.go:116)
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: Error: unable to launch: process block failed: Leaving ProcessBlock on failed flushAllMutations: updateLeader
Jul 07 23:23:31 eos-trxdb12d dfuseeos[11484]: 2020-07-07T23:23:31.083Z (cli) dfuse (derr/cli.go:25){"error": "unable to launch: process block failed: Leaving ProcessBlock on failed flushAllMutations: updateLeader"}
Jul 07 23:23:31 eos-trxdb12d systemd[1]: dfuse.service: Main process exited, code=exited, status=1/FAILURE
Jul 07 23:23:31 eos-trxdb12d systemd[1]: dfuse.service: Failed with result 'exit-code'.
Yeah, we should have simple retries there, with a backoff. I'll add this issue to the planning.
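A minimal sketch of what that retry could look like, assuming a hypothetical flushAllMutations-style callback and that the regionMiss / updateLeader / "not leader" failures can be recognized from the error message (the real trxdb-loader error types and the TiKV client API may differ):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// isRetryableTiKVError guesses whether an error is one of the transient
// region/leader errors seen in the logs above. String matching is only for
// illustration; real code would inspect typed errors from the TiKV client.
func isRetryableTiKVError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "regionMiss") ||
		strings.Contains(msg, "updateLeader") ||
		strings.Contains(msg, "not leader")
}

// flushWithRetry retries a flush (e.g. the batch put behind flushAllMutations)
// with exponential backoff, giving up after maxRetries attempts or on a
// non-retryable error.
func flushWithRetry(flush func() error, maxRetries int) error {
	backoff := 500 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := flush()
		if err == nil || !isRetryableTiKVError(err) || attempt > maxRetries {
			return err
		}
		fmt.Printf("flush attempt %d failed (%v), retrying in %s\n", attempt, err, backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
}

func main() {
	// Simulate a flush that hits two transient region misses before succeeding.
	failures := 2
	err := flushWithRetry(func() error {
		if failures > 0 {
			failures--
			return errors.New("flushAllMutations: regionMiss")
		}
		return nil
	}, 5)
	fmt.Println("final result:", err)
}
```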
Actually, it seems like most of those failures are a strong indicator that the TiKV cluster is overloaded. So another (maybe better) solution would be to stop, but save the last successfully written block number. That would let us fix the TiKV issues / reduce the load and then continue from where it stopped.
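If we went that route, the checkpoint could be as simple as persisting the last fully written block number somewhere durable. A rough sketch with a hypothetical file-based marker (trxdb-loader might equally keep this marker in the store itself):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const checkpointFile = "trxdb-loader.checkpoint" // hypothetical location

// saveCheckpoint records the last block that was fully flushed to TiKV.
func saveCheckpoint(blockNum uint64) error {
	return os.WriteFile(checkpointFile, []byte(strconv.FormatUint(blockNum, 10)), 0644)
}

// loadCheckpoint returns the saved block number, or 0 if no checkpoint exists.
func loadCheckpoint() (uint64, error) {
	data, err := os.ReadFile(checkpointFile)
	if os.IsNotExist(err) {
		return 0, nil // no checkpoint yet, start from the configured block
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// On a fatal TiKV error, record the last block that was fully flushed...
	if err := saveCheckpoint(128000596); err != nil {
		fmt.Println("saving checkpoint:", err)
		return
	}
	// ...and on the next start, resume right after it.
	last, err := loadCheckpoint()
	if err != nil {
		fmt.Println("loading checkpoint:", err)
		return
	}
	fmt.Println("resuming from block", last+1)
}
```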
Okay, there's no way to operate a TiKV cluster without this retry handling. We added a simple solution that waits for number_of_retries * 10 minutes between attempts. This helps a lot.
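For reference, that wait schedule ends up looking roughly like this (an assumption about how the number_of_retries * 10 minutes delay is applied per attempt; the actual implementation may cap or reset the counter differently):

```go
package main

import (
	"fmt"
	"time"
)

// waitBeforeRetry grows linearly with the number of retries already performed,
// giving an overloaded TiKV cluster progressively more time to recover.
func waitBeforeRetry(numberOfRetries int) time.Duration {
	return time.Duration(numberOfRetries) * 10 * time.Minute
}

func main() {
	for retries := 1; retries <= 3; retries++ {
		fmt.Printf("retry %d: waiting %s before flushing again\n", retries, waitBeforeRetry(retries))
	}
	// Output:
	// retry 1: waiting 10m0s before flushing again
	// retry 2: waiting 20m0s before flushing again
	// retry 3: waiting 30m0s before flushing again
}
```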