kafka-connect-hdfs

RecoveryHelper to speed up recovery after restart

Open justpresident opened this issue 5 years ago • 6 comments

This patch changes the WAL and the recovery process. The idea is to avoid the expensive scan of all table files, which can take longer than one hour for large tables, and to maintain a recovery record in the WAL instead. This record is written at the beginning of the WAL and is not surrounded by 'begin' and 'end' markers.

The following situations are possible:

a. There is no recovery record, but there are normal records in the WAL.
b. There is no recovery record and there are no other records in the WAL.
c. There is a recovery record and there are normal records in the WAL.
d. There is a recovery record but no other records in the WAL.

Since the recovery record is written at the beginning, it contains the latest offset only when there is nothing else in the log, or when the other records are invalid (temp files are deleted). So in cases a, c and d the recovery process picks the committed file with the highest offset from the WAL, either from the recovery record or from the normal records. In case b, when the WAL is empty or doesn't exist, the latest offset is discovered through a full recursive folder scan.
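The four cases above can be sketched as a small decision function. This is only an illustrative, language-agnostic model written in Python, not the connector's Java code: `WalRecord`, `latest_committed_offset` and the `full_scan` callback are hypothetical names, and the sketch assumes that invalid normal records (those whose temp files were deleted) have already been filtered out of the list.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WalRecord:
    """Hypothetical in-memory view of one WAL entry."""
    kind: str              # "recovery" or "normal" (labels assumed for this sketch)
    committed_offset: int  # offset of the committed file the entry refers to


def latest_committed_offset(
    records: Sequence[WalRecord],
    full_scan: Callable[[], int],
) -> int:
    """Return the offset to resume from.

    Cases a, c, d: the WAL holds at least one record, so take the highest
    committed offset found in it (recovery record or normal records).
    Case b: the WAL is empty or missing, so fall back to the expensive
    recursive folder scan, modeled here by the full_scan callback.
    """
    if records:
        return max(r.committed_offset for r in records)
    return full_scan()
```

Taking the maximum over all records, rather than trusting the recovery record alone, reflects the point made above: the recovery record sits at the start of the log, so later normal records may carry a higher offset.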

justpresident avatar Nov 18 '19 12:11 justpresident

@confluentinc It looks like @justpresident just signed our Contributor License Agreement. :+1:

Always at your service,

clabot

ghost avatar Nov 18 '19 12:11 ghost

@kkonstantine would you please check this out? we've been running this in production for a while now

alexandrfox avatar Dec 02 '19 13:12 alexandrfox

@justpresident Thanks for making this PR. I'm curious what sort of speedup you are seeing in your environment?

The speedup, of course, depends on the number of existing files in the table. The initial scan, which usually takes around 1 hour for large tables, is eliminated completely. The startup is now instant.

justpresident avatar Dec 23 '19 15:12 justpresident

Hello, Any updates on this PR?

pedro93 avatar Mar 02 '22 10:03 pedro93

I don't work with kafka-connect anymore and don't have a setup with thousands of HDFS files to test on, but it seems the problem was solved in a very similar way in https://github.com/confluentinc/kafka-connect-hdfs/pull/556. Can someone please test it? If there is no problem, this PR can be closed.

justpresident avatar Dec 27 '22 22:12 justpresident

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Roman Studenikin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

cla-assistant[bot] avatar Aug 27 '23 12:08 cla-assistant[bot]