kafka-connect-hdfs
RecoveryHelper to speed up recovery after restart
This patch changes the WAL and the recovery process. The idea is to avoid the expensive scan of all table files, which can take longer than one hour for large tables, and to maintain a recovery record in the WAL instead. This record is written at the beginning of the WAL and is not surrounded by 'begin' and 'end' markers.

The following situations are possible:
a. There is no recovery record, and there are normal records in the WAL.
b. There is no recovery record, and there are no other records in the WAL.
c. There is a recovery record, and there are normal records in the WAL.
d. There is a recovery record, and there are no other records in the WAL.

Because the recovery record is written at the beginning, it contains the latest offset only when there is nothing else in the log, or when the other records are invalid (their temp files have been deleted). So in cases a, c and d the recovery process picks the committed file with the highest offset from the WAL, taken either from the recovery record or from the normal records. In case b, when the WAL is empty or does not exist, the latest offset is discovered through a full recursive folder scan.
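The decision across the four cases can be sketched roughly as follows. This is an illustration only, not the code in this patch: `RecoverySketch`, `WalContents`, and `latestCommittedOffset` are hypothetical names, the list of committed offsets is assumed to contain only valid normal records, and the `LongSupplier` stands in for the full recursive folder scan.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.function.LongSupplier;

public class RecoverySketch {

    /** Hypothetical view of a parsed WAL; not a class from kafka-connect-hdfs. */
    static final class WalContents {
        // Offset stored in the recovery record at the start of the WAL, if one exists.
        final OptionalLong recoveryOffset;
        // Offsets of committed files taken from valid normal records (assumed pre-filtered).
        final List<Long> committedOffsets;

        WalContents(OptionalLong recoveryOffset, List<Long> committedOffsets) {
            this.recoveryOffset = recoveryOffset;
            this.committedOffsets = committedOffsets;
        }
    }

    /**
     * Returns the latest committed offset.
     * Cases a, c, d: the highest offset found in the WAL.
     * Case b (empty or missing WAL): falls back to the expensive full scan.
     */
    static long latestCommittedOffset(WalContents wal, LongSupplier fullScan) {
        OptionalLong fromNormal = wal.committedOffsets.stream()
                .mapToLong(Long::longValue)
                .max();

        if (!wal.recoveryOffset.isPresent() && !fromNormal.isPresent()) {
            // Case b: nothing usable in the WAL, so scan the table directory.
            return fullScan.getAsLong();
        }

        long best = Long.MIN_VALUE;
        if (wal.recoveryOffset.isPresent()) {
            best = Math.max(best, wal.recoveryOffset.getAsLong());
        }
        if (fromNormal.isPresent()) {
            best = Math.max(best, fromNormal.getAsLong());
        }
        return best;
    }
}
```

In this sketch the scan is passed in as a supplier so that it runs only in case b; in every other case the WAL alone answers the question.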
@confluentinc It looks like @justpresident just signed our Contributor License Agreement. :+1:
Always at your service,
clabot
@kkonstantine would you please check this out? We've been running this in production for a while now.
@justpresident Thanks for making this PR. I'm curious what sort of speedup you are seeing in your environment?
The speedup, of course, depends on the number of existing files in the table. The initial scan, which usually takes around 1 hour for large tables, is eliminated completely. Startup is now instant.
Hello, any updates on this PR?
I don't work with kafka-connect anymore and don't have a setup with thousands of HDFS files to test on, but it seems the problem was solved in a very similar way in https://github.com/confluentinc/kafka-connect-hdfs/pull/556. Can someone please test it? If there is no problem, this PR can be closed.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
Roman Studenikin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.