jesterj
Fast Scan Resume
The current fault tolerance achieves its goal, but when it resumes a very large scan it spends a period of time hashing documents only to determine that it has already seen them. We would like to provide a configuration option (via a method on the builder for the scanner) to skip this work and pick up where we left off without wasting as much CPU.
One possible route is to log each scanned ID after we've reported status for the initial document, and then, on resume, load that log of IDs into a trie that can be used to check IDs directly without hashing document contents. (Hashing remains, and is still required for subsequent scans.) Completing a scan should clear the log, so that the trie is not built if the previous scan completed successfully.
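A minimal sketch of the trie idea, to make the lookup concrete. Everything here is an assumption for illustration: the class name `ScannedIdTrie`, its methods, and the one-ID-per-line log format are hypothetical and not part of the existing JesterJ API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a character trie of document IDs already scanned. */
class ScannedIdTrie {

  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    boolean terminal; // true if a logged ID ends exactly at this node
  }

  private final Node root = new Node();
  private int size;

  /** Records an ID; adding the same ID twice has no further effect. */
  void add(String id) {
    Node node = root;
    for (int i = 0; i < id.length(); i++) {
      node = node.children.computeIfAbsent(id.charAt(i), c -> new Node());
    }
    if (!node.terminal) {
      node.terminal = true;
      size++;
    }
  }

  /** Returns true only for a complete ID that was added, not for a prefix of one. */
  boolean contains(String id) {
    Node node = root;
    for (int i = 0; i < id.length(); i++) {
      node = node.children.get(id.charAt(i));
      if (node == null) {
        return false;
      }
    }
    return node.terminal;
  }

  /** Number of distinct IDs stored. */
  int size() {
    return size;
  }

  /** Builds a trie from a log, assuming one ID per line; blank lines are skipped. */
  static ScannedIdTrie load(Reader log) throws IOException {
    ScannedIdTrie trie = new ScannedIdTrie();
    try (BufferedReader in = new BufferedReader(log)) {
      String line;
      while ((line = in.readLine()) != null) {
        if (!line.isEmpty()) {
          trie.add(line);
        }
      }
    }
    return trie;
  }
}
```

On resume, the scanner could call `contains(id)` before doing any hashing and skip documents that are already recorded. When a scan completes normally the log would be deleted, so the next scan would start with no trie and fall back to the usual hash comparison.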
We also need to consider whether we could avoid the log entirely and just mine the scanner_doc_hash table...