jesterj Fast Scan Resume

Open nsoft opened this issue 1 year ago • 1 comment

The current fault tolerance achieves its goal, but when it resumes a very large scan it spends a period of time hashing documents only to determine that it has already seen them. We would like to provide a configuration option (via a method on the scanner's builder) to skip this and pick up where we left off without wasting as much CPU.

One possible route is to log each scanned ID after we've reported status for the initial document, and then load that log of IDs into a trie that can be used to check IDs directly without hashing document contents. (Hashing remains, and is still required for subsequent scans.) Completing a scan should clear the log, so that the trie is not built if the previous scan completed successfully.
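
To make the idea concrete, here is a minimal sketch of the trie lookup, assuming the log is a plain text stream with one document ID per line. The class name `SeenIdTrie`, its methods, and the log format are all hypothetical and are not part of JesterJ today; a real implementation would probably want a more compact trie and would need to decide where the log lives.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not an existing JesterJ class): a trie of already-scanned document IDs. */
class SeenIdTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a complete ID ends at this node
    }

    private final Node root = new Node();

    /** Record an ID as already scanned. */
    void add(String id) {
        Node node = root;
        for (int i = 0; i < id.length(); i++) {
            node = node.children.computeIfAbsent(id.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Return true only if this exact ID was added (a mere prefix does not match). */
    boolean contains(String id) {
        Node node = root;
        for (int i = 0; i < id.length(); i++) {
            node = node.children.get(id.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    /** Build the trie from a log assumed to hold one ID per line; blank lines are skipped. */
    static SeenIdTrie load(Reader log) throws IOException {
        SeenIdTrie trie = new SeenIdTrie();
        try (BufferedReader in = new BufferedReader(log)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    trie.add(line);
                }
            }
        }
        return trie;
    }
}
```

On resume, the scanner could call `contains(id)` before fetching or hashing a document and skip it on a hit. If the log is empty or missing (because the previous scan completed and cleared it), the trie would simply be empty and every document would take the normal hashing path.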

Mar 30 '23 19:03 nsoft

Also need to think about the possibility that we could avoid the log and just mine the scanner_doc_hash table...

Mar 30 '23 19:03 nsoft
