
Handle SIGINT to save Scraper state

Open CohenArthur opened this issue 4 years ago • 5 comments

CohenArthur · Apr 13 '20 20:04

@CohenArthur @Skallwar

This could be done simply by comparing on the fly against the local output directory (if it exists) whilst walking the scraping operation forward.

Logic-wise, the scraper needs to walk the local directory to establish which leaves are still unresolved at any given point.

e.g. when fetching the initial index/hook point, the scraper can compare the would-be result against the local copy (if it exists), then iterate from there through the existing local output, skipping ahead until it meets resource references it doesn't yet have locally.
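A minimal sketch of that skip logic, assuming a hypothetical `local_path_for` mapping (suckit's real URL-to-path scheme may differ) and the `url` crate:

```rust
use std::path::{Path, PathBuf};
use url::Url;

// Hypothetical URL-to-path mapping; suckit's real naming scheme may differ.
fn local_path_for(output_dir: &Path, url: &Url) -> PathBuf {
    output_dir
        .join(url.host_str().unwrap_or("unknown"))
        .join(url.path().trim_start_matches('/'))
}

// A resource only needs fetching if its local counterpart is missing.
fn needs_download(output_dir: &Path, url: &Url) -> bool {
    !local_path_for(output_dir, url).exists()
}
```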

The good thing is that the scraper would not need to intercept SIGINT at all, as saving state is already baked in.

Even mocks could easily use those recorded outputs for benchmarks this way, like flipping a switch for replay.

A follow-up edge case/issue will be when the website itself changes but the underlying resources do not.

Tempfiles needed

Since tempfiles are not yet used, we cannot ensure the completeness of on-disk resources as of now (if I am guessing right).

The scraper would either need:

  • To issue a HEAD request to check the size; even then, we would not know the checksum to compare against without fetching the whole resource again.
  • To use temporary files and only rename after the final flush() and maybe a checksum check.

To ensure we have downloaded resources properly and flushed them to disk, a rename is much less flaky than trying to deal with the flush alone.
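A rough sketch of the tempfile-then-rename approach, assuming the `tempfile` crate (suckit may do this differently):

```rust
use std::io;
use std::path::Path;
use tempfile::NamedTempFile;

// Download into a tempfile in the same directory as the target, so the
// final rename stays on one filesystem and is atomic.
fn save_atomically(dir: &Path, final_name: &str, body: &mut impl io::Read) -> io::Result<()> {
    let mut tmp = NamedTempFile::new_in(dir)?;
    io::copy(body, &mut tmp)?;
    tmp.as_file().sync_all()?; // make sure the bytes actually hit the disk
    tmp.persist(dir.join(final_name)) // atomic rename into place
        .map(|_| ())
        .map_err(|e| e.error)
}
```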

Checksums

We can even check the length & checksum of what was written to disk before renaming, to ensure the write was successful.
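One hedged possibility for that verification step: hash while writing, then re-read the tempfile from disk before the rename (sketch assumes the `sha2` crate):

```rust
use sha2::{Digest, Sha256};
use std::{fs, io, path::Path};

// Re-read the file from disk and confirm both length and digest match
// what we computed while writing, before renaming it into place.
fn verify_written(path: &Path, expected_len: u64, expected_sha256: &[u8]) -> io::Result<bool> {
    let bytes = fs::read(path)?;
    Ok(bytes.len() as u64 == expected_len
        && Sha256::digest(&bytes).as_slice() == expected_sha256)
}
```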

pinkforest · Jul 06 '21 20:07

I need to give your comment a proper look tomorrow, but I think our biggest concern was about the scraper's different hashtables and its waiting queue.

If we can make sure that every ongoing download/DOM-tree fix is finished properly, we can then artificially starve the thread pool and dump our internal state to a file.

This avoids having to play "find the holes" in the files from the last scraping run.
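A minimal sketch of that dump/reload, with hypothetical field names, assuming `serde`/`serde_json` (the real hashtable and queue types will differ):

```rust
use serde::{Deserialize, Serialize};
use std::collections::{HashSet, VecDeque};
use std::fs::File;

// Hypothetical shape of the scraper's internal state.
#[derive(Serialize, Deserialize)]
struct ScraperState {
    visited: HashSet<String>,
    pending: VecDeque<String>,
}

// Dump the state once the thread pool has been drained...
fn dump(state: &ScraperState, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    serde_json::to_writer(File::create(path)?, state)?;
    Ok(())
}

// ...and reload it on the next run.
fn reload(path: &str) -> Result<ScraperState, Box<dyn std::error::Error>> {
    Ok(serde_json::from_reader(File::open(path)?)?)
}
```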

Skallwar · Jul 06 '21 21:07

I don't believe the local diff comparison when continuing a run would add much overhead if properly implemented.

Compare that to adding a fragile state flush driven by a non-portable trigger we can't always rely on.

There is a term called crash-only design/software that applies here.

There are a lot of gotchas/problems when trying to rely on SIGINT, especially if waiting on or persisting file I/O inside the handler.

Downloads especially should not be waited on if SIGINT has been issued.

Where the file is big, most downloads over HTTP can be continued via Range requests; some tools even exploit this for accelerated downloads by opening multiple connections that each fetch a different byte range, bypassing per-TCP-connection throttling/rate limits.
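As an illustration, resuming with a Range header might look like this (sketch assumes the blocking `reqwest` API, not necessarily what suckit uses):

```rust
use reqwest::blocking::Client;
use reqwest::header::RANGE;
use reqwest::StatusCode;
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

// Ask the server for the bytes we are still missing and append them.
fn resume(client: &Client, url: &str, partial: &Path) -> Result<(), Box<dyn std::error::Error>> {
    let offset = partial.metadata()?.len();
    let resp = client
        .get(url)
        .header(RANGE, format!("bytes={}-", offset))
        .send()?;
    if resp.status() == StatusCode::PARTIAL_CONTENT {
        let mut file = OpenOptions::new().append(true).open(partial)?;
        file.write_all(&resp.bytes()?)?;
    }
    // Any other status (e.g. a plain 200) means the server ignored the
    // range and the download should restart from scratch instead.
    Ok(())
}
```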

Using signals is not portable either, and the signal may never be emitted.

I could have destroyed the Docker container I was running this in, but I wanted to leave around the data I had already sucked in.

For all server processes, regardless of which OS they run on, I've always adopted another approach, e.g. incremental logs / binlogs, which are more robust & simpler and do the job most reliably without breaking portability.

The result of a state flush (the index, essentially) is something we can get cheaply (when properly implemented) from the native output alone anyway, and it can be used to reconstruct whatever state information is needed later for comparison during the walk.

Yes, if the site has 10,000 files we might need to rebuild the index by re-evaluating the DOM and checking whether a file already exists for each node (or catch up from where it crashed by tracing the incremental log). But I would rather pay that overhead than, say, deadlock the process during shutdown and have to start from scratch because the state information was corrupted or lost in a way that prevents recovery from the crash.
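A minimal sketch of such an incremental log, std-only, with hypothetical function names:

```rust
use std::collections::HashSet;
use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};

// Append one completed URL per line; a torn last line after a crash can
// simply be discarded on replay.
fn log_completed(log_path: &str, url: &str) -> std::io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open(log_path)?;
    writeln!(log, "{}", url)
}

// On startup, replay the log to rebuild the "already done" set.
fn replay(log_path: &str) -> std::io::Result<HashSet<String>> {
    let reader = BufReader::new(File::open(log_path)?);
    Ok(reader.lines().filter_map(Result::ok).collect())
}
```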

Also, interestingly, we are already indexing via the filesystem by using the directory hierarchy.

If we can guarantee that every file in the filesystem is complete, it makes things simpler.

Perhaps when downloading, say, a 4 GB file we might not want to restart from zero, and might want to record in a tempfile where we left off.

For our walk comparison, a simple output directory should do, with maybe an incremental log if we really need metadata beyond tempfiles. Considering there aren't multiple threads downloading the same file, there is no need IMO to force-flush any special state information at exit.

There is also a USENIX paper from 2003 on crash-only design if you want to read more :)

pinkforest · Jul 06 '21 22:07

I think the original idea was to serialize the scraper's state to disk and reload it afterwards. Any download that had not completed at the time would simply not be serialized and would be re-downloaded upon reload. I'm afraid that running through the already downloaded data is gonna be a really big hit compared to deserializing a single file, but I've never done crash-only design haha.

There is a valid concern that signals would not be portable. But we could also fetch user keypresses and catch the Ctrl-C ourselves for our Windows friends (I have no idea how Ctrl-C works / if it even works on Windows).
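For what it's worth, the `ctrlc` crate already abstracts this cross-platform (on Windows it hooks the console control handler, if I recall correctly); a minimal sketch:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

fn main() {
    let running = Arc::new(AtomicBool::new(true));
    let r = running.clone();
    // ctrlc handles SIGINT on Unix and console control events on Windows.
    ctrlc::set_handler(move || r.store(false, Ordering::SeqCst))
        .expect("failed to set Ctrl-C handler");

    while running.load(Ordering::SeqCst) {
        // Placeholder for the actual scraping work.
        std::thread::sleep(Duration::from_millis(100));
    }
    // Drain the pool and dump state here before exiting.
}
```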

CohenArthur · Jul 07 '21 07:07

Most people will run this in Docker non-interactively... or at least I do :)

Maybe I'll quickly sketch an implementation for a PR that better shows what I mean, and how performant it can be whilst remaining robust/portable at the same time.

pinkforest · Jul 07 '21 08:07