grab-site
Ignore errors and keep crawling
Hi there.
When I am archiving sites, sometimes grab-site will encounter a URL that it cannot connect to (i.e. connecting to the page times out). From my observation, whenever this happens, the scraping operation immediately errors out and quits, even though I have more URLs left to crawl.
For example, upon encountering a page that times out, the following is printed:
ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
Finished grab some-hash https://some-site.com/ with exit code 4
Note that the exit code is 4, not 0, i.e. there is an error.
Is there a way to ignore errors, and keep crawling?
grab-site does keep crawling on errors like connection errors. If it is exiting too early, it is probably because it has run through the entire queue (perhaps because it didn't discover any URLs on the same domain?).
Thanks. What does exit code 4 mean here?
grab-site effectively becomes a wpull process, so that would be https://wpull.readthedocs.io/en/master/api/errors.html#wpull.errors.ExitStatus.network_failure - meaning that some request had a network failure. It doesn't mean that it exited immediately because of it.
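For example, if you wrap grab-site in a script, you could treat exit code 4 as "finished, but some requests had network failures" rather than as a fatal error. A rough sketch (assumes grab-site is on your PATH; the URL is a placeholder):

```python
# Rough sketch: run grab-site as a subprocess and interpret its exit code.
# 0 means no errors; 4 is wpull's ExitStatus.network_failure (see the link above).
import subprocess
import sys

url = "https://some-site.com/"  # placeholder

result = subprocess.run(["grab-site", url])

if result.returncode == 0:
    print("Crawl finished with no errors.")
elif result.returncode == 4:
    # The crawl still ran through the whole queue; some requests
    # simply had network failures (e.g. connection timeouts).
    print("Crawl finished, but some requests had network failures.")
else:
    print(f"Crawl exited with code {result.returncode}; see wpull's ExitStatus docs.")
    sys.exit(result.returncode)
```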
Okay thanks a lot for this explanation.
I have another unrelated question: I've been using grab-site on JavaScript-heavy websites, particularly Wix-powered websites. However, the archived copies don't render the JavaScript UI elements correctly.
Is there a way to archive these sites properly?
One possible solution I was thinking of: I've been trying pywb's website recording functions. It seems that when I visit http://localhost:8080/my-web-archive/record/http://example.com/
with my browser, the JavaScript elements are saved properly, but if I visit with wget/curl, they aren't.
Is there a similar way to visit/render sites with a browser using grab-site?
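For reference, this is roughly what I mean, driving a headless browser through pywb's record endpoint instead of clicking around manually (a rough sketch; it assumes Playwright is installed and pywb is running locally with a collection called my-web-archive):

```python
# Rough sketch: load a page through pywb's /record/ endpoint with a
# headless browser so that JavaScript-triggered requests get captured.
# Assumes Playwright is installed (pip install playwright; playwright install chromium)
# and pywb is serving a collection named "my-web-archive" on localhost:8080.
from playwright.sync_api import sync_playwright

RECORD_PREFIX = "http://localhost:8080/my-web-archive/record/"
TARGET = "http://example.com/"  # placeholder target site

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # pywb records every request the browser makes while the page loads,
    # including requests issued by JavaScript.
    page.goto(RECORD_PREFIX + TARGET, wait_until="networkidle")
    browser.close()
```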
Yeah, that's a known issue. IIRC grab-site doesn't extract links from JavaScript, so they won't be saved. The JS itself will be saved, as it is a page requisite, but not the URLs it actually contacts.
You could use some sort of proxy with grab-site. Or you could use another tool, like https://github.com/internetarchive/brozzler.
@TheTechRobo thanks! I'm new to this, so I'm not too sure what you mean by using some sort of proxy with grab-site, or how using a proxy would solve this. Would you be so kind as to elaborate?
I mean a proxy that would parse and/or run JavaScript, and then either add the links to the finished HTML or put the links in a text file that can be used with grab-site -i. I don't know of any, but if you can find one (or code one), it might work :D
Just realised: adding links to the HTML is a no-go, since we probably want clean archives.
But a text file with URLs would probably be fine. :smile_cat:
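Something like this might work as a starting point (a rough sketch, untested; it assumes Playwright is installed, and the target URL and output file name are placeholders): load the page in a headless browser, write every URL it requests to a text file, then feed that file to grab-site -i.

```python
# Rough sketch: capture the URLs a JavaScript-heavy page actually requests
# and write them to a text file that can be passed to `grab-site -i`.
# Assumes Playwright is installed; TARGET and OUTPUT are placeholders.
from playwright.sync_api import sync_playwright

TARGET = "https://some-site.com/"
OUTPUT = "discovered-urls.txt"

urls = set()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Record every URL the page requests, including ones triggered by JavaScript.
    page.on("request", lambda request: urls.add(request.url))
    page.goto(TARGET, wait_until="networkidle")
    browser.close()

with open(OUTPUT, "w") as f:
    f.write("\n".join(sorted(urls)) + "\n")

# Then: grab-site -i discovered-urls.txt
```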