
Ignore errors and keep crawling

Open TowardMyth opened this issue 2 years ago • 8 comments

Hi there.

When I am archiving sites, sometimes grab-site will encounter a URL that it cannot connect to (i.e. connecting to the page times out). From my observation, whenever this happens, the scraping operation immediately errors out and quits, even though I have more URLs left to crawl.

For example, upon encountering a page that times out, this is printed:

ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
http://some-site.com:83/_somefolder/
ERROR Fetching ‘http://some-site.com:83/_somefolder/’ encountered an error: Connect timed out.
Finished grab some-hash https://some-site.com/ with exit code 4

Note that it is exit code 4, and not 0, i.e. there is an error.

Is there a way to ignore errors, and keep crawling?

TowardMyth avatar Jul 24 '21 06:07 TowardMyth

grab-site does keep crawling after errors like a connection error. If it is exiting too early, it is probably because it has run through the entire queue (perhaps because it didn't discover any URLs on the same domain?).

ivan avatar Jul 24 '21 06:07 ivan

Thanks. What does exit code 4 mean here?

TowardMyth avatar Jul 24 '21 06:07 TowardMyth

grab-site effectively becomes a wpull process, so that would be https://wpull.readthedocs.io/en/master/api/errors.html#wpull.errors.ExitStatus.network_failure - meaning that some request had a network failure. It doesn't mean that it exited immediately because of it.
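(A minimal sketch of how a wrapper script might interpret that, assuming grab-site is on PATH; treating exit code 4 as "finished, but some requests failed" rather than as a fatal error is a choice about what you want, not something grab-site enforces:)

```python
# Minimal sketch: run grab-site and interpret the wpull-style exit code.
# Exit code 4 (network_failure) means some request failed during the crawl,
# not that the crawl was aborted early.
import subprocess
import sys

result = subprocess.run(["grab-site", "https://some-site.com/"])

if result.returncode == 0:
    print("crawl finished with no errors")
elif result.returncode == 4:
    print("crawl finished, but some requests had network failures")
else:
    print(f"crawl failed with exit code {result.returncode}")
    sys.exit(result.returncode)
```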

ivan avatar Jul 24 '21 06:07 ivan

Okay thanks a lot for this explanation.

I have another, unrelated question: I've been using grab-site on JavaScript-heavy websites, particularly Wix-powered websites. However, grab-site doesn't render the JavaScript UI elements correctly.

Is there a way to archive these sites properly?

One possible solution I've been considering: I've been trying pywb's website recording feature. It seems like when I visit http://localhost:8080/my-web-archive/record/http://example.com/ with my browser, the JavaScript elements are saved properly, but if I visit with wget/curl, they aren't.

Is there a similar way to visit/render sites with a browser, using grab-site?
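(For what it's worth, that browser visit can be scripted; a rough sketch, assuming pywb is running at localhost:8080 with the collection above and that Playwright is installed, neither of which is part of grab-site:)

```python
# Rough sketch: drive a real browser through pywb's record endpoint so that
# JavaScript actually executes and the resources it fetches get recorded.
from playwright.sync_api import sync_playwright

RECORD_URL = "http://localhost:8080/my-web-archive/record/http://example.com/"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # "networkidle" waits until the page stops making requests, giving
    # JS-triggered fetches a chance to be recorded as well.
    page.goto(RECORD_URL, wait_until="networkidle")
    browser.close()
```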

TowardMyth avatar Jul 24 '21 07:07 TowardMyth

Yeah, that's a known issue. IIRC grab-site doesn't extract links from JavaScript, so they won't be saved. The JS itself will be saved since it is a page requisite, but none of the URLs it actually contacts will be.

You could use some sort of proxy with grab-site. Or you could use another tool, like https://github.com/internetarchive/brozzler.

TheTechRobo avatar Jul 24 '21 12:07 TheTechRobo

@TheTechRobo thanks! I'm new to this, so I'm not too sure what you mean by using some sort of proxy with grab-site, or how using a proxy would solve this. Would you be so kind as to elaborate?

TowardMyth avatar Jul 24 '21 17:07 TowardMyth

I mean a proxy that would parse and/or run JavaScript (and then add the links to the finished HTML, or put the links in a text file that can be used with grab-site -i). I don't know of any, but if you can find one (or code one), it might work :D

TheTechRobo avatar Jul 24 '21 18:07 TheTechRobo

Just realised: adding links to the HTML is a no-go, since we probably want clean archives.

But just a text file with URLs would probably be fine. :smile_cat:
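For what it's worth, a rough sketch of that text-file idea, assuming a headless browser via Playwright (not something grab-site ships with) and a hypothetical output file name:

```python
# Rough sketch: load a page in a headless browser, record every URL it
# actually requests (including JS-triggered ones) plus rendered <a href>
# links, and write them to a file for `grab-site -i discovered-urls.txt`.
from playwright.sync_api import sync_playwright

START_URL = "https://example.com/"  # placeholder target
discovered = set()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("request", lambda request: discovered.add(request.url))
    page.goto(START_URL, wait_until="networkidle")
    # Also collect plain links from the rendered DOM.
    for href in page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)"):
        discovered.add(href)
    browser.close()

with open("discovered-urls.txt", "w") as f:
    f.write("\n".join(sorted(discovered)) + "\n")
```

The resulting text file could then be fed to grab-site -i, as suggested above.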

TheTechRobo avatar Jul 28 '21 17:07 TheTechRobo