
infinite recursion on offsite links?

Open TheTechRobo opened this issue 3 years ago • 3 comments

how would I go about enabling that?

TheTechRobo avatar Jul 27 '21 15:07 TheTechRobo

How deep do you really want to go?

A middle ground ideally would be to support a configurable depth for crawls to avoid finding every page on the internet.

Unless that's your thing... You can try using it as is; going by what it says, it seems like it should do that. I just haven't considered that a reasonable thing for a single process to be responsible for, and I haven't experimented with it much beyond basic/plaintext sites.

Personally, I always run with --no-offsite-links (avoid following links to a depth of 1 on other domains). It will still crawl immediate prerequisite resources, but not any links found past that. Then I set up a whole crawl of the site, read the index for off-site URLs, and divide those sites up into separate crawls. You could call it a system.
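The "read the index, then split by site" step could be sketched roughly like this. It is only an illustration: `group_offsite_urls` is a hypothetical helper, not part of grab-site, and it assumes the off-site URLs have already been pulled out of the crawl index into a plain list of strings (how to do that is not covered here).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_offsite_urls(urls, site_host):
    """Group URLs by host, skipping the crawled site itself.

    Hypothetical helper for illustration only. Hosts are compared
    exactly, so www.example.com counts as off-site for example.com.
    """
    groups = defaultdict(list)
    for url in urls:
        # .hostname is lowercased, and None when the URL has no host
        # (for example mailto: links).
        host = urlsplit(url).hostname
        if host is None or host == site_host.lower():
            continue
        groups[host].append(url)
    return dict(groups)


if __name__ == "__main__":
    found = [
        "https://example.com/about",
        "https://other.org/page1",
        "https://other.org/page2",
    ]
    # Each key is a candidate for its own separate crawl.
    for host, host_urls in group_offsite_urls(found, "example.com").items():
        print(host, len(host_urls))
```

Each resulting host can then be handed to its own crawl, which keeps any single process from having to follow the whole web.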

What did you do? What should happen? What happened?

acrois avatar Aug 23 '21 05:08 acrois

I never really found a solution. It isn't really a much-needed feature for me; it would just be nice to have a configurable depth, including "inf" for infinite.

TheTechRobo avatar Aug 24 '21 14:08 TheTechRobo

The depth is infinite by default, but grab-site hardcodes the --span-hosts-allow wpull option, which prevents recursion on off-site pages. So you need to reset that to the default empty value. Maybe --wpull-args='--span-hosts --span-hosts-allow ""' would do the trick. Not sure if there are further reasons that would prevent the recursion though.
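To see what that argument string would turn into, here is a small sketch that splits it with POSIX shell rules. It assumes the string is tokenized shell-style before being handed to wpull (an assumption, not verified here), and `split_wpull_args` is a hypothetical name. It only shows that the `""` survives as an empty argument; it says nothing about whether wpull then recurses off-site.

```python
import shlex

# The value suggested above for --wpull-args, exactly as written.
WPULL_ARGS = '--span-hosts --span-hosts-allow ""'


def split_wpull_args(arg_string):
    """Split an argument string the way a POSIX shell would.

    Hypothetical helper: assumes shell-style tokenization of the
    --wpull-args value, which is not confirmed in this thread.
    """
    return shlex.split(arg_string)


if __name__ == "__main__":
    # The quoted empty string becomes a real, empty token.
    print(split_wpull_args(WPULL_ARGS))
```

Under that assumption, --span-hosts-allow receives an empty value rather than being dropped, which is what resetting it to the default would need.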

JustAnotherArchivist avatar Dec 04 '21 04:12 JustAnotherArchivist