Give focus_crawl a chance to access page body before discarding it
For site-specific crawlers, it's fair enough to use focus_crawl like this:
anemone.focus_crawl do |page|
  if page.doc
    page.doc.search('.//a[@href]').map { |a| URI.parse(a[:href]) }
  else
    page.links
  end
end
However, when using the discard_page_bodies option, page.doc is nil by the time we enter this block. In this pull request I've moved the call to discard_doc! until after focus_crawl has been called.
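To make the failure mode concrete, here's a minimal sketch (the seed URL is made up for illustration; :discard_page_bodies is Anemone's real option):

require 'anemone'
require 'uri'

Anemone.crawl('http://example.com/', :discard_page_bodies => true) do |anemone|
  anemone.focus_crawl do |page|
    if page.doc
      page.doc.search('.//a[@href]').map { |a| URI.parse(a[:href]) }
    else
      # Before this change, page.doc was always nil here when
      # :discard_page_bodies was set, so this fallback branch was
      # always taken and the selector-based filtering never ran.
      page.links
    end
  end
end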
+1
@lankz I don't quite follow... I'm happy to accept this PR in the Medusa fork if you could please explain the use case a bit better and re-post it there :)
@brutuscat I stopped using Anemone a while ago, and can't seem to access the original documentation, but I believe the suggested use case for #focus_crawl is something like this:
anemone.focus_crawl do |page|
  page.links.select { |uri| uri.to_s =~ /productId=\d+/ }
end
which works just fine for simple crawls of well-structured sites. I needed to crawl a few large, messy sites, and the only way I could come up with to keep Anemone under control (crawling only the pages I was interested in, and keeping it from blowing memory) was to focus only on links that appear under certain elements on the page, using XPath and CSS selectors:
anemone.focus_crawl do |page|
  if page.doc
    # crawl only links found in the primary navigation bar
    page.doc \
      .search('.//nav/a[@href]') \
      .map { |a| URI.parse(a[:href]) }
  else
    # sometimes page.doc is nil, e.g. when the response is a redirect
    page.links
  end
end
The problem I ran into is that, when using the discard_page_bodies option, the page.doc object has already been discarded by the time the #focus_crawl block is called.
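For context, the discarding happens in Page#discard_doc!, which from memory looks roughly like this (treat it as an approximate sketch, the exact instance variables may differ):

# lib/anemone/page.rb (approximate, not verbatim)
def discard_doc!
  links # force the memoized link extraction before the document is freed
  @doc = @body = nil
end

This is also why the page.links fallback still works after the document is gone: the links are extracted and memoized before @doc is nilled out.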
The change in this pull request is simple: delay the call that discards the page body (discard_doc!) until after we've both extracted all the links (default Anemone functionality) and given #focus_crawl a chance to run.
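In Anemone's core crawl loop the reordering amounts to moving a single line, something like this (paraphrased from memory of lib/anemone/core.rb, so line-for-line details may vary):

# before: the doc was freed before links_to_follow ran the focus_crawl block
do_page_blocks page
page.discard_doc! if @opts[:discard_page_bodies]
links = links_to_follow page

# after: collect links (and run focus_crawl) first, then free the doc
do_page_blocks page
links = links_to_follow page
page.discard_doc! if @opts[:discard_page_bodies]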