Give focus_crawl a chance to access page body before discarding it
For site-specific crawlers, it's fair enough to use focus_crawl like this:
anemone.focus_crawl do |page|
  if page.doc
    page.doc.search('.//a[@href]').map { |a| URI.parse(a[:href]) }
  else
    page.links
  end
end
However, when using the discard_page_bodies option, page.doc is nil by the time we enter this block. In this pull request I've moved the call to discard_doc! until after focus_crawl has been called.
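To make the failure mode concrete, here's a minimal sketch (the seed URL is made up for illustration; :discard_page_bodies is Anemone's real option):

require 'anemone'
require 'uri'

Anemone.crawl('http://example.com/', :discard_page_bodies => true) do |anemone|
  anemone.focus_crawl do |page|
    if page.doc
      page.doc.search('.//a[@href]').map { |a| URI.parse(a[:href]) }
    else
      # Before this change, page.doc was always nil here when
      # :discard_page_bodies was set, so this fallback branch was
      # always taken and the selector-based filtering never ran.
      page.links
    end
  end
end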
+1
@lankz I don't quite follow... I'm happy to accept this PR in the Medusa fork if you could please explain the use case a bit better and re-post it there :)
@brutuscat I stopped using Anemone a while ago, and can't seem to access the original documentation, but I believe the suggested use case for #focus_crawl is something like this:
anemone.focus_crawl do |page|
  page.links.select { |uri| uri.to_s =~ /productId=\d+/ }
end
which works just fine for simple crawls of well-structured sites. I needed to crawl a few large, messy sites, and the only way I could come up with to keep Anemone under control (crawling only the pages I was interested in, and keeping it from blowing memory) was to focus only on links that appear under certain elements on the page, using XPath and CSS selectors:
anemone.focus_crawl do |page|
  if page.doc
    # crawl only links found in the primary navigation bar
    page.doc \
      .search('.//nav/a[@href]') \
      .map { |a| URI.parse(a[:href]) }
  else
    # sometimes page.doc is nil, e.g. when the response is a redirect
    page.links
  end
end
The problem I ran into is that, when using the discard_page_bodies option, the page.doc object has already been discarded by the time the #focus_crawl block is called.
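For context, the discarding happens in Page#discard_doc!, which from memory looks roughly like this (treat it as an approximate sketch, the exact instance variables may differ):

# lib/anemone/page.rb (approximate, not verbatim)
def discard_doc!
  links # force the memoized link extraction before the document is freed
  @doc = @body = nil
end

This is also why the page.links fallback still works after the document is gone: the links are extracted and memoized before @doc is nilled out.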
The change in this pull request is simple: delay the call that discards the page body (discard_doc!) until after we've both extracted all the links (default Anemone functionality) and given #focus_crawl a chance to run.
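In Anemone's core crawl loop the reordering amounts to moving a single line, something like this (paraphrased from memory of lib/anemone/core.rb, so line-for-line details may vary):

# before: the doc was freed before links_to_follow ran the focus_crawl block
do_page_blocks page
page.discard_doc! if @opts[:discard_page_bodies]
links = links_to_follow page

# after: collect links (and run focus_crawl) first, then free the doc
do_page_blocks page
links = links_to_follow page
page.discard_doc! if @opts[:discard_page_bodies]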