
Whitelist start URLs?

Open janpieper opened this issue 11 years ago • 1 comments

If you use #follow_links_like and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted?

start_urls = [ "http://www.example.com/foo/bar" ]
Polipus.crawler("dummy", start_urls, options) do |crawler|
  crawler.follow_links_like(/\/bar\/foo/)
end

The links on the start page would match the given regexp, but the start URL itself does not, so the crawl never gets going.
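The mismatch can be reproduced without running the crawler at all; this is just a plain-Ruby illustration of why the seed is filtered out, using the URL and pattern from the example above:

```ruby
# The configured pattern never matches the start URL itself, so the
# seed page is rejected before its (matching) links are ever extracted.
start_url = "http://www.example.com/foo/bar"
pattern   = /\/bar\/foo/

puts start_url.match?(pattern)  # false: the seed URL is rejected
```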

janpieper avatar Jun 26 '14 08:06 janpieper

At https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L163 we check #should_be_visited?. This allows skipping a URL when the policy has changed during the crawl session but the page was already queued.

#should_be_visited? https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L351 returns false when the link does not match the pattern.

#page_exists? already checks page.user_data.p_seeded. Maybe we need to check this value in #should_be_visited? as well, so seeded start URLs bypass the pattern check.
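A minimal sketch of that proposed check, using mock page objects rather than the actual Polipus classes (the method signature and the `p_seeded` flag name are taken from the issue; this is not the real implementation):

```ruby
require 'ostruct'

# Sketch: a page seeded as a start URL (user_data.p_seeded) would bypass
# the follow_links_like pattern check; all other pages must match.
def should_be_visited?(page, patterns)
  return true if page.user_data && page.user_data.p_seeded
  patterns.any? { |pattern| page.url =~ pattern }
end

seeded = OpenStruct.new(
  url: "http://www.example.com/foo/bar",
  user_data: OpenStruct.new(p_seeded: true)
)
crawled = OpenStruct.new(
  url: "http://www.example.com/other",
  user_data: OpenStruct.new(p_seeded: false)
)

patterns = [/\/bar\/foo/]
puts should_be_visited?(seeded, patterns)   # true: seed is whitelisted
puts should_be_visited?(crawled, patterns)  # false: non-matching link skipped
```

With this change, a start URL that fails the regexp would still be visited once, while pages queued later remain subject to the configured patterns.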

tmaier avatar Jun 26 '14 08:06 tmaier