Whitelist start URLs?
If you use #follow_links_like and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted?
start_urls = ["http://www.example.com/foo/bar"]

Polipus.crawler("dummy", start_urls, options) do |crawler|
  # the start URL path (/foo/bar) does not match this pattern
  crawler.follow_links_like(/\/bar\/foo/)
end
The links on the start page do match the given regexp; it is only the start URL itself that does not.
At https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L163 we check #should_be_visited?. This allows skipping a URL when the policy has changed during the crawl session but the page was already queued.
#should_be_visited? (https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L351) returns false when the link does not match the pattern.
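For illustration, here is a condensed, self-contained sketch of that whitelist behaviour (the helper name and variable names are assumptions for the example, not the actual Polipus internals): once follow_links_like patterns are configured, a URL whose path matches none of them is rejected, which is exactly what happens to the start URL above.

require "uri"

# Simplified stand-in for the whitelist part of #should_be_visited?
# (assumed shape; see the linked source for the real logic).
def matches_follow_patterns?(url, follow_links_like)
  return true if follow_links_like.empty?            # no whitelist configured
  follow_links_like.any? { |pattern| url.path =~ pattern }
end

start_url = URI("http://www.example.com/foo/bar")
matches_follow_patterns?(start_url, [/\/bar\/foo/])  # => false: the seed is dropped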
#page_exists? already checks for page.user_data.p_seeded. Maybe we need to also check that value in the case above.
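A minimal sketch of what that could look like (a hypothetical helper for illustration, not a patch against Polipus): a page whose user_data.p_seeded flag is set bypasses the pattern check, so start URLs are effectively whitelisted.

require "uri"
require "ostruct"

# Hypothetical seed-aware variant of the check, for illustration only.
def should_be_visited_with_seed_check?(page, follow_links_like)
  # seeded start URLs are always visitable, mirroring what
  # #page_exists? already does with page.user_data.p_seeded
  return true if page.user_data && page.user_data.p_seeded
  follow_links_like.empty? || follow_links_like.any? { |p| page.url.path =~ p }
end

seed = OpenStruct.new(url: URI("http://www.example.com/foo/bar"),
                      user_data: OpenStruct.new(p_seeded: true))
should_be_visited_with_seed_check?(seed, [/\/bar\/foo/])  # => true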