
Add support for crawling subdomains

alexspeller opened this issue 13 years ago · 3 comments

Merge changes to support subdomain crawling from https://github.com/runa/anemone/commit/91559bde052956cfc40ae62678ec2a61574cf928
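
(In essence, supporting subdomains means relaxing the crawler's same-host link check to accept any host under the seed's domain. A minimal sketch of that check, for readers skimming the thread; this is illustrative only, not the linked commit's code, and the `in_scope?` helper name is made up:)

```ruby
require 'uri'

# A link is "in scope" if its host is the seed's host itself
# or any subdomain of it (e.g. blog.example.com under example.com).
def in_scope?(link, seed)
  seed_host = URI(seed).host
  link_host = URI(link).host rescue nil
  return false unless link_host
  link_host == seed_host || link_host.end_with?(".#{seed_host}")
end

in_scope?("http://blog.example.com/post", "http://example.com") # => true
in_scope?("http://other.org/", "http://example.com")            # => false
```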

alexspeller · Aug 03 '11

This feature is very useful. I think Anemone should also support printing external links: just print them out, without crawling into them. The link checker tool XENU (http://home.snafu.de/tilman/xenulink.html) has this feature.

MaGonglei · Nov 07 '11

MaGonglei: It is very simple to gather external links using Anemone, and comparably simple to check that those links are actually valid. The 'on_every_page' block is very helpful in this regard.

If you'd like some code that does exactly what you are asking, I could send an example your way.
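
(A minimal sketch of what the above describes, assuming Anemone's `on_every_page` callback and the Nokogiri document it exposes as `page.doc`; the seed URL is a placeholder, and stdlib `URI.join` is used to absolutize links rather than any Anemone helper:)

```ruby
require 'anemone'
require 'uri'

seed      = "http://www.example.com"   # placeholder seed URL
seed_host = URI(seed).host
external  = []

Anemone.crawl(seed) do |anemone|
  anemone.on_every_page do |page|
    next unless page.doc               # skip non-HTML responses
    page.doc.xpath('//a[@href]').each do |a|
      begin
        abs = URI.join(page.url.to_s, a['href'])
      rescue URI::Error
        next                           # ignore malformed hrefs
      end
      # keep links pointing at a different host than the seed
      external << abs.to_s if abs.host && abs.host != seed_host
    end
  end
end

puts external.uniq
```

(Since Anemone parses each fetched page into `page.doc` in the course of crawling anyway, the extra XPath pass should add comparatively little CPU or memory on top of the crawl itself.)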

wokkaflokka · Nov 08 '11

Hi wokkaflokka, thanks for your reply. I think I know what you mean, but I would prefer to have this feature available when initializing the crawl, like: Anemone.crawl("http://www.example.com", :external_links => false) do |anemone| ... end

Because if I use the "on_every_page" block to search for external links (e.g. page.doc.xpath('//a[@href]')), it seems to cost too much CPU and memory.

If I'm wrong, please send me the example.

Thanks.

MaGonglei · Nov 14 '11