cobweb
Web crawler with very flexible crawling options. It can run standalone or be used with Resque to perform clustered crawls.
Hi, I'm getting the following error when I try to use cobweb from the command line. Here is the full stack trace: ``` # /Users/gustavo/.rvm/gems/ruby-2.3.1@site-shift/gems/cobweb-1.1.0/bin/cobweb:13:in `block in ': undefined method `banner'...
I have added code in lib/cobweb.rb for authenticating proxies. Along with this, I have added a feature to rotate proxies for each request, i.e. a "proxy_shift" method inside the 'Cobweb' class...
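A minimal sketch of what per-request proxy rotation could look like. This is not the code from the issue, which is truncated above: the `ProxyRotator` class, the hash keys, and the example hosts are all hypothetical. Only the method name `proxy_shift` comes from the issue text.

```ruby
require "net/http"

# Hypothetical helper illustrating round-robin proxy rotation.
# It is NOT cobweb's implementation of proxy_shift.
class ProxyRotator
  def initialize(proxies)
    raise ArgumentError, "at least one proxy is required" if proxies.empty?

    @proxies = proxies.dup
    @index = 0
    @lock = Mutex.new # guards @index when several threads share the rotator
  end

  # Returns the next proxy in round-robin order, wrapping at the end.
  def proxy_shift
    @lock.synchronize do
      proxy = @proxies[@index]
      @index = (@index + 1) % @proxies.size
      proxy
    end
  end
end

# Builds (but does not open) an HTTP connection through the next proxy.
# Net::HTTP.new accepts proxy address, port, user and password as
# its third to sixth arguments.
def http_via_next_proxy(host, rotator)
  proxy = rotator.proxy_shift
  Net::HTTP.new(host, 80, proxy[:host], proxy[:port], proxy[:user], proxy[:password])
end
```

Each call to `http_via_next_proxy("example.com", rotator)` would pick up the next proxy in the list, so consecutive requests go out through different authenticated proxies.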
External URLs are not treated as external if they match the cache. A test should be done when retrieving from the cache to make sure that all criteria are checked...
Slop 4 introduces [breaking changes](https://github.com/leejarvis/slop#upgrading-from-version-3) and doesn't support git-style sub-commands anymore. Cobweb should either depend on Slop 3.6.0 or update the cobweb executable to the new syntax.
Is it possible to somehow start, stop, or pause crawling a website from the command line?
Hello, As far as I can see, the generated hash for each page doesn't include the "depth" information, that is to say how many clicks from the homepage each page...
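To illustrate the "depth" value this issue asks for, here is a sketch of a breadth-first traversal that records how many clicks each page is from the start page. It works on an in-memory link map rather than on cobweb's real page hashes; the `crawl_depths` function and the `links` structure are assumptions made for the example.

```ruby
# links: hypothetical Hash mapping a URL to the URLs it links to,
# standing in for the links extracted from each crawled page.
# Returns a Hash mapping each reachable URL to its click depth.
def crawl_depths(start_url, links)
  depths = { start_url => 0 }
  queue = [start_url]

  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |target|
      # Breadth-first order means the first visit is the shortest path.
      next if depths.key?(target)

      depths[target] = depths[url] + 1
      queue << target
    end
  end

  depths
end

site = {
  "/"        => ["/about", "/blog"],
  "/blog"    => ["/blog/1", "/"],
  "/blog/1"  => ["/about"]
}
depths = crawl_depths("/", site)
```

Because the queue is processed in breadth-first order, a page linked from several places keeps the smallest depth at which it was first seen.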
Sometimes Addressable::URI mangles URLs into something incorrect. See: https://github.com/sporkmonger/addressable/issues/160 When cobweb crawls one of these, the correct URL is put into Redis, but the normalized URL hits a 404. Examples...
When the redirect limit is hit, it kills the crawl. The RedirectError is thrown but doesn't seem to be trapped; it seems to be thrown for each subsequent call into...
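One way to keep a single redirect loop from ending the whole crawl is to rescue the error per URL and move on. The sketch below shows that pattern. It uses an in-memory redirect map and a hypothetical `RedirectLimitError` class; it does not reproduce cobweb's own RedirectError or its crawl loop.

```ruby
# Hypothetical error class, standing in for the crawler's redirect error.
class RedirectLimitError < StandardError; end

# redirects: hypothetical Hash mapping a URL to the URL it redirects to,
# standing in for HTTP 3xx responses.
def resolve(url, redirects, limit: 5)
  hops = 0
  while redirects.key?(url)
    raise RedirectLimitError, "redirect limit of #{limit} reached at #{url}" if hops >= limit

    url = redirects[url]
    hops += 1
  end
  url
end

# Resolves every URL, recording failures instead of aborting the crawl.
def crawl(urls, redirects, limit: 5)
  results = { resolved: {}, failed: {} }
  urls.each do |url|
    begin
      results[:resolved][url] = resolve(url, redirects, limit: limit)
    rescue RedirectLimitError => e
      # Trap the error for this URL only, then continue with the next one.
      results[:failed][url] = e.message
    end
  end
  results
end
```

With this structure, a URL caught in a redirect loop ends up in `results[:failed]` while the remaining URLs are still processed.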
Hi, first I want to say thank you for sharing this crawler and for the work you put in it. Here is our experience with it and thoughts for improvements....
There seem to be occasional issues with connections to Redis under load. Cobweb needs to provide a way to specify your own Redis instance, and the handling of dropped connections needs to be checked.