spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Hi there, I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x is...
Add `get`, `head`, `post`, `put`, etc. methods to `Spidr::Agent` for when you do not want a Page object returned, just the raw response.
Currently `<base>` tags are not taken into account and will send the spider to the wrong URL on pages with a base tag. With this patch, the spider correctly calculates...
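For illustration, here is a minimal sketch of how a relative link might be resolved when a page declares a base tag. It uses only Ruby's standard `uri` library and is not the patch itself; `resolve_link` and its parameters are hypothetical names, and extracting the `href` from the page's HTML is assumed to happen elsewhere.

```ruby
require 'uri'

# Hypothetical helper (not part of Spidr): resolves a link against the page's
# base href, falling back to the page URL when no base tag is present.
def resolve_link(page_url, link, base_href = nil)
  base = URI(page_url)
  # The base href may itself be relative, so resolve it against the page URL first.
  base = base.merge(base_href) if base_href
  base.merge(link)
end

# Without a base tag, the link resolves relative to the page URL.
puts resolve_link('http://example.com/a/b.html', 'c.html')
# With a base href of "/docs/", the same link resolves under /docs/ instead.
puts resolve_link('http://example.com/a/b.html', 'c.html', '/docs/')
```

Ignoring the base href sends the spider to the first URL even when the page author intended the second.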
I've just run into a situation where the reuse of an SSL session caused an exception and Spidr subsequently skipped the page. Currently, the exception is silently swallowed, so I...
To reduce lookup time in the `Spidr::Agent#queue`, we can store the URLs in a Hash of the unique `host:port` pair and the URL paths. This will also facilitate events for...
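One possible shape for such a structure is sketched below. It is an assumption-laden illustration, not Spidr's actual queue: `HostKeyedQueue` and its methods are hypothetical names, and storing each path together with its query string (via `request_uri`) is a choice made for this sketch.

```ruby
require 'uri'
require 'set'

# Hypothetical sketch, not Spidr's implementation: groups queued URLs under
# their "host:port" pair so that lookups only touch one host's paths.
class HostKeyedQueue
  def initialize
    # Each "host:port" key maps to the Set of paths queued for that host.
    @paths = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Adds the URL; returns true if it was new, false if it was already queued.
  def enqueue(url)
    uri = URI(url)
    @paths[key_for(uri)].add?(uri.request_uri) ? true : false
  end

  # Checks membership without creating an entry for unseen hosts.
  def queued?(url)
    uri = URI(url)
    key = key_for(uri)
    @paths.key?(key) && @paths[key].include?(uri.request_uri)
  end

  # Lists the "host:port" pairs seen so far.
  def hosts
    @paths.keys
  end

  private

  def key_for(uri)
    "#{uri.host}:#{uri.port}"
  end
end
```

Because the first-level keys are the `host:port` pairs, noticing that a host is being seen for the first time reduces to checking whether its key already exists.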
Discussed in IRC the other day. Noting it here for posterity. Could look into using http://github.com/alexdunae/css_parser for this, although there may be a more efficient path. ``` parser = CssParser::Parser.new...
Using `Addressable::URI` would allow spidr to handle IDN domains.