spidr
A versatile Ruby web spidering library that can spider a single site, multiple domains, specific links, or crawl indefinitely. Spidr is designed to be fast and easy to use.
Add optional logging/debug output to `Spidr::Agent`. `Agent#initialize` should accept a `logger` option for passing in custom [Logger](https://rubydoc.info/stdlib/logger/Logger)-compatible objects. It should also support a `logging: true|false` option, which initializes `@logger`...
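A minimal sketch of what the proposed options might look like; the `Agent` class, `logger:`, and `logging:` keywords below illustrate the feature request and are not Spidr's current API:

```ruby
require 'logger'

# Hypothetical sketch of the proposed options; this Agent class is an
# illustration of the feature request, not Spidr's actual implementation.
class Agent
  attr_reader :logger

  def initialize(logger: nil, logging: false)
    # A caller-supplied Logger-compatible object takes precedence;
    # `logging: true` falls back to a default Logger on $stderr;
    # otherwise @logger stays nil and logging is disabled.
    @logger = logger || (Logger.new($stderr) if logging)
  end

  def visit(url)
    @logger&.debug("visiting #{url}")
    # ... actual request logic would go here ...
  end
end
```

Accepting any Logger-compatible object keeps the gem agnostic about where output goes (a file, syslog, a test spy), while the boolean gives a one-flag default for quick debugging.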
Switch from using Ruby's `net/http` to using [async-http](https://github.com/socketry/async-http#readme). This would allow for easy connection pooling and concurrent requests, without the overhead of threads and mutexes.
Hi, it seems that when `$_SERVER['REQUEST_URI']` or similar is used AND the web server is configured to return custom error pages (including 200 statuses), Spidr ends up in an infinite...
Howdy. We just had a big debugging session centered around redirects: it turned out the site was redirecting from a non-www to a www.domain URL, so spidr silently failed, finding...
Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option toggle, so I tried implementing it this way. It...
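One common workaround when a crawler offers no redirect toggle is to follow `Location` headers by hand. The sketch below is illustrative, not part of Spidr; the injectable `fetcher` keyword exists only so the hop-following logic can be exercised without a network:

```ruby
require 'net/http'
require 'uri'

# Minimal, hypothetical sketch of following HTTP redirects manually.
# A hop budget (`limit`) guards against redirect loops like the
# custom-error-page one described above.
def fetch_following_redirects(url, limit: 5,
                              fetcher: ->(u) { Net::HTTP.get_response(URI(u)) })
  raise 'too many redirects' if limit.zero?

  response = fetcher.call(url)
  if response.is_a?(Net::HTTPRedirection)
    # Follow the Location header, decrementing the hop budget.
    fetch_following_redirects(response['location'],
                              limit: limit - 1, fetcher: fetcher)
  else
    response
  end
end
```

This mirrors the non-www → www case from the earlier report: a 301 hop is followed transparently instead of being treated as a dead end.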
_Side note_: First of all, thank you for an awesome gem. Over the past years I've reached for this gem numerous times for various purposes big and small; its...
__Overview__

- Supports index files
- Supports gzipped files
- Tries common Sitemap XML locations
- With `robots: true` will try to fetch sitemap locations from `/robots.txt`
- Each found...
I'm opening this issue for the sole purpose of saying thank you so much for your hard work 🙏. Also, as a side note, I ran the specs against ruby...
Automatically detecting and parsing `/sitemap.xml` might be a good way to cut down on spidering depth.
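The parsing side of that idea can be sketched with the stdlib alone: detect a gzipped sitemap by its magic bytes, inflate it, and collect every `<loc>` URL. The module and method names below are illustrative, not Spidr's API:

```ruby
require 'rexml/document'
require 'zlib'
require 'stringio'

# Hypothetical sitemap-parsing sketch, covering plain and gzipped
# sitemaps as well as <sitemapindex> files (whose entries also use <loc>).
module SitemapSketch
  GZIP_MAGIC = "\x1f\x8b".b

  # Inflate the payload only if it carries the gzip magic bytes.
  def self.inflate(data)
    return data unless data.b.start_with?(GZIP_MAGIC)

    Zlib::GzipReader.new(StringIO.new(data)).read
  end

  # Walk the XML tree and return the text of every <loc> element.
  # Matching on the local element name sidesteps the default
  # sitemaps.org namespace.
  def self.extract_urls(data)
    doc  = REXML::Document.new(inflate(data))
    urls = []
    walk(doc.root) { |el| urls << el.text.strip if el.name == 'loc' }
    urls
  end

  def self.walk(element, &block)
    block.call(element)
    element.elements.each { |child| walk(child, &block) }
  end
end
```

Seeding the crawl queue from these URLs would let the agent reach deep pages directly instead of discovering them link by link, which is the depth saving the suggestion is after.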