crawl

A concurrent crawler that minimizes memory use. Output suitable for use with BigQuery.

Results 10 crawl issues

This adds a way for potential users without a locally installed and properly configured Go environment to build and install binaries. If you're not interested in producing official binary releases...

This change moves the concept of depth to the currently active URL, to enable continuous crawling. Before this change, the crawl speed was only as fast as the slowest URL...
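
One way to picture depth travelling with each URL, rather than the crawl advancing one whole level at a time, is the minimal sketch below. It is not the project's implementation: `task`, `crawl`, and the `links` callback (a stand-in for fetching a page and extracting its links) are hypothetical names chosen for illustration.

```go
package main

import "sync"

// task pairs a URL with the depth at which it was discovered, so no
// shared "current level" has to finish before deeper URLs can start.
type task struct {
	URL   string
	Depth int
}

// crawl visits everything reachable from start within maxDepth and returns
// each URL with the depth at which it was first discovered. links is a
// hypothetical stand-in for fetching a page and extracting its links.
func crawl(start string, maxDepth, workers int, links func(string) []string) map[string]int {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen = map[string]int{start: 0}
		sem  = make(chan struct{}, workers) // caps concurrent fetches
	)

	var visit func(t task)
	visit = func(t task) {
		defer wg.Done()

		sem <- struct{}{}
		found := links(t.URL)
		<-sem

		// A page at the depth limit is still fetched, but its links
		// are not followed.
		if t.Depth >= maxDepth {
			return
		}
		for _, u := range found {
			mu.Lock()
			_, dup := seen[u]
			if !dup {
				// First discovery wins; with concurrent workers this
				// is not guaranteed to be the shortest path.
				seen[u] = t.Depth + 1
			}
			mu.Unlock()
			if !dup {
				wg.Add(1)
				go visit(task{URL: u, Depth: t.Depth + 1})
			}
		}
	}

	wg.Add(1)
	go visit(task{URL: start, Depth: 0})
	wg.Wait()
	return seen
}
```

Because every task knows its own depth, a slow URL only delays its own descendants; the rest of the crawl keeps going.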

The possibilities for duplicate content checking using SHA512 are limited. What do you think of swapping that out for Simhash so more nuanced comparisons of content would be possible? The...

enhancement
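
To illustrate the difference the issue is pointing at: a cryptographic hash such as SHA512 can only say whether two pages are byte-identical, while a Simhash fingerprint can be compared by Hamming distance. The sketch below is a minimal 64-bit Simhash. The whitespace tokenization, lowercasing, and FNV-1a token hash are assumptions made for the example, not choices taken from the issue or the project.

```go
package main

import (
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash returns a 64-bit fingerprint of text. Each token votes +1 or -1
// on every bit position according to its own hash; the sign of the total
// decides the corresponding fingerprint bit.
func simhash(text string) uint64 {
	var votes [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		v := h.Sum64()
		for i := 0; i < 64; i++ {
			if v&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}

	var fp uint64
	for i, n := range votes {
		if n > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// distance is the Hamming distance between two fingerprints: the number
// of bit positions in which they differ (0 to 64).
func distance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}
```

Two pages would then be treated as near-duplicates when `distance` falls below a threshold; the right threshold depends on the content and would need to be tuned.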

Analogous to `crawl schema`.

enhancement

If for some reason a site blocks its own sitemap with a robots.txt file, the crawler should respect that and not request the sitemaps in sitemap mode.

bug
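
The check described in this issue could look roughly like the sketch below: before requesting a sitemap, test its path against the site's robots.txt. This is a deliberately simplified illustration, not a full parser. The function name `disallowed` is hypothetical, and it only handles plain-prefix `Disallow` rules in `User-agent: *` groups; `Allow` rules, wildcards, and crawler-specific groups are ignored.

```go
package main

import (
	"bufio"
	"strings"
)

// disallowed reports whether path is blocked for all user agents ("*") by
// the given robots.txt body. Simplified: only plain-prefix Disallow rules
// in "User-agent: *" groups are considered.
func disallowed(robots, path string) bool {
	sc := bufio.NewScanner(strings.NewReader(robots))
	inStar := false    // inside a group that applies to "*"
	lastWasUA := false // previous directive was a User-agent line

	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)

		switch key {
		case "user-agent":
			// Consecutive User-agent lines share one group; any other
			// User-agent line starts a new group.
			if !lastWasUA {
				inStar = false
			}
			if val == "*" {
				inStar = true
			}
			lastWasUA = true
		case "allow", "disallow":
			lastWasUA = false
			// An empty Disallow value blocks nothing.
			if key == "disallow" && inStar && val != "" && strings.HasPrefix(path, val) {
				return true
			}
		}
	}
	return false
}
```

In sitemap mode, the crawler would call something like this with the sitemap's path and skip the request when it returns true.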

The Config file is the most error-prone part of the process from the user's perspective. However, we can't really get around this — there are just a lot of choices...

enhancement