
Implement non-concurrent crawler

Open · let4be opened this issue 4 years ago · 0 comments

For broad web crawling we probably do not need any concurrency within a single job, which means we can save a bunch of resources and annoy site owners less. Additionally, I'm considering using this in a so-called "breeder" - a dedicated non-concurrent web crawler whose purposes are (a rough sketch follows the list):

  • download && parse robots.txt, while
  • resolving redirects
  • resolving additional DNS requests (if any), as long as they fall within the same addr_key, see https://github.com/let4be/crusty-core/issues/14
  • HEAD the index page to figure out if there are any redirects (if allowed by robots.txt)
  • jobs that resolved all DNS (within our restrictions) and successfully HEADed the index page are considered "breeded"
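A minimal sketch of what a single sequential breeding pass could look like, with the networking stubbed out; `Job`, `BreedResult`, `fetch_robots_txt` and friends are hypothetical names for illustration, not crusty-core's actual API:

```rust
// Hypothetical job type; name and fields are illustrative only.
struct Job {
    host: String,
}

enum BreedResult {
    Breeded(Job),  // robots.txt fetched, DNS resolved, index HEAD done
    Rejected(Job), // failed somewhere along the way
}

// Sequential (non-concurrent) breeding pass: the steps run one after
// another, so a single job never has more than one request in flight.
fn breed(mut job: Job) -> BreedResult {
    // 1. download && parse robots.txt, following redirects along the way
    let robots = match fetch_robots_txt(&job.host) {
        Some(r) => r,
        None => return BreedResult::Rejected(job),
    };
    // 2. resolve any additional DNS names encountered, as long as they
    //    fall within the same addr_key (see issue #14)
    if !resolve_within_addr_key(&mut job) {
        return BreedResult::Rejected(job);
    }
    // 3. HEAD the index page to surface redirects, if robots.txt allows it
    if robots.allows("/") && !head_index_page(&job.host) {
        return BreedResult::Rejected(job);
    }
    BreedResult::Breeded(job)
}

// --- stubs standing in for real networking code ---
struct Robots;
impl Robots {
    // stand-in: would consult the parsed robots.txt rules
    fn allows(&self, _path: &str) -> bool { true }
}
fn fetch_robots_txt(_host: &str) -> Option<Robots> { Some(Robots) }
fn resolve_within_addr_key(_job: &mut Job) -> bool { true }
fn head_index_page(_host: &str) -> bool { true }
```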

All jobs extracted from JobQ will be added to a breeder first, and only then (if they survive the breeding process) to a typical web crawler with a StaticDnsResolver. The breeder and the regular web crawler will have quite different rules and settings.
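Under those assumptions the hand-off could look like the sketch below, reusing `Job`/`BreedResult` from the sketch above; JobQ is modeled as a plain queue and `StaticDnsResolver` as a unit struct, neither of which mirrors crusty-core's real types:

```rust
use std::collections::VecDeque;

// Would replay the addresses captured during breeding instead of doing
// live DNS lookups; a placeholder here.
struct StaticDnsResolver;

fn run_pipeline(mut job_q: VecDeque<Job>) {
    while let Some(job) = job_q.pop_front() {
        // every job from JobQ goes through the breeder first...
        match breed(job) {
            // ...and only survivors reach the regular crawler, which
            // skips live DNS via the static resolver
            BreedResult::Breeded(bred) => crawl(bred, &StaticDnsResolver),
            BreedResult::Rejected(_) => {} // dropped: never "breeded"
        }
    }
}

fn crawl(_job: Job, _dns: &StaticDnsResolver) {
    // the regular crawl, with its own rules and settings
}
```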

let4be · Jun 28 '21 21:06