crusty-core
Implement non-concurrent crawler
For broad web crawling we probably do not need any concurrency within a single job, which means we can save a fair amount of resources and annoy site owners less. Additionally I'm considering using this in a so-called "breeder" - a dedicated non-concurrent web crawler whose purposes are:
- download && parse robots.txt, while
- resolving redirects
- resolving additional DNS requests (if any), as long as they fall within the same addr_key, see https://github.com/let4be/crusty-core/issues/14
- HEAD the index page to figure out if there are any redirects (if allowed by robots.txt)

Jobs that resolved all DNS (within our restrictions) and successfully HEADed the index page are considered "breeded" - a rough sketch of this sequential pass follows below.
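A minimal sketch of how such a sequential breeding pass could look. None of the names below (`BreederJob`, `fetch_and_parse_robots`, `resolve_within_addr_key`, `head_index_following_redirects`) are crusty-core API - they are hypothetical placeholders just to illustrate the one-job-at-a-time flow:

```rust
use std::collections::HashSet;
use std::net::{IpAddr, Ipv4Addr};

// Hypothetical job state accumulated during breeding.
struct BreederJob {
    url: String,
    addr_key: String,          // same-addr_key restriction, see issue #14
    resolved: HashSet<IpAddr>, // addresses collected while breeding
    robots_allowed: bool,
    breeded: bool,
}

fn fetch_and_parse_robots(job: &mut BreederJob) -> Result<(), ()> {
    // placeholder: GET /robots.txt, parse it, remember whether crawling is allowed
    job.robots_allowed = true;
    Ok(())
}

fn resolve_within_addr_key(job: &mut BreederJob) -> Result<(), ()> {
    // placeholder: resolve any extra DNS names, rejecting those outside addr_key
    job.resolved.insert(IpAddr::V4(Ipv4Addr::new(127, 0, 0, 1)));
    Ok(())
}

fn head_index_following_redirects(job: &mut BreederJob) -> Result<(), ()> {
    // placeholder: HEAD the index page, following redirects permitted by robots.txt
    let _ = &job.url;
    Ok(())
}

// The breeder is strictly sequential: one job, one step at a time,
// which keeps resource usage low and is gentle on site owners.
fn breed(jobs: &mut [BreederJob]) {
    for job in jobs.iter_mut() {
        if fetch_and_parse_robots(job).is_err() || !job.robots_allowed {
            continue; // disallowed or unreachable - dropped before the main crawler
        }
        if resolve_within_addr_key(job).is_err() {
            continue; // DNS request fell outside the addr_key restriction
        }
        if head_index_following_redirects(job).is_err() {
            continue; // index page could not be HEADed
        }
        job.breeded = true; // survived breeding
    }
}
```

Only jobs that come out of `breed` with `breeded == true` would move on to the next stage.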
All jobs extracted from JobQ will be added to a breeder first, and only then (if they survive the breeding process) to a typical web crawler with a StaticDnsResolver. The breeder and the regular web crawler will have quite different rules and settings.
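Continuing the sketch above (and reusing its hypothetical `BreederJob` / `breed`), the handoff could look roughly like this. The `StaticDnsResolver` constructor and the `crawl_with` entry point shown here are assumptions, not confirmed crusty-core signatures - the point is only that the main crawler reuses the addresses pinned during breeding instead of doing live DNS:

```rust
use std::collections::HashMap;

// Hypothetical static resolver: answers DNS queries only from the
// addresses that were pinned during breeding.
struct StaticDnsResolver {
    table: HashMap<String, Vec<std::net::IpAddr>>,
}

impl StaticDnsResolver {
    fn new(table: HashMap<String, Vec<std::net::IpAddr>>) -> Self {
        Self { table }
    }

    fn resolve(&self, host: &str) -> Option<&[std::net::IpAddr]> {
        self.table.get(host).map(Vec::as_slice)
    }
}

fn crawl_with(_resolver: &StaticDnsResolver, _url: &str) {
    // placeholder for the concurrent, fully featured crawler pass
}

fn run_pipeline(mut jobs: Vec<BreederJob>) {
    // 1. jobs pulled from JobQ go through the breeder first
    breed(&mut jobs);

    // 2. only jobs that survived breeding reach the regular crawler,
    //    each with DNS pinned to what the breeder resolved
    for job in jobs.into_iter().filter(|j| j.breeded) {
        let mut table = HashMap::new();
        table.insert(host_of(&job.url), job.resolved.iter().copied().collect());
        let resolver = StaticDnsResolver::new(table);
        crawl_with(&resolver, &job.url);
    }
}

fn host_of(url: &str) -> String {
    // naive placeholder; a real implementation would use a proper URL parser
    url.trim_start_matches("https://")
        .trim_start_matches("http://")
        .split('/')
        .next()
        .unwrap_or_default()
        .to_string()
}
```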