
Implement non-concurrent crawler

Open · let4be opened this issue 4 years ago · 0 comments

For broad web crawling we probably do not need any concurrency within a single job, which means we can save a bunch of resources and annoy site owners less. Additionally, I'm considering using this in a so-called "breeder" - a dedicated non-concurrent web crawler whose purposes are (a rough sketch follows the list):

  • download && parse robots.txt, while
  • resolving redirects
  • resolving additional DNS requests (if any), as long as they fall within the same addr_key, see https://github.com/let4be/crusty-core/issues/14
  • HEAD the index page to figure out if there are any redirects (if allowed by robots.txt)
  • jobs that resolved all DNS (within our restrictions) and successfully HEADed the index page are considered "breeded"
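A minimal sketch of what a single sequential breeding pass could look like, with the networking stubbed out; `Job`, `BreedResult`, `fetch_robots_txt` and friends are hypothetical names for illustration, not crusty-core's actual API:

```rust
// Hypothetical job type; name and fields are illustrative only.
struct Job {
    host: String,
}

enum BreedResult {
    Breeded(Job),  // robots.txt fetched, DNS resolved, index HEAD done
    Rejected(Job), // failed somewhere along the way
}

// Sequential (non-concurrent) breeding pass: the steps run one after
// another, so a single job never has more than one request in flight.
fn breed(mut job: Job) -> BreedResult {
    // 1. download && parse robots.txt, following redirects along the way
    let robots = match fetch_robots_txt(&job.host) {
        Some(r) => r,
        None => return BreedResult::Rejected(job),
    };
    // 2. resolve any additional DNS names encountered, as long as they
    //    fall within the same addr_key (see issue #14)
    if !resolve_within_addr_key(&mut job) {
        return BreedResult::Rejected(job);
    }
    // 3. HEAD the index page to surface redirects, if robots.txt allows it
    if robots.allows("/") && !head_index_page(&job.host) {
        return BreedResult::Rejected(job);
    }
    BreedResult::Breeded(job)
}

// --- stubs standing in for real networking code ---
struct Robots;
impl Robots {
    // stand-in: would consult the parsed robots.txt rules
    fn allows(&self, _path: &str) -> bool { true }
}
fn fetch_robots_txt(_host: &str) -> Option<Robots> { Some(Robots) }
fn resolve_within_addr_key(_job: &mut Job) -> bool { true }
fn head_index_page(_host: &str) -> bool { true }
```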

All jobs extracted from JobQ will be added to a breeder first, and only then (if they survive the breeding process) to a typical web crawler with a StaticDnsResolver. The breeder and the regular web crawler will have quite different rules and settings.
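Under those assumptions the hand-off could look like the sketch below, reusing `Job`/`BreedResult` from the sketch above; JobQ is modeled as a plain queue and `StaticDnsResolver` as a unit struct, neither of which mirrors crusty-core's real types:

```rust
use std::collections::VecDeque;

// Would replay the addresses captured during breeding instead of doing
// live DNS lookups; a placeholder here.
struct StaticDnsResolver;

fn run_pipeline(mut job_q: VecDeque<Job>) {
    while let Some(job) = job_q.pop_front() {
        // every job from JobQ goes through the breeder first...
        match breed(job) {
            // ...and only survivors reach the regular crawler, which
            // skips live DNS via the static resolver
            BreedResult::Breeded(bred) => crawl(bred, &StaticDnsResolver),
            BreedResult::Rejected(_) => {} // dropped: never "breeded"
        }
    }
}

fn crawl(_job: Job, _dns: &StaticDnsResolver) {
    // the regular crawl, with its own rules and settings
}
```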

let4be · Jun 28 '21 21:06