
[Request] Add option for sleep interval between page crawls to avoid captchas/rate limits


Hello! I'm trying to crawl a huge website that starts asking for captchas after a few hundred pages have been crawled in a short amount of time. Since setting workers=1 is not enough to avoid hitting the captcha "rate limit", I'd like to request an option for specifying a custom sleep interval (e.g. 5 seconds), so that the crawler idles for that amount of time before crawling the next page. youtube-dl has a similar option, and in my experience it has been useful in comparable situations. Thanks!

Fs00 · Mar 24 '22 18:03
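
To illustrate the request, here is a minimal sketch of a per-page sleep in a sequential crawl loop. The names (`crawlPages`, `crawlPage`, `pageSleepSecs`) are hypothetical and not part of browsertrix-crawler's actual code or CLI; this is only what the requested behavior could look like in principle.

```ts
// Hypothetical sketch, not actual browsertrix-crawler code.

// Resolve after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Placeholder standing in for the real per-page crawl logic.
async function crawlPage(url: string): Promise<void> {
  console.log(`crawling ${url}`);
}

// Crawl pages sequentially, pausing pageSleepSecs seconds between pages.
async function crawlPages(urls: string[], pageSleepSecs: number): Promise<void> {
  for (const url of urls) {
    await crawlPage(url);
    if (pageSleepSecs > 0) {
      await sleep(pageSleepSecs * 1000);
    }
  }
}

// Example: crawl with a 5-second pause between pages.
crawlPages(["https://example.com/a", "https://example.com/b"], 5);
```
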

Yeah, that makes sense and is easy to add. Are you thinking it would sleep after every page, or after every N pages?

ikreymer · Mar 24 '22 18:03

I think sleeping after every page should be good enough. With N workers that each sleep after every page, the overall effect is already similar to sleeping after every N pages.

Fs00 · Mar 24 '22 19:03
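
For comparison, here is a sketch of the "sleep after every N pages" variant discussed above; again the names are hypothetical rather than actual browsertrix-crawler APIs. A counter triggers the pause only once every N pages, instead of after each one.

```ts
// Hypothetical sketch, not actual browsertrix-crawler code.
// Crawl sequentially, pausing sleepSecs seconds once every `everyNPages` pages.
async function crawlWithPeriodicSleep(
  urls: string[],
  crawlPage: (url: string) => Promise<void>, // caller-supplied per-page crawl
  sleepSecs: number,
  everyNPages: number,
): Promise<void> {
  let count = 0;
  for (const url of urls) {
    await crawlPage(url);
    count += 1;
    if (count % everyNPages === 0) {
      // Pause only once every N pages.
      await new Promise<void>((resolve) => setTimeout(resolve, sleepSecs * 1000));
    }
  }
}
```
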