grab-site
Enhancement idea: delay/concurrency by regex
It would be handy to have a way to set the delay and concurrency for a job more granularly, for example by a regex on the URL.
For example, if I grab-site http://foo.tld/ and they have a slow server and ban aggressively, I might set concurrency to 1 with a 500-1000ms delay. Now suppose http://foo.tld/ has 200,000 pages, and each of those pages links to 10 files hosted at http://cdn.bar.tld/, which is very fast, doesn't ban, and would be fine with 5 concurrency and a 0ms delay. The crawl will wait an average of 750ms for each of 2,200,000 responses, a little over 19 days spent waiting. If instead the 500-1000ms delay could be set for everything except URLs matching ^http://cdn.bar.tld/.+$, the crawl would wait an average of 750ms for only the 200,000 foo.tld responses (and 0ms for the other 2,000,000), a little under 2 days spent waiting.
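A quick back-of-the-envelope check of those numbers (all figures are the hypothetical ones from the example above, not measurements):

    # Hypothetical crawl from the example: 200,000 slow pages, each
    # linking to 10 fast CDN files, with a 750ms average delay.
    PAGES = 200_000
    FILES_PER_PAGE = 10
    AVG_DELAY = 0.750  # seconds; midpoint of the 500-1000ms range
    SECONDS_PER_DAY = 86_400

    total_responses = PAGES * (1 + FILES_PER_PAGE)  # 2,200,000

    # Flat delay on every response vs. delay on foo.tld responses only.
    flat_days = total_responses * AVG_DELAY / SECONDS_PER_DAY  # ~19.1
    per_host_days = PAGES * AVG_DELAY / SECONDS_PER_DAY        # ~1.7

    print(f"flat delay:     {flat_days:.1f} days waiting")
    print(f"per-host delay: {per_host_days:.1f} days waiting")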
This is a good idea. I think wpull already passes the request into wait_time, but my hooks don't look at the object.
This is now possible: upgrade to grab-site 1.4.0, then use a custom hook like:
    # Keep a reference to grab-site's default wait_time callback.
    wait_time_grabsite = wpull_hook.callbacks.wait_time

    def wait_time(seconds, url_info, record_info, response_info, error_info):
        url = url_info["url"]
        if url.startswith("http://foo.tld/"):
            # Defer to the default, which applies the delay
            # configured for the crawl.
            return wait_time_grabsite(seconds, url_info, record_info, response_info, error_info)
        return 0

    wpull_hook.callbacks.wait_time = wait_time
That uses the configured delay for foo.tld and 0 for everything else.
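To get exactly what the original request describes, the same callback can dispatch on a regex instead of a prefix. A minimal sketch, assuming the slow host should keep the configured delay and anything matching the ^http://cdn.bar.tld/.+$ pattern from the example is fast enough for no delay (wpull_hook is provided by the hook environment, as above):

    import re

    wait_time_grabsite = wpull_hook.callbacks.wait_time

    # URLs matching this pattern (from the example above) get no delay.
    FAST_URL_RE = re.compile(r"^http://cdn\.bar\.tld/.+$")

    def wait_time(seconds, url_info, record_info, response_info, error_info):
        if FAST_URL_RE.match(url_info["url"]):
            return 0
        # Everything else keeps the delay configured for the crawl.
        return wait_time_grabsite(seconds, url_info, record_info, response_info, error_info)

    wpull_hook.callbacks.wait_time = wait_time

The same shape generalizes: the callback could walk a list of (regex, delay) pairs and return the first match's delay, falling back to the default.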
(A hook can also be installed on an existing crawl by overwriting custom_hooks.py, but it will crash the crawl if anything is wrong.)
How do I add custom hooks now??