grab-site
Enhancement idea: delay/concurrency by regex
It would be handy to have a way to set the delay and concurrency for a job more granularly, for example by a regex on the URL.
For example, if I grab-site http://foo.tld/ and they have a slow server and ban aggressively, I might set concurrency to 1 with a 500-1000ms delay. Now suppose http://foo.tld/ has 200,000 pages, and each of those pages links to 10 files hosted at http://cdn.bar.tld/, which is very fast, doesn't ban, and would be fine with 5 concurrency and a 0ms delay. The crawl will wait an average of 750ms for each of 2,200,000 responses, a little over 19 days spent waiting. If instead the 500-1000ms delay could be set for everything except URLs matching ^http://cdn.bar.tld/.+$, the crawl would wait an average of 750ms for only the 200,000 foo.tld responses (and 0ms for the other 2,000,000), a little under 2 days spent waiting.
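A quick back-of-the-envelope check of those numbers (all figures are the hypothetical ones from the example above, not measurements):

    # Hypothetical crawl from the example: 200,000 slow pages, each
    # linking to 10 fast CDN files, with a 750ms average delay.
    PAGES = 200_000
    FILES_PER_PAGE = 10
    AVG_DELAY = 0.750  # seconds; midpoint of the 500-1000ms range
    SECONDS_PER_DAY = 86_400

    total_responses = PAGES * (1 + FILES_PER_PAGE)  # 2,200,000

    # Flat delay on every response vs. delay on foo.tld responses only.
    flat_days = total_responses * AVG_DELAY / SECONDS_PER_DAY  # ~19.1
    per_host_days = PAGES * AVG_DELAY / SECONDS_PER_DAY        # ~1.7

    print(f"flat delay:     {flat_days:.1f} days waiting")
    print(f"per-host delay: {per_host_days:.1f} days waiting")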
This is a good idea. I think wpull already passes the request into wait_time, but my hooks don't look at the object.
This is now possible: upgrade to grab-site 1.4.0, then use a custom hook like:
    # Keep a reference to grab-site's default wait_time callback.
    wait_time_grabsite = wpull_hook.callbacks.wait_time

    def wait_time(seconds, url_info, record_info, response_info, error_info):
        url = url_info["url"]
        if url.startswith("http://foo.tld/"):
            # Defer to the default, which applies the delay
            # configured for the crawl.
            return wait_time_grabsite(seconds, url_info, record_info, response_info, error_info)
        return 0

    wpull_hook.callbacks.wait_time = wait_time
That uses the configured delay for foo.tld and 0 for everything else.
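To get exactly what the original request describes, the same callback can dispatch on a regex instead of a prefix. A minimal sketch, assuming the slow host should keep the configured delay and anything matching the ^http://cdn.bar.tld/.+$ pattern from the example is fast enough for no delay (wpull_hook is provided by the hook environment, as above):

    import re

    wait_time_grabsite = wpull_hook.callbacks.wait_time

    # URLs matching this pattern (from the example above) get no delay.
    FAST_URL_RE = re.compile(r"^http://cdn\.bar\.tld/.+$")

    def wait_time(seconds, url_info, record_info, response_info, error_info):
        if FAST_URL_RE.match(url_info["url"]):
            return 0
        # Everything else keeps the delay configured for the crawl.
        return wait_time_grabsite(seconds, url_info, record_info, response_info, error_info)

    wpull_hook.callbacks.wait_time = wait_time

The same shape generalizes: the callback could walk a list of (regex, delay) pairs and return the first match's delay, falling back to the default.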
(A hook can also be installed on an existing crawl by overwriting custom_hooks.py, but it will crash the crawl if anything is wrong.)
How do I add custom hooks now??