scrapy-selenium
does not respect DOWNLOAD_DELAY
The request is not passed to the Scrapy downloader, where DOWNLOAD_DELAY is handled, and there is no way to set a delay parameter within this middleware. Would it be possible to add similar support?
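For illustration, a minimal sketch of the symptom (spider name and URLs are placeholders): even with DOWNLOAD_DELAY set, consecutive SeleniumRequests are fetched back to back, because the middleware returns the response directly from process_request and the request never reaches the Scrapy downloader.

import scrapy
from scrapy_selenium import SeleniumRequest


class ExampleSpider(scrapy.Spider):
    name = "example"
    custom_settings = {"DOWNLOAD_DELAY": 3}  # effectively ignored for SeleniumRequest

    def start_requests(self):
        for url in ["https://example.com/page/1", "https://example.com/page/2"]:
            # Both pages are loaded by the driver with no pause in between.
            yield SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}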
I came across this issue as well and found a workaround. (I think it's more of a hack than anything, so not sure if it's a good move to put in a PR.)
Basically, I added a sleep_time parameter to SeleniumRequest in http.py and then used that value with time.sleep() in process_request in middlewares.py.
http.py:
from scrapy import Request


class SeleniumRequest(Request):
    def __init__(self, wait_time=None, wait_until=None, screenshot=False, script=None, sleep_time=None, *args, **kwargs):
        self.wait_time = wait_time
        self.wait_until = wait_until
        self.screenshot = screenshot
        self.script = script
        self.sleep_time = sleep_time  # new: per-request delay in seconds
        super().__init__(*args, **kwargs)
middlewares.py:
Add import time to the top of the file, and add the following to the process_request method:

if request.sleep_time:
    time.sleep(request.sleep_time)
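With both patches in place, a request can then carry its own delay; a small usage sketch (URL and callback are placeholders):

yield SeleniumRequest(
    url="https://example.com/page/1",
    callback=self.parse,
    sleep_time=2,  # the patched middleware blocks for 2 seconds in process_request
)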
You can also extend SeleniumMiddleware:
import random
import time

from scrapy_selenium import SeleniumRequest, SeleniumMiddleware


class BotDownloaderMiddleware(SeleniumMiddleware):
    def process_request(self, request, spider):
        if isinstance(request, SeleniumRequest):
            delay = spider.settings.getint('DOWNLOAD_DELAY')
            randomize_delay = spider.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY')
            if delay:
                if randomize_delay:
                    delay = random.uniform(0.5 * delay, 1.5 * delay)
                time.sleep(delay)
        return super().process_request(request, spider)
and add this to your spider:
import scrapy


class NotebooksSpider(scrapy.Spider):
    name = "notebooks"
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'DOWNLOADER_MIDDLEWARES': {
            'bot.middlewares.BotDownloaderMiddleware': 800,
            # 'scrapy_selenium.SeleniumMiddleware': 800
        },
        'CONCURRENT_REQUESTS': 1,
        'ITEM_PIPELINES': {
            'bot.pipelines.DjangoSavePipeline': 300,
        }
    }
I think respecting Scrapy config values should be the default behaviour.
For me, the current behaviour is unexpected.
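As an illustration of what that might look like, here is a hedged sketch (the class name is hypothetical and not part of scrapy-selenium) where the middleware reads DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY itself in from_crawler and sleeps before handing the request to the driver:

import random
import time

from scrapy_selenium import SeleniumRequest, SeleniumMiddleware


class DelayAwareSeleniumMiddleware(SeleniumMiddleware):
    # Illustrative only: how the middleware could honour Scrapy's delay settings by default.
    @classmethod
    def from_crawler(cls, crawler):
        middleware = super().from_crawler(crawler)
        middleware.delay = crawler.settings.getfloat('DOWNLOAD_DELAY')
        middleware.randomize = crawler.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY')
        return middleware

    def process_request(self, request, spider):
        if isinstance(request, SeleniumRequest) and self.delay:
            delay = self.delay
            if self.randomize:
                delay = random.uniform(0.5 * delay, 1.5 * delay)
            time.sleep(delay)
        return super().process_request(request, spider)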