
does not respect DOWNLOAD_DELAY

Open misssprite opened this issue 5 years ago • 3 comments

The request is not passed to the Scrapy downloader, where DOWNLOAD_DELAY is handled, and there is no way to set a delay parameter within this middleware. Is it possible to add similar support?
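
For context, a minimal settings sketch of the setup this refers to (values are placeholders): DOWNLOAD_DELAY is enforced by Scrapy's downloader, but SeleniumRequest objects are answered directly by the middleware's process_request and never reach it, so the delay is skipped for them.

# settings.py (sketch)
DOWNLOAD_DELAY = 3  # throttles ordinary requests, but not those handled by SeleniumMiddleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}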

misssprite (Mar 22 '19)

I came across this issue as well and found a workaround. (I think it's more of a hack than anything, so not sure if it's a good move to put in a PR.)

Basically, I added a sleep_time parameter to http.py and then used that value with time.sleep() in process_request in middlewares.py.

http.py:

from scrapy import Request


class SeleniumRequest(Request):
    def __init__(self, wait_time=None, wait_until=None, screenshot=False, script=None, sleep_time=None, *args, **kwargs):
        self.wait_time = wait_time
        self.wait_until = wait_until
        self.screenshot = screenshot
        self.script = script
        self.sleep_time = sleep_time  # seconds to sleep in process_request before handling this request

        super().__init__(*args, **kwargs)

middlewares.py:

Add import time at the top of the file, and add the following to the process_request method:

if request.sleep_time:
    time.sleep(request.sleep_time)
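
Assuming the patched http.py and middlewares.py above are the copies your project imports, a usage sketch (the spider name and URL are placeholders):

import scrapy
from scrapy_selenium import SeleniumRequest  # the patched class from http.py above


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # sleep_time is picked up by the patched process_request
        yield SeleniumRequest(url="https://example.com", callback=self.parse, sleep_time=2)

    def parse(self, response):
        self.logger.info(response.url)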

oehrlein (May 30 '19)

You can also extend SeleniumMiddleware:

import random
import time

from scrapy_selenium import SeleniumMiddleware, SeleniumRequest


class BotDownloaderMiddleware(SeleniumMiddleware):
    def process_request(self, request, spider):
        if isinstance(request, SeleniumRequest):
            # getint works for whole-second delays; use getfloat for fractional ones
            delay = spider.settings.getint('DOWNLOAD_DELAY')
            randomize_delay = spider.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY')
            if delay:
                if randomize_delay:
                    # same 0.5x-1.5x spread Scrapy applies to its own download slots
                    delay = random.uniform(0.5 * delay, 1.5 * delay)
                time.sleep(delay)
        return super().process_request(request, spider)

and add it to your spider:

class NotebooksSpider(scrapy.Spider):
    name = "notebooks"
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'DOWNLOADER_MIDDLEWARES': {
            'bot.middlewares.BotDownloaderMiddleware': 800,
            # 'scrapy_selenium.SeleniumMiddleware': 800  # disabled: replaced by the subclass above
        },
        'CONCURRENT_REQUESTS': 1,
        'ITEM_PIPELINES': {
            'bot.pipelines.DjangoSavePipeline': 300,
        }
    }
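
For completeness, a sketch of how that spider might issue its requests (the URL is a placeholder). Only SeleniumRequest objects hit the isinstance check in BotDownloaderMiddleware; a plain scrapy.Request skips the sleep and is throttled by Scrapy's downloader as usual.

import scrapy
from scrapy_selenium import SeleniumRequest


class NotebooksSpider(scrapy.Spider):
    name = "notebooks"
    # custom_settings as shown above

    def start_requests(self):
        # delayed by BotDownloaderMiddleware because it is a SeleniumRequest
        yield SeleniumRequest(url="https://example.com", callback=self.parse)

    def parse(self, response):
        pass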

cherijs (Aug 05 '20)

I think respecting Scrapy config values should be the default behaviour.

For me, the current behaviour is unexpected.

tristanlatr (Aug 26 '20)