
Run Scrapy with more than 1 browser.

Open borys25ol opened this issue 3 years ago • 9 comments

Hello,

Are there any ideas on how to modify the Selenium middleware so that ongoing requests are spread across several browser windows?

borys25ol avatar Sep 27 '20 19:09 borys25ol

I am also curious about this since concurrency is really important for my app.

Tobeyforce avatar Sep 28 '20 16:09 Tobeyforce

I believe there is a closed ticket about this here:

https://github.com/clemfromspace/scrapy-selenium/issues/13

MapsGraphsCharts avatar Dec 29 '20 15:12 MapsGraphsCharts

@borys25ol one possible way is to subclass the middleware and configure it statically. Not the prettiest solution, but it should be straightforward to implement.

from shutil import which

from scrapy import signals
from scrapy_selenium import SeleniumMiddleware


class MyMiddleware1(SeleniumMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        # Configure this middleware statically instead of reading the
        # SELENIUM_* settings, so each subclass controls its own browser.
        middleware = cls(
            driver_name='firefox',
            driver_executable_path=which('geckodriver'),
            driver_arguments=[],
            browser_executable_path=None,  # let geckodriver find Firefox on PATH
        )

        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)

        return middleware


class MyMiddleware2(SeleniumMiddleware):
    # ... same idea, e.g. with driver_name='chrome'
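
To connect a subclass to a specific spider, one option is to enable it via custom_settings instead of the global scrapy_selenium.SeleniumMiddleware entry. A rough sketch (the 'myproject.middlewares' path and the 800 priority are placeholders):

import scrapy
from scrapy_selenium import SeleniumRequest


class FirefoxSpider(scrapy.Spider):
    name = 'firefox_spider'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Placeholder path -- point it at wherever MyMiddleware1 lives.
            'myproject.middlewares.MyMiddleware1': 800,
        },
    }

    def start_requests(self):
        yield SeleniumRequest(url='https://example.com', callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('title::text').get()}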

0b11001111 avatar Feb 22 '21 07:02 0b11001111

I've figured out a working solution to this issue that fits my needs, but it's a bit involved (it requires an async driver pool). If this project is still being maintained, I'd be down to submit a PR for it if I have some free time and there's still some interest.

Because Scrapy is built on Twisted, I found that the key to this is that the middleware's process_request() method can also return a twisted.internet.defer.Deferred that fires with a response.
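
A stripped-down illustration of just that mechanism (not my actual middleware; the class name and the spider.driver attribute are made up): the blocking Selenium call is pushed onto a thread with deferToThread(), and the resulting Deferred is returned from process_request().

from scrapy.http import HtmlResponse
from twisted.internet.threads import deferToThread


class DeferredSeleniumMiddleware:

    def process_request(self, request, spider):
        # Hand the blocking Selenium call to a thread; Scrapy waits on the
        # returned Deferred and uses the HtmlResponse it eventually fires with.
        return deferToThread(self._fetch, request, spider)

    def _fetch(self, request, spider):
        driver = spider.driver  # made-up attribute: any WebDriver instance
        driver.get(request.url)
        return HtmlResponse(
            driver.current_url,
            body=driver.page_source,
            encoding='utf-8',
            request=request,
        )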

Flushot avatar Apr 23 '21 02:04 Flushot

Hi @Flushot!! I'm facing the same problem and I'm interested in your solution. How did you do it?

Thanks a lot!

Andreu Jové

AndreuJove avatar Jul 05 '21 15:07 AndreuJove

@AndreuJove

The gist of it is that because process_request() can either return a standard response object or a twisted deferred (and because scrapy is itself built on twisted), the handling of downloads can be done in an asynchronous way. This opens up an opportunity for the downloader middleware to manage a pool of drivers asynchronously (and allows for concurrent requests to be sent to that pool).

The code I wrote has deviated from the version in this repo quite a bit, so I may either fork or try to find time to re-integrate. Here's a high level overview:

  • Request is made by the spider.
  • Downloader middleware handles it with process_request() (which will ultimately handle the request asynchronously by returning a deferred instead of a response object).
  • Downloader middleware attempts to check driver out of a fixed-size async driver pool.
  • Driver pool waits on a semaphore and either tries to reuse a previously checked in driver, or starts a new webdriver.Remote or a local webdriver.Firefox/webdriver.Chrome session depending on whether Selenium Grid is being used (which I registered under a new config key called SELENIUM_HUB_URL).
  • Once a driver is available/allocated, the deferred is resolved.
    • If anything fails up to this point, the deferred may be rejected.
    • If the failure was temporary (e.g. timeout), the deferred is resolved with another request (with priority lowered), so that Scrapy will re-schedule a retry.
  • When the request is finished being processed (or an exception was raised), the following will happen:
    • Another spider middleware will check the driver back into the pool (via process_spider_output or process_spider_exception respectively) so that other pending requests can use it again.
      • Note: If this extra middleware doesn't handle the finalization (and the deferred is immediately resolved by the downloader middleware), the spider won't be able to interact with the request.meta['driver'] object reliably.
    • Depending on the configured policy, when the driver is checked back into the pool it may be reused for another request (after being "cleared" by navigating to about:blank), or it is quit() so that the next request causes a new allocation.
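
Very roughly, the moving parts look something like this. This is an untested sketch, not the private implementation: drivers are pre-created instead of allocated lazily, retries and error handling are omitted, and driver.get() blocks the reactor here to keep it short.

from scrapy.http import HtmlResponse
from selenium import webdriver
from twisted.internet import defer


class DriverPool:
    """Fixed-size pool: acquire() returns a Deferred that fires with a driver."""

    def __init__(self, size=4, hub_url=None):
        # A DeferredQueue pre-filled with drivers; get() returns a Deferred
        # that fires as soon as a driver is checked back in.
        self._queue = defer.DeferredQueue()
        for _ in range(size):
            self._queue.put(self._make_driver(hub_url))

    def _make_driver(self, hub_url):
        if hub_url:  # e.g. the SELENIUM_HUB_URL setting mentioned above
            return webdriver.Remote(command_executor=hub_url,
                                    options=webdriver.FirefoxOptions())
        return webdriver.Firefox()

    def acquire(self):
        return self._queue.get()

    def release(self, driver):
        driver.get('about:blank')  # "clear" the driver before reuse
        self._queue.put(driver)


class PooledSeleniumMiddleware:
    """Downloader-middleware side: check a driver out, fetch, return a Deferred."""

    def __init__(self, pool):
        self.pool = pool

    @defer.inlineCallbacks
    def process_request(self, request, spider):
        driver = yield self.pool.acquire()
        request.meta['driver'] = driver  # so the spider can still use it
        driver.get(request.url)          # simplification: this blocks the reactor
        defer.returnValue(HtmlResponse(
            driver.current_url,
            body=driver.page_source,
            encoding='utf-8',
            request=request,
        ))


class DriverCheckinMiddleware:
    """Spider-middleware side: return the driver to the pool only after the
    spider has finished with the response (see the note above)."""

    def __init__(self, pool):
        self.pool = pool

    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            yield item_or_request
        driver = response.meta.pop('driver', None)
        if driver is not None:
            self.pool.release(driver)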

Hopefully that clears things up.

Flushot avatar Jul 07 '21 00:07 Flushot

Dear Flushot,

Thank you for your explanation. Do you have this code in a repository? I think it would be easier for me to understand.

Thanks a lot again,

Andreu

AndreuJove avatar Jul 07 '21 10:07 AndreuJove

Unfortunately I don't yet. The code I have is private (and is coupled to private libraries). I'd be down to fork and integrate my changes when I get some free time.

Flushot avatar Jul 09 '21 04:07 Flushot

@borys25ol one possible way is to subclass the middleware and configure it statically. Not the prettiest solution, but it should be straightforward to implement.

from shutil import which

from scrapy import signals
from scrapy_selenium import SeleniumMiddleware


class MyMiddleware1(SeleniumMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        # Configure this middleware statically instead of reading the
        # SELENIUM_* settings, so each subclass controls its own browser.
        middleware = cls(
            driver_name='firefox',
            driver_executable_path=which('geckodriver'),
            driver_arguments=[],
            browser_executable_path=None,  # let geckodriver find Firefox on PATH
        )

        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)

        return middleware


class MyMiddleware2(SeleniumMiddleware):
    # ... same idea, e.g. with driver_name='chrome'

How do you connect the new middleware classes with the spider?

vionwinnie avatar Jul 28 '22 17:07 vionwinnie