scrapy-selenium
Run Scrapy with more than 1 browser.
Hello,
Are there any ideas on how to modify the Selenium middleware so that it distributes ongoing requests across several browser windows?
I am also curious about this since concurrency is really important for my app.
I believe there is a closed ticket about this here:
https://github.com/clemfromspace/scrapy-selenium/issues/13
@borys25ol one possible way is by subclassing the middleware and statically configuring it. Not the prettiest solution, but it should be straightforward to implement.
```python
from shutil import which

from scrapy import signals
from scrapy_selenium import SeleniumMiddleware


class MyMiddleware1(SeleniumMiddleware):
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(
            driver_name='firefox',
            driver_executable_path=which('geckodriver'),
            driver_arguments=[],
        )
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware


class MyMiddleware2(SeleniumMiddleware):
    # ...
```
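You'd then enable the subclasses in your project settings instead of the stock middleware. A minimal sketch, assuming the classes live in a (placeholder) `myproject.middlewares` module; note that with both enabled, each subclass would also need some rule of its own (e.g. a `request.meta` flag) to decide which requests it handles, which is left out here:

```python
# settings.py
# The module path is a placeholder; point it at wherever the subclasses live.
# Lower priority numbers run earlier in the download chain.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware1': 800,
    'myproject.middlewares.MyMiddleware2': 801,
}
```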
I've figured out a working solution to this issue that fits my needs, but it's a bit involved (it requires an async driver pool). If this project is still being maintained, I'd be down to submit a PR for it if I have some free time and there's still some interest.
Because Scrapy uses Twisted, I found the key to this is that the middleware's `process_request()` method can also return a `twisted.internet.defer.Deferred` with a response in its callback argument.
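As a minimal illustration of that mechanism (a sketch, not the code from my actual solution): the blocking Selenium fetch can be pushed onto Twisted's thread pool with `deferToThread`, and the resulting deferred returned directly from `process_request()`:

```python
from scrapy.http import HtmlResponse
from twisted.internet import threads


class DeferredSeleniumMiddleware:
    """Sketch: resolve requests via a Deferred instead of blocking the reactor."""

    def __init__(self, driver):
        # A single shared driver is enough to show the mechanism, but it is
        # not safe for concurrent requests; that's what the driver pool is for.
        self.driver = driver

    def process_request(self, request, spider):
        # deferToThread runs the blocking fetch in a thread pool and returns
        # a Deferred that fires with the fetch's return value (the response).
        return threads.deferToThread(self._fetch, request)

    def _fetch(self, request):
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )
```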
Hi @Flushot!! I'm facing the same problem and I'm interested in your solution. How did you do it?
Thanks a lot!
Andreu Jové
@AndreuJove
The gist of it is that because `process_request()` can either return a standard response object or a Twisted deferred (and because Scrapy is itself built on Twisted), the handling of downloads can be done in an asynchronous way. This opens up an opportunity for the downloader middleware to manage a pool of drivers asynchronously (and allows for concurrent requests to be sent to that pool).
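To make that concrete, here's a rough sketch of what such a pool could look like, built on `twisted.internet.defer.DeferredSemaphore` (the names are illustrative, not the exact code I wrote, and it only shows the local-driver case):

```python
from selenium import webdriver
from twisted.internet import defer, threads


class DriverPool:
    """Illustrative fixed-size async driver pool."""

    def __init__(self, size=4):
        self._semaphore = defer.DeferredSemaphore(size)
        self._idle = []  # drivers that were checked back in and can be reused

    @defer.inlineCallbacks
    def checkout(self):
        # Wait until one of the `size` slots is free.
        yield self._semaphore.acquire()
        if self._idle:
            defer.returnValue(self._idle.pop())
        # Starting a browser blocks, so do it off the reactor thread.
        driver = yield threads.deferToThread(webdriver.Firefox)
        defer.returnValue(driver)

    def checkin(self, driver, reuse=True):
        if reuse:
            # "Clear" the session for the next request (blocking call kept
            # inline here for brevity).
            driver.get('about:blank')
            self._idle.append(driver)
        else:
            driver.quit()  # next checkout will allocate a fresh driver
        self._semaphore.release()
```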
The code I wrote has deviated from the version in this repo quite a bit, so I may either fork or try to find time to re-integrate. Here's a high level overview:
- Request is made by the spider.
- Downloader middleware handles it with `process_request()` (which will ultimately handle the request asynchronously by returning a deferred instead of a response object).
- Downloader middleware attempts to check a driver out of a fixed-size async driver pool.
- Driver pool waits on a semaphore and either tries to reuse a previously checked-in driver, or starts a new `webdriver.Remote` or a local `webdriver.Firefox`/`webdriver.Chrome` session, depending on whether Selenium Grid is being used (which I registered under a new config key called `SELENIUM_HUB_URL`).
- Once a driver is available/allocated, the deferred is resolved.
  - If anything fails up to this point, the deferred may be rejected.
  - If the failure was temporary (e.g. a timeout), the deferred is resolved with another request (with priority lowered), so that Scrapy will re-schedule a retry.
- When the request is finished being processed (or an exception was raised), the following will happen:
  - Another spider middleware will check the driver back into the pool (via `process_spider_output` or `process_spider_exception`, respectively) so that other pending requests can use it again; a sketch of this middleware follows the list.
    - Note: if this extra middleware doesn't handle the finalization (and the deferred is immediately resolved by the downloader middleware), the spider won't be able to interact with the `request.meta['driver']` object reliably.
  - Depending on configured policy: when the driver is checked back into the pool, it may be reused for another request (and is "cleared" by navigating to `about:blank`), or the driver is `quit()` so that the next request will cause a new allocation.
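And a sketch of that check-in middleware (again with illustrative names; it assumes the downloader middleware stashed the checked-out driver in `request.meta['driver']` and shares the same pool instance):

```python
class DriverCheckinMiddleware:
    """Sketch: spider middleware that returns drivers to the shared pool."""

    def __init__(self, pool):
        self.pool = pool  # the shared DriverPool from the downloader middleware

    def process_spider_output(self, response, result, spider):
        # Pass the spider's output through untouched...
        for item in result:
            yield item
        # ...then, once the spider is done with the response, reuse the driver.
        driver = response.meta.get('driver')
        if driver is not None:
            self.pool.checkin(driver, reuse=True)

    def process_spider_exception(self, response, exception, spider):
        # On failure, discard the driver so the next checkout starts fresh.
        driver = response.meta.get('driver')
        if driver is not None:
            self.pool.checkin(driver, reuse=False)
```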
Hopefully that clears things up.
Dear Flushot,
Thank you for your explanation. Do you have this code in a repository? I think that would make it easier for me to understand.
Thanks a lot again,
Andreu
Unfortunately I don't yet. The code I have is private (and is coupled to private libraries). I'd be down to fork and integrate my changes when I get some free time.
How do you connect the new middleware classes (from the subclassing suggestion above) with the spider?