External webdriver outside Requestium and Selenium Wire

Open vladiscripts opened this issue 2 years ago • 5 comments

Recent commits added an interesting feature: using a Selenium webdriver created outside Requestium. But if I want to use my own webdriver_options with it, I still have to do cumbersome acrobatics in my scripts, completely replacing the _start_chrome_browser() and _start_chrome_headless_browser() methods.

It seems to me that it would be sufficient to simply replace the webdriver.Chrome dependency in the RequestiumChrome class definition. That is, in my script I have now written:

import requestium
import seleniumwire.webdriver    # I want to use this webdriver

# Here I replaced the common `webdriver` with `seleniumwire.webdriver`
class RequestiumChrome(requestium.requestium.DriverMixin, seleniumwire.webdriver.Chrome):
    pass

# Patch the module so that Session picks up the replacement class
requestium.requestium.RequestiumChrome = RequestiumChrome

# Then I create a regular Requestium session (inside my own class, hence `self`)
self.s = requestium.Session(
    webdriver_path='chromedriver', browser='chrome', default_timeout=60,
    webdriver_options={
        'arguments': [
            '--start-maximized',
            '--window-size=1200,1000',
        ],
        # 'binary_location': "/usr/bin/google-chrome"
    })

But this is also a cumbersome construct in my code.

So I think the external webdriver can be added to requestium.py with something like:

webdriver = driver_class   # could be assigned via an initialization argument
class RequestiumChrome(DriverMixin, webdriver):
    pass
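
A minimal sketch of that idea, assuming the base class is passed in at runtime. FakeChrome, make_requestium_driver and the describe() helper are hypothetical stand-ins so the snippet runs without Selenium installed; the DriverMixin here is a simplified placeholder, not Requestium's real one.

```python
class DriverMixin:
    """Placeholder for requestium's DriverMixin (simplified, hypothetical helper)."""

    def describe(self):
        return f"{type(self).__name__} with requestium helpers"


class FakeChrome:
    """Stand-in for webdriver.Chrome or seleniumwire.webdriver.Chrome."""

    def __init__(self, **options):
        self.options = options


def make_requestium_driver(driver_class):
    """Build a subclass that combines DriverMixin with the given driver class."""
    name = f"Requestium{driver_class.__name__}"
    # type(name, bases, namespace) creates the class at runtime,
    # so the webdriver base no longer has to be hard-coded.
    return type(name, (DriverMixin, driver_class), {})


RequestiumChrome = make_requestium_driver(FakeChrome)
driver = RequestiumChrome(arguments=["--start-maximized"])
print(driver.describe())
```

With the real libraries, the same factory could presumably be called with seleniumwire.webdriver.Chrome instead of FakeChrome, but I have not tested that.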

vladiscripts avatar Jul 14 '22 16:07 vladiscripts

Maybe add support for https://github.com/wkeeling/selenium-wire to Requestium? Or even replace selenium.webdriver with seleniumwire.webdriver?

Selenium-wire is interesting in particular because it can capture Fetch/XHR requests. Those requests are the reason we have to use Selenium instead of the usual Python requests library on sites with dynamic page rendering.

For example, this is how I get the content of an XHR request on some page (example from https://github.com/wkeeling/selenium-wire) :

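import json
from seleniumwire.utils import decode

# `driver` is a seleniumwire webdriver that has already loaded the page
for r in driver.requests:
    if r.path == '/siteendpoint' and r.params.get('listing') == 'products':
        response = r.response
        # Undo any Content-Encoding (gzip, br, ...) before parsing the JSON body
        body = decode(response.body, response.headers.get('Content-Encoding', 'identity'))
        j = json.loads(body)
        break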

There is also support for undetected-chromedriver (https://github.com/wkeeling/selenium-wire#bot-detection), which likewise initializes the webdriver from a different class and uses a different ChromeOptions() class. This relates to the question above.

vladiscripts avatar Jul 14 '22 17:07 vladiscripts

I hadn't heard of selenium-wire before, pretty nifty package, I'll have to check it out.

I don't have time to make major changes to requestium right now, but if you wanted to make a PR to swap it in, I'd be open to taking a look.

lordjabez avatar Jul 24 '22 04:07 lordjabez

We need to decide what changes to make. It depends on your requestium policy.

  • Completely replace from selenium import webdriver with from seleniumwire import webdriver. The selenium-wire project looks solid; it is 4 years old. https://github.com/wkeeling/selenium-wire/graphs/contributors
  • Or add an initialization argument such as requestium.Session(driver_class=[uninstantiated webdriver class here]). Perhaps this would also allow using seleniumwire.undetected-chromedriver.webdriver .
  • Or add two True/False switch arguments: requestium.Session(use_seleniumwire_webdriver=True, use_undetected_chromedriver=False).

vladiscripts avatar Jul 24 '22 20:07 vladiscripts

I lean towards the second of those options.

lordjabez avatar Jul 24 '22 20:07 lordjabez

I have opened pull requests. But while looking at the code, I noticed some accumulated problems.

  1. Webdriver initialization should be separated from the Session class. The Session class does not need hidden webdriver initialization methods such as _start_phantomjs_browser() and others, nor the code block that chooses which method creates the driver. With that out of the way, the way the class is created could be streamlined: with driver, with driver_class, or just with browser arguments. Right now it is "spaghetti code".

  2. While separating these methods, I found that the _start_phantomjs_browser() method uses the undefined properties self.proxies and self.headers. The Session class help says "Header and proxy transfer is done only one time when the driver process starts." However, the __init__() method has no such arguments. So should an exception be raised on an attempt to read these uninitialized properties?

  3. A driver argument was added, and the README.md example for it specifies a Firefox browser driver. However, does Requestium work with Firefox? Can it pass cookies between requests <=> webdriver, for example? The README.md says: "Features: Supports Chrome and PhantomJS."

  4. Requestium's DriverMixin class is nice: it adds very useful helper methods to the webdriver. In theory (I haven't tested) they should work in Firefox and other browsers supported by Selenium. Therefore it would be good to extract from the Session class the method that attaches this class to an externally initialized webdriver. (This is related to points 1 and 3.)
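
To illustrate point 1, here is a rough sketch of driver creation pulled out of Session into a standalone function that accepts a ready driver, a driver_class, or a browser name. The names create_driver, BROWSER_CLASSES, FakeChrome and FakePhantomJS are hypothetical, and the fake classes only stand in for real webdriver classes so that the snippet runs on its own.

```python
class FakeChrome:
    """Stand-in for a Chrome webdriver class."""

    def __init__(self, **options):
        self.options = options


class FakePhantomJS:
    """Stand-in for a PhantomJS webdriver class."""

    def __init__(self, **options):
        self.options = options


# Hypothetical mapping from the `browser` argument to a driver class.
BROWSER_CLASSES = {"chrome": FakeChrome, "phantomjs": FakePhantomJS}


def create_driver(driver=None, driver_class=None, browser=None, **options):
    """Return a driver, preferring driver, then driver_class, then browser."""
    if driver is not None:
        # An externally initialized driver is used as is.
        return driver
    if driver_class is None:
        if browser is None:
            raise ValueError("Provide one of: driver, driver_class, browser")
        try:
            driver_class = BROWSER_CLASSES[browser.lower()]
        except KeyError:
            raise ValueError(f"Unsupported browser: {browser!r}") from None
    return driver_class(**options)


existing = FakeChrome()
print(create_driver(driver=existing) is existing)
print(type(create_driver(driver_class=FakePhantomJS)).__name__)
print(type(create_driver(browser="Chrome", headless=True)).__name__)
```

The precedence order here is only one possible choice; raising an error when more than one of the three arguments is given would be an equally reasonable policy.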

I think there should be a class like RequestiumWebdriver that takes over the initialization of the webdriver, or uses a provided webdriver. I don't know what to do with self.proxies and self.headers for PhantomJS; they can be left as is, since they are broken anyway. There should also be a RequestiumBrowser class that combines the resulting webdriver with DriverMixin. Moreover, the methods from DriverMixin should be attached the way it is now done in Session.__init__(), rather than through inheritance as in the RequestiumChrome and RequestiumPhantomJS classes. Those classes should then be deleted.
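
Attaching mixin methods to an already created driver instance, instead of inheriting from the mixin, could look roughly like this. It is a simplified sketch and not the actual code in Session.__init__(); attach_mixin, ExternalDriver and the two helper methods are hypothetical names used only for illustration.

```python
import types


class DriverMixin:
    """Placeholder mixin with two hypothetical helper methods."""

    def page_summary(self):
        return f"page: {self.title}"

    def quit(self):
        return "mixin quit"


class ExternalDriver:
    """Stand-in for a webdriver that was initialized outside Requestium."""

    def __init__(self, title):
        self.title = title

    def quit(self):
        return "driver quit"


def attach_mixin(driver, mixin=DriverMixin):
    """Bind the mixin's public functions to the driver instance."""
    for name, attr in vars(mixin).items():
        if not isinstance(attr, types.FunctionType):
            continue  # skip non-function attributes such as __doc__
        if name.startswith("__") and name.endswith("__"):
            continue  # skip dunder methods
        if hasattr(driver, name):
            continue  # never override what the driver already provides
        setattr(driver, name, types.MethodType(attr, driver))
    return driver


driver = attach_mixin(ExternalDriver("Example"))
print(driver.page_summary())
print(driver.quit())
```

Because the methods are bound per instance, the driver's own class stays untouched, which is why this approach should work for any driver class and not only for Chrome or PhantomJS.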

Looking at the current code, the concept of Requestium and how it will develop further is not clear, and therefore neither is whether it can be used in projects that go beyond "draft scraps for personal use only".

vladiscripts avatar Jul 27 '22 18:07 vladiscripts

Closed by https://github.com/tryolabs/requestium/pull/59

lordjabez avatar Sep 11 '22 04:09 lordjabez