Adopt WebDriverAbstract as a solution for active (JavaScript) websites
Hello everyone,
A few weeks ago I came across the well-known problem that rss-bridge doesn't work for some websites. This is always the case when the website loads some content via XMLHttpRequest (XHR) and/or changes it using JavaScript. After some experimentation and thinking about the problem, I came to the conclusion that it can only be solved using a web browser. Coincidentally, I already had a lot of experience with Selenium and the WebDriver, so I decided to come up with a solution.
I now suggest importing the WebDriverAbstract class into the project, temporarily in its own branch. As examples, I have written two bridges that use it.
The following problems still need to be solved:
- [ ] I didn't commit my `composer.lock` because the changes look random and I didn't want to invest time (yet) to understand composer.
- [ ] I have added the php-webdriver dependencies to `bootstrap.php`, but they are not complete. They look messy and unsorted because of the dependencies between them. I suggest using the autoloader.
- [ ] All bridges that depend on WebDriverAbstract would create high(er) load on a multi-user (open) server. Maybe there should be an option to disable the class instead of each bridge using it.
- [ ] Because loading web pages through a browser puts a much heavier load on the server, it would make sense to stop scraping as soon as an element is already in a cache (assuming the elements are sorted chronologically). Is there already a cache for elements that have already been scraped? How can I use it? I looked around the code but didn't find any examples of this or documentation.
- [ ] What do you think is missing from the documentation for WebDriverAbstract?
Kind regards,
Holger
hello thanks for issue and pr. i can help you make this happen
I didn't commit my `composer.lock` because the changes look random and I didn't want to invest time (yet) to understand composer.
At the moment, end users do not use composer and/or composer.json/composer.lock. End users simply extract rss-bridge into a folder and everything works out of the box. This has been the standard way of php since the 90s.
The current usage of composer is to fetch developer dependencies such as phpunit and phpcs (linter).
The current solution for third-party dependencies is to commit them to the repo in the vendor folder.
* [ ] I have added the php-webdriver dependencies to `bootstrap.php`, but they are not complete. They look messy and unsorted because of the dependencies between them. I suggest using the autoloader.
We can commit this dependency to vendor folder manually, or we can start to require end users to use composer for installation.
* [ ] All bridges that depend on WebDriverAbstract would create high(er) load on a multi-user (open) server. Maybe there should be an option to disable the class instead of each bridge using it.
Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.
* [ ] Because loading web pages through a browser puts a much heavier load on the server, it would make sense to stop scraping as soon as an element is already in a cache (assuming the elements are sorted chronologically). Is there already a cache for elements that have already been scraped? How can I use it? I looked around the code but didn't find any examples of this or documentation.
The output of bridges is cached for 1h by default. Configurable per bridge with `const CACHE_TIMEOUT`.
In addition, we have a cache for more fine-grained stuff. It is a field on `BridgeAbstract` and can be accessed like this:
$this->set('foo', 'bar');
$val = $this->get('foo');
See https://github.com/RSS-Bridge/rss-bridge/blob/master/lib/CacheInterface.php
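For illustration, a minimal sketch of the "stop at the first already-seen element" idea built on these helpers could look like the following (ExampleCachedBridge and fetchItemsNewestFirst() are placeholder names, and the exact return value of get() for a missing key is an assumption):

class ExampleCachedBridge extends BridgeAbstract
{
    const NAME = 'Example (illustration only)';

    public function collectData()
    {
        // URI of the newest item emitted on a previous run (assumed empty/null when unset).
        $lastSeen = $this->get('last_seen_uri');
        $newestUri = null;

        foreach ($this->fetchItemsNewestFirst() as $item) {
            if ($item['uri'] === $lastSeen) {
                break; // everything from here on was already scraped last time
            }
            $newestUri = $newestUri ?? $item['uri'];
            $this->items[] = $item;
        }

        if ($newestUri !== null) {
            $this->set('last_seen_uri', $newestUri);
        }
    }

    private function fetchItemsNewestFirst()
    {
        // Placeholder for the bridge's own (WebDriver) scraping, newest items first.
        return [];
    }
}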
* [ ] What do you miss in the documentation for WebDriverAbstract?
Let's postpone docs. Let's make it work first, then make good docs.
Please can you write linux/debian non-docker instructions on how to set up the selenium server on localhost.
Please can you write linux/debian non-docker instructions on how to set up the selenium stuff.
For development? That's easy: just install the Debian package chromium-driver (+ chromium). Since Chromium is the base of Chrome, it is fully compatible.
A full Selenium Grid for production would be more complicated.
recent commits in master allow installing php-webdriver/webdriver using composer:
composer require --update-no-dev php-webdriver/webdriver
or if you also want dev deps:
composer require php-webdriver/webdriver
$ sudo apt install chromium-driver
$ chromedriver --port=4444 --verbose
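Once chromedriver is listening on port 4444, connecting to it from PHP with php-webdriver looks roughly like this (a sketch for a standalone test script, assuming a plain headless Chromium and the composer autoloader; this is not the WebDriverAbstract API itself):

<?php

use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

require __DIR__ . '/vendor/autoload.php';

$options = new ChromeOptions();
$options->addArguments(['--headless']); // no visible browser window

$capabilities = DesiredCapabilities::chrome();
$capabilities->setCapability(ChromeOptions::CAPABILITY_W3C, $options);

// chromedriver speaks the WebDriver protocol directly, no Selenium server needed.
$driver = RemoteWebDriver::create('http://localhost:4444', $capabilities);

try {
    $driver->get('https://example.com');
    echo $driver->getPageSource();
} finally {
    $driver->quit(); // always release the browser session
}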
At the moment, end users do not use composer and/or composer.json/composer.lock. End users simply extract rss-bridge into a folder and everything works out of the box. This has been the standard way of php since the 90s.
The current usage of composer is to fetch developer dependencies such as phpunit and phpcs (linter).
The current solution for third-party dependencies is to commit them to the repo in the vendor folder.
Ah, I assumed that this was due to performance considerations. I've been using PHP autoloaders since the early 2010s and I think it's a good idea, especially to avoid dependency hell.
I am against checking in libraries, partly because of the duplication, but also because then you are responsible for keeping this code up to date, especially if there are security holes.
We can commit this dependency to vendor folder manually, or we can start to require end users to use composer for installation.
How do end users use rss-bridge? I would think that most download the Docker image, a much smaller proportion build the Docker image themselves, and the even smaller remainder use the repository via a web server. The rest should also be able to install the necessary packages locally and invoke composer manually.
Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.
At the moment I don't know enough of the source code to write a patch for it.
The output of bridges is cached for 1h by default. Configurable per bridge with `const CACHE_TIMEOUT`. In addition, we have a cache for more fine-grained stuff. It is a field on `BridgeAbstract` and can be accessed like this:
Thank you. I will try this for GULPProjekteBridge.
I think it would be very nice to have a function, e.g. getContentsWithWebDriver(), that opens the page, waits for it to load and then dumps the HTML. Example:
<?php

// getContentsWithWebDriver() is the proposed (not yet existing) helper that would
// load the page in a browser and return the rendered DOM, ready for find().
class BearBlogBridge extends BridgeAbstract
{
    const NAME = 'BearBlog (bearblog.dev)';

    public function collectData()
    {
        $dom = getContentsWithWebDriver('https://herman.bearblog.dev/blog/');
        foreach ($dom->find('.blog-posts li') as $li) {
            $a = $li->find('a', 0);
            $this->items[] = [
                'title' => $a->plaintext,
                'uri' => 'https://herman.bearblog.dev' . $a->href,
            ];
        }
    }
}
This very basic function could solve lots of problems with dynamic DOM.
Is this possible?
How do end users use rss-bridge? I would think that most download the Docker image, a much smaller proportion build the Docker image themselves, and the even smaller remainder use the repository via a web server. The rest should also be able to install the necessary packages locally and invoke composer manually.
I'm not actually sure, but most probably use the docker image from Docker Hub.
Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.
At the moment I don't know enough of the source code to write a patch for it.
Don't worry about it.
This very basic function could solve lots of problems with dynamic DOM.
Is this possible?
Yes, that would certainly make things easier, but unfortunately, as far as I know, there are no methods for doing this. The DOM is like a fluid in motion. Turning it back into HTML would be possible, but also difficult.
Scraping a webpage using the WebDriver is simply different.
This very basic function could solve lots of problems with dynamic DOM.
Is this possible?
Following this thread with great interest. I'm no programmer, but have you seen this article?
https://www.zenrows.com/blog/selenium-php#parse-the-data-you-want
There is code for scraper.php which does exactly what is being asked here -- returns the HTML of the page. (e.g., $html = $driver->getPageSource();)
There is code for scraper.php which does exactly what is being asked here -- returns the HTML of the page. (e.g., $html = $driver->getPageSource();)
getPageSource() returns the HTML source loaded into the browser, but not the DOM. The DOM turned back into HTML is what we are interested in. The rest of the article does what I did.
Thanks for clarifying.
I'm attempting to try this out, but unfortunately I'm new to this.
- I have php-fpm in a docker container
- selenium standalone in docker
- I've downloaded the latest branch of the codebase onto my web server
I try to run one of these bridges and get the error:
Type: Error Code: 0 Message: Class 'Facebook\WebDriver\Remote\RemoteWebDriver' not found File: lib/WebDriverAbstract.php Line: 104
I gather this is related to the dependency points above. I realize this isn't a support thread, but any documentation or help would be appreciated.
Type: Error Code: 0 Message: Class 'Facebook\WebDriver\Remote\RemoteWebDriver' not found File: lib/WebDriverAbstract.php Line: 104
That is a dependency error. The main branch is currently broken with regard to WebDriver. I will write a fix as soon as I have time. As a quick workaround, you can append this to the $files array in lib/bootstrap.php (after line 27):
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverSearchContext.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/JavaScriptExecutor.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverHasInputDevices.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteWebDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverCapabilities.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverCapabilityType.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverBrowserType.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverPlatform.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/DesiredCapabilities.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Chrome/ChromeOptions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Firefox/FirefoxOptions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverCommandExecutor.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/DriverCommand.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/HttpCommandExecutor.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Local/LocalWebDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Firefox/FirefoxDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverCommand.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/PhpWebDriverExceptionInterface.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/Internal/UnexpectedResponseException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/Internal/WebDriverCurlException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverResponse.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverOptions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/ExecuteMethod.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteExecuteMethod.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverWindow.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverWait.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverExpectedCondition.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverBy.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/JsonWireCompat.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverElement.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Internal/WebDriverLocatable.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteWebElement.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/WebDriverException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/NoSuchElementException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/FileDetector.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/UselessFileDetector.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Support/IsElementDisplayedAtom.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Support/ScreenshotHelper.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverTimeouts.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/UnknownErrorException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverPoint.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverMouse.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteMouse.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/MoveTargetOutOfBoundsException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/WebDriverActions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverKeyboard.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteKeyboard.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/WebDriverCompositeAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverMouseAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverMoveToOffsetAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverCoordinates.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/TimeoutException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/StaleElementReferenceException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverKeys.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/ElementNotInteractableException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/InvalidArgumentException.php',
EDIT: I created the needed directory structure by pulling from php-webdriver's git. I was able to execute the bridge (I needed to comment out a line calling formatItemTimestamp, as "IntlDateFormatter" couldn't be found).
Works! :) I noticed that broader-level feed caching also works. This will be a nice step forward for this project; not only does it help with JS sites, it also helps considerably with sites that use Cloudflare. I'll have fun playing with this now and as it matures. Thanks again for your continued work on this!
Thanks, however I get additional errors.
.../../vendor/php-webdriver/webdriver/lib/WebDriverSearchContext.php): failed to open stream: No such file or directory.....
This folder doesn't appear in master: /vendor/php-webdriver/webdriver/
EDIT: I created the needed directory structure by pulling from php-webdriver's git. I was able to execute the bridge (I needed to comment out a line calling formatItemTimestamp, as "IntlDateFormatter" couldn't be found).
Ah, that wouldn't have been necessary. Just a few comments up, @dvikan posted this command:
composer require --update-no-dev php-webdriver/webdriver
And IntlDateFormatter is a standard php class and should work right out of the box. At least it did on my machine:
https://www.php.net/manual/en/class.intldateformatter.php
IntlDateFormatter (php extension) might be missing out of the box on e.g. debian:
apt install php-intl
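To check whether the extension is already available to PHP, a quick one-liner is enough:
$ php -r "var_dump(extension_loaded('intl'));"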
More info in composer.json:
"suggest": {
    "php-webdriver/webdriver": "Required for Selenium usage",
    "ext-memcached": "Allows to use memcached as cache type",
    "ext-sqlite3": "Allows to use an SQLite database for caching",
    "ext-zip": "Required for FDroidRepoBridge",
    "ext-intl": "Required for OLXBridge"
},
Feedback:
I'm no developer, so I can only test and provide feedback :)
- Per the front page of this project, you should be able to download the zip, extract it to your web server and execute. (That's what I did.) If there are dependencies (e.g., the php-webdriver folder), then I suggest they be added to the project. Otherwise, it might be best to shift to a docker-only model where all of this could be handled automatically?
- As I was attempting to create a bridge, I wanted to use navigation within the browser (e.g., the back button). To make that work, I had to add to bootstrap.php: __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverNavigationInterface.php', __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverNavigation.php',
- This will not be a fix for Cloudflare, unfortunately, at least not initially. Cloudflare is able to detect Selenium sessions and block them.
- Using the button "ATOM" (to create the feed) is broken and errors out. MRSS does work.
- As stated in this thread, while there is caching at the feed level, there is no caching at the lower levels (e.g., the code of the page). This is impactful if you were to build a full-text RSS feed.
Other: the Videocardz site -- it has recently gone behind Cloudflare for several regions of the world. While this won't address that issue, if you are able to access it, I was able to do a full-text feed (perhaps useful as an example):
https://gist.github.com/dr0id123/0d14ff0e2a666fd6e8c949091c7dea45
Hope this helps!
This was recently merged in master so that users can test it out. Let's call it a beta testing phase. When things are looking good I'll add instructions to the README.
In the meantime, all that is needed to test this is `composer require --update-no-dev php-webdriver/webdriver`.
I'd advise against manually modifying files in bootstrap.php or the vendor folder, unless you know what you are doing.
I think I have a way that presents a path forward (in some cases):
- I wanted to find a way around Cloudflare (where possible)
- Using a full browser addresses other issues
- Need for article-level caching especially on full-text
- Works with the various existing interfaces in rssBridge
After looking around, I came across this project: https://github.com/FlareSolverr/FlareSolverr FlareSolverr uses Selenium combined with a modified driver to help bypass Cloudflare, all wrapped up in Docker. It's very hard to bypass Cloudflare, and it will depend on the site, configuration, etc., as we know. That said, it does work depending on the site.
Here's my working example:
https://gist.github.com/dr0id123/2863a821b8bf35864f977a3eb9dface4
Using this method can be quite slow (FlareSolverr takes time to get around protections) and otherwise expensive performance-wise due to Selenium. Hence the caching really helps, and I'm happy the rss-bridge API provides for it.
I'm no coder, I'm sure you both could have done this in a better way.....
(Works quite nicely for me!!)
I'm no coder, I'm sure you both could have done this in a better way.....
That depends on how you look at it. In my opinion, this amounts to a game of cat and mouse in which both sides can only lose. I don't feel like investing energy into it. Web technologies work best when they are designed to be interactive. I can live with it if webmasters of modern websites no longer know what an RSS feed is. Then I'll just build one myself. But if someone sets up a walled garden, then I don't have to free the content from it at any price.
I added a path for active content sites because that's the modern web. But I have no interest in countermeasures for defensive strategies.
EDIT -- Am I wrong that it "does not solve the issue of sites that you have created bridges and this interface for"? I'm truly new to this, so sorry for the basic questions....
- Using curl to pull from the site in your bridge just provides a bunch of what looks to me like JavaScript. No content of the page.
- Using FlareSolverr is different though. It provides JSON -- see: https://gist.github.com/dr0id123/e82b45b6aa7e281cebf6c94fddfd494c
The full response is within the JSON (e.g., the content of the page article). Does this mean that using the JSON output from FlareSolverr is a way to get the DOM and parse it accordingly to create a feed? Is this because it uses Selenium and pulls directly out of the browser?
I added a path for active content sites because that's the modern web. But I have no interest in countermeasures for defensive strategies.
Totally understandable. It is literally a cat-and-mouse game with tools like FlareSolverr, which, I may add, does not solve the issue of the sites that you have created bridges and this interface for.
Ultimately, I've come to conclude there is no "all in one" answer to all of this. rss-bridge provides a menu of approaches to choose from depending on the site and need. The WebDriver support you've added to the project is another tool in the belt and a much-needed addition.
@dvikan
I've taken some time to educate myself on the 'dynamic DOM' and the complications that exist today with rss-bridge. The work done in this pull request is awesome, but as with many things code-related, there are multiple ways to reach the same objective. The implementation in this thread is, I think, awesome for advanced cases. For other cases, I do think the project would benefit from an integration with FlareSolverr (or at least guidance with examples), not because it attempts some Cloudflare mitigation, but because it uses Selenium under the hood, which means dynamic content is processed.
To give a clear example, using the 'scalable capital' case, here is another way that does not require the use of php-webdriver and keeps things simple:
https://gist.github.com/dr0id123/17faa66b365f37546d93bd5f033759e3
yes this is similar to what i wrote earlier in this thread:
$dom = getContentsWithWebDriver('https://herman.bearblog.dev/blog/');
Essentially a function which returns an HTML string (or a parsed DOM) for a dynamic page.
FlareSolverr is a proxy server to bypass Cloudflare protection, which incidentally renders the dynamic DOM using Selenium, as you write.
your example is actually also possible like this (using getContents):
$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data_array)];
$html = getContents($url, $header, $opts);
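Filled in, that could look roughly like this (the endpoint URL and payload follow FlareSolverr's request.get command; host and port depend on how the container is exposed):

// FlareSolverr endpoint (default port 8191; adjust to your setup).
$url = 'http://localhost:8191/v1';

// Payload for FlareSolverr's request.get command.
$data_array = [
    'cmd' => 'request.get',
    'url' => 'https://herman.bearblog.dev/blog/',
    'maxTimeout' => 60000,
];

$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data_array)];
$json = getContents($url, $header, $opts); // JSON response; the rendered page ends up in solution.response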
@hleskien This is from a stackoverflow answer:
In [PHP Selenium WebDriver](https://github.com/php-webdriver/php-webdriver) you can get the page source like this:
$html = $driver->getPageSource();
Or get HTML of the element like this:
// innerHTML if you need HTML of the element content
$html = $element->getDomProperty('outerHTML');
Thank you! This helps me simplify elsewhere. Gist updated.
This works:
// $api_url points at the FlareSolverr endpoint and $data holds the request.get payload,
// as in the example above.
$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data)];
$response_data = json_decode(getContents($api_url, $header, $opts), true);
// The rendered HTML of the page is inside the "solution" object.
$html = $response_data["solution"]["response"];
return $html;
$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data_array)];
$html = getContents($url, $header, $opts);
how is this working for ya? is it still good?
@dvikan
Preface:
- My webserver is a docker container, with PHP-FPM in another docker container. The PHP-FPM image does not come with composer.
Feedback:
I am not specifically using webdriver at this point for a couple of reasons:
- It requires dependencies, and I don't use composer (see above), which leads to dependency problems. At some point I will switch to the docker image of rss-bridge, where I hope these dependency issues will be addressed.
- The Atom feed was broken the last time I tried it with WebDriver.
- WebDriver - I don't have a need for this implementation. My needs are: a. get around Cloudflare where possible; b. be able to create feeds, where possible including sites with a dynamic DOM; c. create full-text feeds where possible, including sites that obfuscate div tags.
- To address my needs (a) and (b) above, I am using FlareSolverr as discussed in this thread. No additional code dependencies (libraries) are needed for it to work. It is a very good no-cost solution at the moment for Cloudflare, plus the dynamic DOM is processed. This works with all of the other functionality of rss-bridge (caching, feedexpander, xpath).
- To address (c), I've been able to import "readability" to create full-text feeds where I need it, as an alternative to CSS/XPath directly (and it works great!).
Therefore, my suggestions to this project:
- Incorporate 'light' support for FlareSolverr. That means an option in the config file to specify connection parameters, and a wrapper function (a rough sketch follows below) -- and if not, add some documentation with an example for others.
$html = getContentsWithFlareSolverr('https://herman.bearblog.dev/blog/');
- Introduce 'light' support for readability -- see here: https://github.com/fivefilters/readability.php. By 'light', I mean addressing the dependency (vendor folder?) or other. As mentioned, it works very nicely!
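For the FlareSolverr wrapper mentioned above, a rough sketch could look like this (the function name follows the example call above; the endpoint would come from the proposed config option, and the payload follows FlareSolverr's request.get command):

// Hypothetical wrapper; $flaresolverrUrl would come from the proposed config option.
function getContentsWithFlareSolverr($url, $flaresolverrUrl = 'http://localhost:8191/v1', $maxTimeout = 60000)
{
    $header = ['Content-type: application/json'];
    $opts = [CURLOPT_POSTFIELDS => json_encode([
        'cmd' => 'request.get',
        'url' => $url,
        'maxTimeout' => $maxTimeout,
    ])];

    $response = json_decode(getContents($flaresolverrUrl, $header, $opts), true);

    // The rendered HTML (after the browser and any challenge solving) lives here.
    return $response['solution']['response'];
}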
Hope this helps.