puppeteer-extra
Run all Puppeteer commands in an Isolated World
Attempting to address this issue below and looking for feedback and ideas on any potential ways to achieve this more cleanly: https://github.com/berstend/puppeteer-extra/issues/209
The goal is to have Puppeteer run every command in an Isolated World so that detection scripts cannot monitor execution.
The concept write up is here (sorry, needed too much detail to include within the issue text): https://github.com/prescience-data/harden-puppeteer
The only way I can figure out to achieve this is by modifying the vanilla Puppeteer files directly in the node_modules folder, so I'm hoping someone with experience writing Puppeteer-Extra plugins can advise a way to do this with a plugin instead.
Thanks!
Have you looked/tried https://github.com/ds300/patch-package ?
@brunogaspar thanks! Just re-did the concept as a patch, much much easier to follow compared to the previous way!
My main concern is whether this will somehow mess up plugins like Extra-Stealth by accidentally running all of their modifications within the isolated world (ie effectively disabling them).
Just a matter of trying, have you tried with the latest Puppeteer?
Yes, it does work (on 1.19.0, will be updating it to 2.1.0) - I'm just trying to think of unknown unknowns etc. Do you know of any ways to test Extra-Stealth features to make sure they are all active?
Hmm, you mean, to determine if the stuff that the stealth plugin does is still being applied? If that's it, I suppose you can try to mimic what the unit tests for the stealth plugin do.
If that's not it, please elaborate a bit more and i'll try to help you out.
My main concern is whether this will somehow mess up plugins like Extra-Stealth by accidentally running all of their modifications within the isolated world (ie effectively disabling them).
This should become apparent immediately when using your patches and running yarn test in the stealth plugin repo. :-)
Haven't looked more closely at isolated worlds so far but is it similar to what happens in Chrome Extensions and Content Scripts? If so then this would have an effect, as the Puppeteer scripts couldn't access the site's local window object (only DOM) without injecting another script in the site.
Yes it's the same as how the Content Scripts work from memory.
What I've tried to do is to isolate only the commands sent by the user, meaning the rest of Puppeteer should run normally, but any detection scripts will be unable to monitor your commands, other than to see the outcome in the DOM.
The trade-off is that any global libraries you might expect to have access to must be included directly in the script rather than looked up on window._____, and naturally that means that if you need to interact with the site's custom scripts directly, you might not be able to (I have not tested this though).
Ok so running the tests in puppeteer-extra-plugin-stealth dumps a bunch of these errors with the patch applied:
Rejected promise returned by test. Reason:
Error {
message: `Evaluation failed: ReferenceError: fpCollect is not defined
at jquery.js:1:18`,
}
Which would be expected if fpCollect is defined outside the isolated world, but the test seems to be testing from "inside" Puppeteer, whereas I think a more accurate test would be inspecting it from "outside", as a detection script would?
But don't we intentionally want to run in the same context as the site's JS in order to be able to access and modify it?
Let's make a simpler test case to help understand this:
await page.evaluateOnNewDocument(() => {
  delete Object.getPrototypeOf(navigator).webdriver
})
Edit: then navigate to https://bot.sannysoft.com/ and check whether Webdriver is missing.
Would this work with your patched files? If not (similar to Content Script isolation in Chrome) we'd need to inject another JS script into the site/DOM with the actual payload, which is trivial to detect (MutationObservers, Content Security Policies).
I don't believe so because this is the page.evaluateOnNewDocument function:
/**
 * @param {Function|string} pageFunction
 * @param {!Array<*>} args
 */
async evaluateOnNewDocument(pageFunction, ...args) {
  const source = helper.evaluationString(pageFunction, ...args);
  await this._client.send('Page.addScriptToEvaluateOnNewDocument', { source });
}
You can see it is sending the command directly to the _client rather than passing through FrameManager (which is where the isolated world exists).
The isolated world is set up to catch things like page.evaluate(), eg:
/**
 * @param {Function|string} pageFunction
 * @param {!Array<*>} args
 * @return {!Promise<*>}
 */
async evaluate(pageFunction, ...args) {
  return this._frameManager.mainFrame().evaluate(pageFunction, ...args);
}
Which has been overridden here https://github.com/prescience-data/harden-puppeteer/blob/ba202cc0a422b257c26f023fbaafd41f7ae48157/patches/puppeteer%2B1.19.0.patch#L86 to:
return this._frameManager.isolatedWorld().evaluate(pageFunction, ...args);
The _mainFrame() is the dangerous one where the detection scripts exist. Running the interaction commands like evaluate(), type(), etc inside the isolated world means the detection scripts cannot monitor them.
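For comparison, a similar effect can probably be approximated without touching node_modules by talking to the DevTools protocol directly. This is a rough sketch, not what the patch does: it assumes a Puppeteer version where page.target().createCDPSession() is available, the function name evaluateInIsolatedWorld and the world name 'hardened-world' are made up, and it only covers string expressions on the main frame. It has not been checked against the stealth plugin.

```javascript
// Evaluate a string expression inside a freshly created isolated world.
// `page` is expected to be a Puppeteer Page (or anything with the same shape).
async function evaluateInIsolatedWorld(page, expression) {
  const client = await page.target().createCDPSession();

  // Look up the main frame's id through the protocol rather than private fields.
  const { frameTree } = await client.send('Page.getFrameTree');

  // Ask Chrome for a new isolated world attached to that frame.
  const { executionContextId } = await client.send('Page.createIsolatedWorld', {
    frameId: frameTree.frame.id,
    worldName: 'hardened-world', // hypothetical name, any string should do
  });

  // Run the expression in that world instead of the page's main context.
  const { result, exceptionDetails } = await client.send('Runtime.evaluate', {
    expression,
    contextId: executionContextId,
    returnByValue: true,
    awaitPromise: true,
  });

  if (exceptionDetails) {
    throw new Error(`Evaluation failed: ${exceptionDetails.text}`);
  }
  return result.value;
}
```

Usage would look like `const title = await evaluateInIsolatedWorld(page, 'document.title')`. Because the world is created per call, anything a script stores on its own globals is not guaranteed to be reachable from a later call.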
Edit: then navigate to https://bot.sannysoft.com/ and check whether Webdriver is missing.
Passed:
Is this what you were expecting for webdriver?
(edit: FYI that is on the 1.19 patch, 2.1.0 is not working properly) (edit 2: Just updated the 2.x patch to 2.1.1 and is now working)
Also passes:
FingerprintJS
https://fingerprintjs.com/demo
Are You Headless?
https://arh.antoinevastel.com/bots/areyouheadless
SocialNetsDefender
http://anonymity.space/hellobot.php
Distil Networks
http://promos.rtm.com
Ok so I can confirm that the patch works as intended.
I've made a test for it here that uses Vastel's execution monitoring technique to figure out if the host site has any visibility into the patched context:
Puppeteer Test: https://github.com/prescience-data/puppeteer-botcheck/blob/b6848845b8b5887608784caa2fe7a078db866e9b/Botcheck.js#L45
Host Monitoring Execution: https://github.com/prescience-data/prescience-data.github.io/blob/master/execution-monitor.html
URL of the live test: https://prescience-data.github.io/execution-monitor.html
Here are the differences between unpatched and patched:
Unpatched
Patched
You can see that the patched version only detects the inserted elements, which were deliberately left unisolated to allow the user to inject scripts into the main context (ie all the extra-stealth modifications).
However, anything other than that is running isolated and outside the security scope of any bot detection script.
Naturally they would be able to observe changes you make to the DOM, but only the outcome, not how the execution is occurring.
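To make the "outcome but not execution" point concrete, here is a toy model in plain JavaScript. It is only a sketch under simplifying assumptions: domNodes stands in for the real DOM, createWorld is a made-up helper that gives each world its own wrapper functions, and the "detection script" is just a wrapper that records calls. Real execution monitoring (stack traces, getters, and so on) is more involved.

```javascript
// Shared backing store standing in for the real DOM tree.
const domNodes = new Map([['login', { id: 'login', value: '' }]]);

// Each world gets its own JavaScript wrappers around the same nodes.
function createWorld() {
  return {
    document: {
      getElementById: (id) => domNodes.get(id),
    },
  };
}

const mainWorld = createWorld();
const isolatedWorld = createWorld();

// Detection script in the main world: wrap the lookup and record every call.
const observedCalls = [];
const originalGetElementById = mainWorld.document.getElementById;
mainWorld.document.getElementById = (id) => {
  observedCalls.push(id);
  return originalGetElementById(id);
};

// Automation running in the isolated world never touches the wrapped function.
isolatedWorld.document.getElementById('login').value = 'typed by automation';
const callsSeenDuringAutomation = observedCalls.length;

// The resulting change is still visible when the main world reads the node.
const valueSeenByMain = mainWorld.document.getElementById('login').value;

console.log(callsSeenDuringAutomation); // 0
console.log(valueSeenByMain);           // typed by automation
```

The monitor records nothing while the automation runs, yet the changed value is there for the page to read afterwards.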