Squidwarc
Feature request: set browser accept language
When running Squidwarc on server hosts in other countries, websites will sometimes present their UI in the language associated with the server host's IP address range (e.g. when I archive Facebook pages from a server in Germany, Facebook presents its interface in German). If it were possible to set Chrome's Accept-Language parameter from the job JSON, the archiver would have more control.
This is a good suggestion for an option, @peterk. http://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html provides some examples of weirdness in language detection via IA submission. It would be interesting to test this from different IPs and Accept-Language values to see if the effects are replicable.
This issue is up next on the big list of things to do.
I know this has been a long time coming, but once the chrome-remote-interface-extra-intergration branch is merged, this and a whole lot more will be possible using Squidwarc.
PS: spread the word, you don't need puppeteer to simply use the CDP https://github.com/N0taN3rd/chrome-remote-interface-extra ;)
Hey y'all I finally got node-warc and chrome-remote-interface-extra in a position to support this feature request.
I am thinking the API for this is as follows:
Just as you supply a user script that is run before WARC generation, you can supply a function that customizes the behavior of the browser. Its only argument is the page object of chrome-remote-interface-extra or puppeteer, or the chrome-remote-interface client object.
Example when using chrome-remote-interface-extra (the type definitions for the argument of pageOrClient.setGeolocation are not valid JS but are provided for your convenience):
module.exports = async function chromeCustomizer (pageOrClient) {
  // set the download path of files downloaded by the browser
  await pageOrClient.setDownloadBehavior('<path to new downloads folder>')
  // set the Accept-Language HTTP header
  await pageOrClient.setAcceptLanguage('<new language>')
  // set navigator.platform
  await pageOrClient.setNavigatorPlatform('<new platform>')
  // set new geolocation
  await pageOrClient.setGeolocation({longitude: number, latitude: number, accuracy: (number|undefined)})
}
For both chrome-remote-interface-extra and puppeteer, the connection to the browser tab is found on pageOrClient._client if you need more fine-tuned customization. As always, please consult the CDP documentation for details.
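As a rough illustration of that lower-level route, here is a minimal sketch that sets the Accept-Language header through the raw CDP connection instead of the page helpers. It assumes the connection object exposes a send(method, params) function, as puppeteer's CDPSession and the chrome-remote-interface client do; the helper name acceptLanguageCommand and the example language value are hypothetical and not part of Squidwarc. Network.enable and Network.setExtraHTTPHeaders are standard CDP methods.

```javascript
// Hypothetical helper: build the CDP command that overrides the
// Accept-Language request header for the tab.
function acceptLanguageCommand (language) {
  return {
    method: 'Network.setExtraHTTPHeaders',
    params: { headers: { 'Accept-Language': language } }
  }
}

async function chromeCustomizer (pageOrClient) {
  // For page objects the tab connection is on _client (per the comment above);
  // a bare chrome-remote-interface client is used directly.
  const client = pageOrClient._client || pageOrClient
  // Assumption: the connection exposes send(method, params).
  await client.send('Network.enable')
  const { method, params } = acceptLanguageCommand('en-US,en;q=0.9')
  await client.send(method, params)
}

module.exports = chromeCustomizer
```

Note that this only changes the HTTP header sent with requests; whether a given site also consults the IP address or cookies when choosing a language is a separate question.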
Please let me know if there are any suggestions or concerns about how to make this as user friendly as possible.
Documentation on the upcoming chrome-remote-interface-extra integration: https://n0tan3rd.github.io/chrome-remote-interface-extra/
Hey y'all, if you want to start test-running things today, this feature is living in the chrome-remote-interface-extra-intergration branch. The entry point for making changes like this is the chromeCustomizer.js file.
Puppeteer CI is currently failing, and chrome-remote-interface-extra's CI is good except for a pesky net::ERR_NAME_NOT_RESOLVED vs net::ERR_NAME_RESOLUTION_FAILED error message mismatch that happens on Travis, for some reason, when using Google Chrome Canary. CI link: https://travis-ci.com/N0taN3rd/chrome-remote-interface-extra
Full documentation, including what you can do with this library beyond what puppeteer offers, is found here: https://n0tan3rd.github.io/chrome-remote-interface-extra/
I'm going to add Redis frontier support and frontier customization functions before this feature gets merged into master (I'm tired of in-memory frontiers).