Feature request: set browser accept language

Open peterk opened this issue 6 years ago • 6 comments

When Squidwarc runs on servers hosted in other countries, websites will sometimes present their UI in the language associated with the server's IP address range. (E.g. when I archive Facebook pages from a server in Germany, Facebook presents its interface in German.) If it were possible to set Chrome's Accept-Language parameter from the job JSON, the archiver would have more control over the language of the captured pages.

peterk avatar Dec 05 '18 22:12 peterk

This is a good suggestion for an option, @peterk. http://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html provides some examples of weirdness in language detection via IA submission. It would be interesting to test this from different IPs and Accept-Language values to see if the effects are replicable.

machawk1 avatar Dec 06 '18 16:12 machawk1

This issue is up next on the big list of things to do

N0taN3rd avatar Feb 05 '19 05:02 N0taN3rd

I know this has been a semi-long time coming, but once the chrome-remote-interface-extra-intergration branch is merged, this and a whole lot more will be possible using Squidwarc.

PS: spread the word, you don't need puppeteer just to use the CDP: https://github.com/N0taN3rd/chrome-remote-interface-extra ;)

N0taN3rd avatar Feb 10 '19 07:02 N0taN3rd

Hey y'all, I finally got node-warc and chrome-remote-interface-extra into a position to support this feature request.

I am thinking the API for this is as follows:

As with the user script that runs before WARC generation, you supply a function. It receives a single argument, which is either the page object of chrome-remote-interface-extra or puppeteer, or the chrome-remote-interface client object, and it uses that argument to customize the behavior of the browser.

Example when using chrome-remote-interface-extra (the type definitions for the arguments of pageOrClient.setGeolocation are not valid JS but are provided for your convenience):

module.exports = async function chromeCustomizer (pageOrClient) {
    // set the download path of files downloaded by the browser
    await pageOrClient.setDownloadBehavior('<path to new downloads folder>')

    // set the Accept-Language HTTP header
    await pageOrClient.setAcceptLanguage('<new language>')

    // set navigator.platform
    await pageOrClient.setNavigatorPlatform('<new platform>')

    // set new geolocation
    await pageOrClient.setGeolocation({longitude: number, latitude: number, accuracy: (number|undefined)})
}

For both chrome-remote-interface-extra and puppeteer, the connection to the browser tab is found on pageOrClient._client if you need more fine-tuned customization. As always, please consult the CDP documentation for details.
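For the lower-level route, here is a minimal sketch of setting the Accept-Language header directly over the CDP. The helper name setAcceptLanguageViaCDP is hypothetical. It assumes the tab connection (pageOrClient._client, or the client object itself) exposes a send(method, params) function that returns a promise, which may not hold for every library version. Network.enable and Network.setExtraHTTPHeaders are CDP methods.

```javascript
// Hypothetical helper, not part of Squidwarc or chrome-remote-interface-extra.
// Assumption: the tab connection exposes send(method, params) returning a promise.
async function setAcceptLanguageViaCDP (pageOrClient, acceptLanguage) {
  // page objects keep the tab connection on _client; a raw client is used directly
  const session = pageOrClient._client || pageOrClient
  if (typeof session.send !== 'function') {
    throw new TypeError('expected a page or client exposing send(method, params)')
  }
  // make sure the Network domain is enabled before overriding headers
  await session.send('Network.enable')
  // add the header to every request issued by this tab
  await session.send('Network.setExtraHTTPHeaders', {
    headers: { 'Accept-Language': acceptLanguage }
  })
}

module.exports = setAcceptLanguageViaCDP
```

Note that this only changes the request header. Sites that choose a language from the IP address or from cookies may still ignore it.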

Please let me know if there are any suggestions or concerns about how to make this as user friendly as possible.

N0taN3rd avatar Feb 24 '19 06:02 N0taN3rd

Documentation on the upcoming chrome-remote-interface-extra integration https://n0tan3rd.github.io/chrome-remote-interface-extra/

N0taN3rd avatar Feb 25 '19 07:02 N0taN3rd

Hey y'all, if you want to start test-running things today, this feature lives in the chrome-remote-interface-extra-intergration branch. The entry point for making changes like this is the chromeCustomizer.js file.
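For the original request, a chromeCustomizer.js could look roughly like the sketch below. It is an illustration, not the file that ships in the branch. It assumes the setAcceptLanguage method proposed earlier in this thread, falls back to puppeteer's page.setExtraHTTPHeaders when that method is missing, and uses 'en-US,en;q=0.9' purely as an example value.

```javascript
// chromeCustomizer.js (sketch only; the real file in the branch may differ)
const ACCEPT_LANGUAGE = 'en-US,en;q=0.9' // example value, pick your own

async function chromeCustomizer (pageOrClient) {
  if (typeof pageOrClient.setAcceptLanguage === 'function') {
    // method proposed in this thread for chrome-remote-interface-extra pages
    await pageOrClient.setAcceptLanguage(ACCEPT_LANGUAGE)
  } else if (typeof pageOrClient.setExtraHTTPHeaders === 'function') {
    // puppeteer pages: send the header with every request
    await pageOrClient.setExtraHTTPHeaders({ 'Accept-Language': ACCEPT_LANGUAGE })
  } else {
    throw new Error('unsupported page or client object')
  }
}

module.exports = chromeCustomizer
```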

Puppeteer CI is currently failing, and chrome-remote-interface-extra's CI is good except for a pesky net::ERR_NAME_NOT_RESOLVED vs net::ERR_NAME_RESOLUTION_FAILED error-message mismatch that happens on Travis, for some reason, when using Google Chrome Canary. CI link: https://travis-ci.com/N0taN3rd/chrome-remote-interface-extra

Full documentation, including the things you can do with this library that you cannot do with puppeteer, is found here: https://n0tan3rd.github.io/chrome-remote-interface-extra/.

I'm gonna add Redis frontier support and frontier customization functions before this feature gets merged into master (I'm tired of in-memory frontiers).

N0taN3rd avatar Mar 12 '19 04:03 N0taN3rd