
proxy support

Open phiresky opened this issue 4 years ago • 5 comments

Hey,

From what I can see, the Docker container has no support for (SOCKS) proxies, so all outgoing requests originate from the machine running the container?

Puppeteer (and chromium) can be started with args: ["--proxy-server=socks5://localhost:1234"]

It seems like puppeteer-cluster allows passing puppeteer args: https://github.com/thomasdondorf/puppeteer-cluster/issues/368#issuecomment-780117594

But those are not configurable in browsertrix-crawler:

https://github.com/webrecorder/browsertrix-crawler/blob/c5494be653566c4352cea298a9b2d9bac9bb2a4e/crawler.js#L196-L206
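For reference, the flag in question can be tried against a standalone Chromium to confirm the proxy works outside the crawler; this is just a sketch, and the binary name and proxy address are placeholders:

```shell
# Launch Chromium with all traffic routed through a local SOCKS5 proxy.
# "chromium" may be "chromium-browser" or "google-chrome" depending on the distro.
chromium --proxy-server="socks5://localhost:1234" https://example.com
```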

phiresky avatar Sep 15 '21 14:09 phiresky

I see that in https://github.com/webrecorder/browsertrix-crawler/blob/c5494be653566c4352cea298a9b2d9bac9bb2a4e/crawler.js#L186

it actually takes the proxy configuration from the environment, so that somewhat answers my question - but it hardcodes an HTTP proxy, disallowing SOCKS proxies. It would also be good to be able to supply a list of proxies, with page requests distributed randomly across them.

phiresky avatar Sep 15 '21 14:09 phiresky

Yes, the way the archiving works is that Chrome is set to proxy through the pywb instance, which captures HTTPS traffic as it passes through, so this proxy setting should not be changed. However, pywb itself does support SOCKS5 proxying, which can be configured via environment variables set on the container, as per: https://pywb.readthedocs.io/en/latest/manual/configuring.html?highlight=SOCKS#socks-proxy-for-live-web

How were you thinking of using the proxy? Proxy rotation would have to be handled at the pywb level also, which is doable. Can you say more about the use case you have in mind?

ikreymer avatar Sep 17 '21 21:09 ikreymer

Ah, I didn't see that capturing works by actually MITMing the traffic; I thought it hooked into some Chrome API like archiveweb.page does. So it's great that this is already configurable - maybe it could be added to the docs here?

> How were you thinking of using the proxy? Proxy rotation would have to be handled at the pywb level also, which is doable. Can you say more about the use case you have in mind?

Nothing special, really - just archiving/scraping a larger number of web pages and distributing the traffic across multiple IPs in some fairly simple way (e.g. I use ssh to create 10 SOCKS proxies, then let pywb distribute either each page load or each Chrome instance onto one of the proxies). That way I can also run the Chrome instances etc. on a powerful machine while the outgoing traffic comes from elsewhere.
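The ssh-based proxies mentioned above can be created like this; the host names and port range are placeholders for illustration:

```shell
# Open 10 local SOCKS5 proxies on ports 1080-1089, each tunneling through a
# different exit host. -D sets up dynamic port forwarding, -N skips running a
# remote command, -f backgrounds the connection.
for i in $(seq 0 9); do
  ssh -f -N -D "$((1080 + i))" "user@exit-host-$i"
done
```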

phiresky avatar Sep 19 '21 08:09 phiresky

One more issue. When setting a proxy with

docker run -e SOCKS_HOST=localhost -e SOCKS_PORT=15257 ...

The fetching fails with errors like

 in SOCKSProxyManager
    raise InvalidSchema("Missing dependencies for SOCKS support.")
requests.exceptions.InvalidSchema: Missing dependencies for SOCKS support.

Additionally, these errors are not shown at all unless --logging pywb is passed; instead, browsertrix-crawler reports all URLs as succeeded, and pages.jsonl has entries like

{
  "id": "...",
  "url": "...",
  "title": "Pywb Error"
}

I'm not sure why these are reported as successes?
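Until that's fixed, one way to spot these false successes after a crawl is to scan pages.jsonl for the placeholder title. A minimal sketch; the file contents below are fabricated for illustration:

```shell
# Fake pages.jsonl with one good page and one pywb error page.
printf '%s\n' \
  '{"id":"1","url":"https://a.example/","title":"Page A"}' \
  '{"id":"2","url":"https://b.example/","title":"Pywb Error"}' > pages.jsonl

# Count entries that pywb actually failed to fetch.
grep -c '"title": *"Pywb Error"' pages.jsonl
```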

Also, I don't think the sitemap is fetched via the proxy.

phiresky avatar Oct 07 '21 13:10 phiresky

The SOCKS issue can probably be fixed by replacing requests with requests[socks] in pywb/requirements.txt.
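That change can be sketched like this; the sample requirements.txt contents are an assumption, not pywb's actual file:

```shell
# Sample requirements.txt standing in for pywb's (real contents differ).
printf 'requests\nsix\n' > requirements.txt

# Switch plain requests to the [socks] extra, which pulls in PySocks and
# makes requests.exceptions.InvalidSchema for SOCKS URLs go away.
sed -i 's/^requests$/requests[socks]/' requirements.txt

cat requirements.txt
```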

phiresky avatar Oct 07 '21 13:10 phiresky