proxy support
Hey,
from what I can see, the Docker container has no support for (SOCKS) proxies, so all outgoing requests originate from the machine running the container?
Puppeteer (and Chromium) can be started with `args: ["--proxy-server=socks5://localhost:1234"]`.
It seems like puppeteer-cluster allows passing puppeteer args: https://github.com/thomasdondorf/puppeteer-cluster/issues/368#issuecomment-780117594
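For illustration, a minimal sketch of passing that flag through puppeteer-cluster (proxy address and URL are placeholders); `puppeteerOptions` is forwarded to `puppeteer.launch()`:

```js
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    // Forwarded to puppeteer.launch(), so every Chromium instance
    // starts with the SOCKS proxy flag.
    puppeteerOptions: {
      args: ['--proxy-server=socks5://localhost:1234'],
    },
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
  });

  cluster.queue('https://example.com/');
  await cluster.idle();
  await cluster.close();
})();
```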
But those are not configurable in browsertrix-crawler:
https://github.com/webrecorder/browsertrix-crawler/blob/c5494be653566c4352cea298a9b2d9bac9bb2a4e/crawler.js#L196-L206
I see that in https://github.com/webrecorder/browsertrix-crawler/blob/c5494be653566c4352cea298a9b2d9bac9bb2a4e/crawler.js#L186
it actually takes the proxy configuration from the environment, which somewhat answers my question - but it hardcodes an HTTP proxy, ruling out SOCKS proxies. It would also be good to be able to supply a list of proxies, with page requests distributed randomly across them.
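To sketch what I mean (purely hypothetical, not an existing option): pick a random proxy from a user-supplied list for each browser launch:

```js
// Hypothetical sketch: one randomly chosen proxy per browser launch.
const proxies = [
  'socks5://localhost:15257',
  'socks5://localhost:15258',
  'socks5://localhost:15259',
];

function randomProxyArg() {
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];
  return `--proxy-server=${proxy}`;
}
```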
Yes, the way the archiving works is that Chrome is set to proxy through the pywb instance, which captures the HTTP(S) traffic as it passes through, so this proxy setting should not be changed. However, pywb itself does support SOCKS5 proxying, which can be configured via environment variables set on the container, as per: https://pywb.readthedocs.io/en/latest/manual/configuring.html?highlight=SOCKS#socks-proxy-for-live-web
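For example, something like this on the container (host and port are placeholders):

```sh
# SOCKS_HOST / SOCKS_PORT are the env vars pywb reads for live-web fetches
docker run -e SOCKS_HOST=my-proxy-host -e SOCKS_PORT=1080 webrecorder/browsertrix-crawler ...
```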
How were you thinking of using the proxy? Proxy rotation would have to be handled at the pywb level also, which is doable. Can you say more about the use case you have in mind?
Ah, I didn't see that capturing works by actually MITMing the traffic; I thought it hooked into some Chrome API like archiveweb.page does. So it's great that this is already configurable - maybe it could be added to the docs here?
> How were you thinking of using the proxy? Proxy rotation would have to be handled at the pywb level also, which is doable. Can you say more about the use case you have in mind?
Nothing special really - just archiving/scraping a larger number of web pages and distributing the traffic across multiple IPs in some fairly simple way (e.g. I use ssh to create 10 SOCKS proxies, then let pywb distribute either each page load or each Chrome instance onto one of the proxies). That way I can also run the Chrome instances on a powerful machine while the outgoing traffic comes from elsewhere.
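For reference, each of those proxies is just ssh dynamic port forwarding, roughly like this (hosts and ports are placeholders):

```sh
# -D opens a local SOCKS5 listener; -N runs no remote command; -f backgrounds
ssh -f -N -D 15257 user@egress-host-1
ssh -f -N -D 15258 user@egress-host-2
```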
One more issue. When setting a proxy with

```sh
docker run -e SOCKS_HOST=localhost -e SOCKS_PORT=15257 ...
```
fetching fails with errors like:

```
in SOCKSProxyManager
    raise InvalidSchema("Missing dependencies for SOCKS support.")
requests.exceptions.InvalidSchema: Missing dependencies for SOCKS support.
```
Additionally, these errors are not shown at all unless `--logging pywb` is passed; instead, browsertrix-crawler reports all URLs as succeeded, and pages.jsonl has entries like:

```json
{
  "id": "...",
  "url": "...",
  "title": "Pywb Error"
}
```
I'm not sure why these are reported as successes?
Also, I think the sitemap is not fetched via the proxy.
The SOCKS issue can probably be fixed by replacing `requests` with `requests[socks]` in pywb/requirements.txt.
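i.e. roughly this change (a sketch - the actual requirements line may pin a version):

```diff
-requests
+requests[socks]
```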