requests-html Make pyppeteer use proxies

If you're using proxies with requests-html, everything is fine until you render a JS site. Once you render a page, pyppeteer doesn't know about these proxies and will expose your IP. This is undesired behavior when scraping with proxies.

The idea is that whenever someone passes in proxies to the session object or any method call, make pyppeteer also use these proxies. #265
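A rough sketch of what that forwarding could look like, not the project's actual implementation: it turns a requests-style proxies dict into Chromium's `--proxy-server` flag using its per-scheme `scheme=proxy` syntax. The helper name `chromium_proxy_args` is hypothetical, and the sketch assumes every proxy URL carries an explicit scheme such as `http://`. Any credentials in the URL are dropped, since Chromium does not take them in this flag.

```python
from urllib.parse import urlsplit


def chromium_proxy_args(proxies):
    """Hypothetical helper: map a requests-style proxies dict to Chromium flags.

    Assumes each value looks like "http://host:port" (scheme included).
    Credentials embedded in the URL are dropped, not forwarded.
    """
    rules = []
    for scheme in ("http", "https"):
        proxy_url = proxies.get(scheme)
        if not proxy_url:
            continue
        parsed = urlsplit(proxy_url)
        if parsed.hostname is None:
            # Not a URL we can interpret; skip it rather than guess.
            continue
        port = f":{parsed.port}" if parsed.port is not None else ""
        rules.append(f"{scheme}={parsed.scheme}://{parsed.hostname}{port}")
    if not rules:
        return []
    # No extra quotes: the value is passed as one list element, not via a shell.
    return ["--proxy-server=" + ";".join(rules)]
```

The resulting list could then be handed to the session through the `browser_args` keyword that appears later in this thread, for example `HTMLSession(browser_args=chromium_proxy_args(proxies))`.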

Feb 18 '19 14:02 oldani

This would be a good item to get fixed. Currently, when rendering, I have to stop using proxy servers.

Feb 26 '19 18:02 Bobspadger

I will take this on.

Feb 27 '19 16:02 oldani

cool thanks, I was going to take a look later but I'm not up on the whole async thing yet :)

Feb 27 '19 16:02 Bobspadger

I am in a very restrictive corporate network and have been experiencing many issues with Python and proxies since I started using requests-html. My goal is to scrape a Cisco site that returns a lot of its HTML via JS, so I have to use the render functionality.

1st (solved manually) The initial Chromium download by pyppeteer does not use proxies, so I had to download it manually and check where pyppeteer expects it to be:

python -c 'import pyppeteer; print(pyppeteer.chromium_downloader.chromiumExecutable)'

>>'win64': WindowsPath('C:/Users/XXX/AppData/Local/pyppeteer/pyppeteer/local-chromium/575458/chrome-win32/chrome.exe'

2nd (solved manually) Chromium does not accept a username and password in the --proxy-server="XXX" argument, see here

Now I am starting chromium with session = HTMLSession(browser_args=['--no-sandbox', '--proxy-pac-url="http://XXX/XXX.pac"']) while using the Proxy Auto Auth addon for chromium...

Start chrome.exe with the --proxy-pac-url="http://XXX/XXX.pac" argument, enter your credentials, and install the Proxy Auto Auth addon. Restart chrome.exe with the same arguments and check whether you can use it without any proxy auth.

3rd (not solved yet) The render function does not use my proxy:

req = session.get(url=url, proxies=proxyDict, verify=False)
req.html.render()

pyppeteer.errors.PageError: net::ERR_NAME_NOT_RESOLVED at <URL>

I would be very happy if this can be solved ...
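Regarding the 2nd point above (no credentials in --proxy-server): a minimal sketch, using only the standard library, of separating the credentials from a proxy URL so that the bare address can go into the flag and the login can be supplied some other way. The function name `split_proxy_credentials` and the example host are made up for illustration; how the credentials then reach Chromium (an addon, as described above, or something else) is outside this sketch.

```python
from urllib.parse import unquote, urlsplit


def split_proxy_credentials(proxy_url):
    """Hypothetical helper: return (bare_proxy_url, username, password).

    Assumes a URL with an explicit scheme; IPv6 literals are not handled.
    """
    parsed = urlsplit(proxy_url)
    if parsed.hostname is None:
        raise ValueError(f"cannot parse proxy URL: {proxy_url!r}")
    port = f":{parsed.port}" if parsed.port is not None else ""
    bare = f"{parsed.scheme}://{parsed.hostname}{port}"
    # Credentials in URLs are percent-encoded; decode them for later use.
    username = unquote(parsed.username) if parsed.username else None
    password = unquote(parsed.password) if parsed.password else None
    return bare, username, password
```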

Apr 11 '19 11:04 ep4devops

+1 On this being an amazing thing to get resolved.

May 03 '19 20:05 FlyingZebra1

Is there any news about this issue? Scraping behind corporate proxies is impossible right now. Is any progress planned on this? Thank you

Aug 22 '19 13:08 predicador37

Is there any news on this? I saw this commit but don't know if it is the expected patch: https://github.com/psf/requests-html/pull/396

In my opinion, the best solution would be to be able to use proxies the same way requests does (from the environment or a dict). Is that possible at this time?
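To make the "from env or dict" idea concrete, here is a small sketch of the lookup order, assuming an explicitly passed dict should win over environment variables (which, as far as I understand, is how requests behaves). It is not code from requests-html; `resolve_proxies` is a hypothetical name, and only the `http` and `https` entries are kept.

```python
from urllib.request import getproxies_environment


def resolve_proxies(explicit=None, env_proxies=None):
    """Hypothetical helper: merge an explicit proxies dict over env settings.

    env_proxies defaults to what the standard library reads from variables
    such as http_proxy / https_proxy; it is a parameter only to ease testing.
    """
    if env_proxies is None:
        env_proxies = getproxies_environment()
    # Keep only the schemes a browser proxy setting would care about here.
    merged = {k: v for k, v in env_proxies.items() if k in ("http", "https")}
    # Explicit values take precedence over the environment.
    merged.update(explicit or {})
    return merged
```

The merged dict is what would then need to be translated into whatever proxy setting the rendering browser accepts.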