Proxy support for Trafilatura

Currently, I'm having trouble accessing some websites and I believe that using a proxy might help solve this issue.

If there is no native proxy support in Trafilatura (I didn't find any in the docs), I would like to suggest adding this functionality in a future version.

Apr 23 '23 00:04 andremacola

Apparently urllib3 has chosen not to read proxy environment variables: https://github.com/urllib3/urllib3/issues/1785
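Since urllib3 does not pick these variables up itself, the caller has to read them and pass the proxy URL explicitly. Below is a minimal sketch of that idea, not Trafilatura code: `proxy_from_env` and `make_pool` are hypothetical helper names, and the lookup relies on the standard library's `urllib.request.getproxies()`.

```python
from urllib.request import getproxies


def proxy_from_env(scheme="https"):
    """Return the proxy URL configured for `scheme`, or None if there is none."""
    # getproxies() collects variables named <scheme>_proxy (e.g. https_proxy
    # or HTTPS_PROXY) and returns them as a {scheme: url} mapping.
    return getproxies().get(scheme)


def make_pool(scheme="https"):
    """Build a urllib3 manager, routed through the configured proxy if any."""
    import urllib3  # third-party; imported lazily so the lookup above is stdlib-only

    proxy_url = proxy_from_env(scheme)
    if proxy_url:
        return urllib3.ProxyManager(proxy_url)
    return urllib3.PoolManager()
```

Note that on macOS and Windows `getproxies()` may fall back to system proxy settings when no environment variable is set, which may or may not be what is wanted here.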

Apr 23 '23 19:04 andremacola

In https://github.com/adbar/trafilatura/blob/82043f7e84d256571cc1861249e27103193508ba/trafilatura/downloads.py#L104 we could use something like:

if use_proxy:
    HTTP_POOL = urllib3.ProxyManager(retries=RETRY_STRATEGY, timeout=config.getint('DEFAULT', 'DOWNLOAD_TIMEOUT'), ca_certs=certifi.where(), num_pools=NUM_CONNECTIONS, proxy_url=PROXY_HOST, proxy_headers=PROXY_HEADERS)
else:
    HTTP_POOL = urllib3.PoolManager(retries=RETRY_STRATEGY, timeout=config.getint('DEFAULT', 'DOWNLOAD_TIMEOUT'), ca_certs=certifi.where(), num_pools=NUM_CONNECTIONS)

The PROXY_* variables could come from the config, from OS environment variables, or from parameters to trafilatura.fetch_url() if the user wants to control which proxy is used, for example to rotate proxies at random.

The behavior of ProxyManager is the same as PoolManager: https://urllib3.readthedocs.io/en/stable/reference/urllib3.poolmanager.html#urllib3.ProxyManager

What do you think?
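To make the question of where the proxy settings come from more concrete, here is a rough sketch of one possible precedence: an explicit argument first, then the environment, then the config file. Everything in it is an assumption for illustration: `resolve_proxy` is a hypothetical helper, and `PROXY_URL` is not an existing Trafilatura option.

```python
import os
import random
from configparser import ConfigParser


def resolve_proxy(config, proxies=None, environ=os.environ):
    """Pick a proxy URL: explicit argument, then environment, then config."""
    if proxies:
        # A caller-supplied list allows simple random rotation.
        return random.choice(list(proxies))
    env_value = environ.get("https_proxy") or environ.get("HTTPS_PROXY")
    if env_value:
        return env_value
    # PROXY_URL is a hypothetical option name, used here for illustration only.
    return config.get("DEFAULT", "PROXY_URL", fallback=None) or None


# Usage: the config value is only used when nothing else is given.
config = ConfigParser()
config["DEFAULT"]["PROXY_URL"] = "http://proxy.example:8080"
chosen = resolve_proxy(config, environ={})
```

The result would then decide between ProxyManager and PoolManager: a URL means a proxy is used, None means a direct connection.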

Apr 23 '23 20:04 andremacola

Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives and downloading at scale is a different challenge altogether. A worst-case solution would be to use other software for downloads and to process the resulting HTML files with Trafilatura.
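As an illustration of that fallback, here is a minimal sketch that downloads through a proxy with the Python standard library and only hands the HTML to Trafilatura afterwards. The helper names are hypothetical, and any other HTTP client with proxy support would work just as well.

```python
import urllib.request


def build_opener_with_proxy(proxy_url):
    """Return a stdlib opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_html(url, proxy_url, timeout=30):
    """Download a page through the proxy and return it as text."""
    opener = build_opener_with_proxy(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_via_proxy(url, proxy_url):
    """Fetch with the stdlib, then let Trafilatura extract the main text."""
    import trafilatura  # imported lazily so the download helpers work without it

    return trafilatura.extract(fetch_html(url, proxy_url))
```

This keeps the download step entirely outside Trafilatura, so no change to its own download utilities is needed.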

That being said, if you can find an easy way to perform HTTP requests with a proxy, then it could be an interesting additional feature.

Apr 24 '23 12:04 adbar

> Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives and downloading at scale is a different challenge altogether. A worst-case solution would be to use other software for downloads and to process the resulting HTML files with Trafilatura.

Second this. There are tons of efficient ways for downloading. Trafilatura should stay focused on its main task: extraction. It's not a good idea to make it more bloated with unnecessary features. Just use scraping tools for scraping. Trafilatura should not be a Swiss army knife tool.

Jun 25 '23 08:06 fortyfourforty