scrapy-rotating-proxies

Read proxy list from a URL

Open datawookie opened this issue 3 years ago • 4 comments

Hi!

We build a lot of web scrapers using Scrapy and I've been using your package for a while now. It's great for managing our multi-proxy setup.

We have been developing a proxy system that shares the proxy list via a URL. I have been dumping the contents of that URL to a file so that I can read it in via ROTATING_PROXY_LIST_PATH, but this has become a bit of a pain. It occurred to me that it should be possible to read the proxy list directly from a URL.

The merge request includes a simple change to the RotatingProxyMiddleware.from_crawler() method to make that possible.

Example: Sharing proxy list at http://127.0.0.1:8800.

In settings.py I then have:

ROTATING_PROXY_LIST_PATH = 'http://127.0.0.1:8800'
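For readers following along, the idea can be sketched independently of the middleware. This is not the code from the merge request; it is a minimal illustration, assuming the setting value is treated as a URL when its scheme is http or https and as a local file path otherwise. The helper names `load_proxy_list` and `parse_proxy_lines` are hypothetical, and the one-proxy-per-line format is assumed to match what ROTATING_PROXY_LIST_PATH already reads from a file.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def parse_proxy_lines(text):
    """Return the non-empty, stripped lines of a proxy list (one proxy per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_proxy_list(path_or_url, timeout=10):
    """Read a proxy list from a local file or, for http(s) values, from a URL.

    Hypothetical helper: the scheme check is an assumption about how a
    path-or-URL setting could be interpreted.
    """
    if urlparse(path_or_url).scheme in ("http", "https"):
        # Remote list: fetch the body and decode it as UTF-8 text.
        with urlopen(path_or_url, timeout=timeout) as response:
            text = response.read().decode("utf8")
    else:
        # Anything else is treated as a local file path.
        with open(path_or_url, encoding="utf8") as f:
            text = f.read()
    return parse_proxy_lines(text)
```

With something like this in place, `load_proxy_list('http://127.0.0.1:8800')` and `load_proxy_list('proxies.txt')` would return the same kind of list, so the rest of the setup would not need to care where the proxies came from.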

For context, here's a blog post about the proxy system that we are using in conjunction with scrapy-rotating-proxies: https://datawookie.dev/blog/2021/10/medusa-multi-headed-tor-proxy/

Best regards, Andrew.

datawookie, Oct 01 '21 01:10

The link to your blog post should be: https://datawookie.dev/blog/2021/10/medusa-multi-headed-tor-proxy/ (instead of pointing to localhost) ;) Great work btw!

kaybeudeker, Nov 28 '21 09:11

Thanks, @kaybeudeker, I've updated the URL. Appreciate you bringing that to my attention.

Have you tried this out? I'd really appreciate any feedback.

datawookie, Nov 28 '21 15:11

I had a similar use case: reading proxies from a URL (specifically, an API call to a third party that returns a list of proxies, exactly like yours). I created a small utility function that uses requests.get to fetch the proxies, and I assign the result to ROTATING_PROXY_LIST in settings.py.

utility function:

```python
import random
from typing import List

import requests


def get_proxies(proxy_json_end_point: str) -> List[str]:
    # The endpoint is expected to return a JSON array of "host:port;user;pwd" strings.
    r = requests.get(proxy_json_end_point)
    proxies = r.json()

    proxy_urls = [
        f"http://{user}:{pwd}@{host_port}"
        for (host_port, user, pwd) in [p.split(";") for p in proxies]
    ]
    random.shuffle(proxy_urls)
    print("Proxies:", proxy_urls)
    return proxy_urls
```

settings.py

ROTATING_PROXY_LIST = get_proxies(os.getenv("PROXY_JSON_ENDPOINT"))

Note: the PROXY_JSON_ENDPOINT environment variable points to the third party's API endpoint, which returns the proxies. I have used a similar approach to fetch proxies listed in a text file hosted on S3.
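To make the expected input concrete: the list comprehension in the utility function implies that each entry is a string of the form host:port;user;pwd. The sketch below isolates that conversion so it can be checked without a network call. The function name `to_proxy_urls` is hypothetical, the sample entries are made up, and the entry format is inferred from the `split(";")` in the utility function rather than from any documented API.

```python
from typing import Iterable, List


def to_proxy_urls(entries: Iterable[str]) -> List[str]:
    """Convert "host:port;user;pwd" entries into proxy URLs (assumed format)."""
    proxy_urls = []
    for entry in entries:
        parts = entry.strip().split(";")
        if len(parts) != 3:
            # Fail loudly instead of silently building a broken URL.
            raise ValueError(f"expected 'host:port;user;pwd', got {entry!r}")
        host_port, user, pwd = parts
        proxy_urls.append(f"http://{user}:{pwd}@{host_port}")
    return proxy_urls


# Made-up sample entries, for illustration only.
sample = ["10.0.0.1:8080;alice;secret", "10.0.0.2:3128;bob;hunter2"]
print(to_proxy_urls(sample))
```

Keeping the conversion separate from the fetch also means the same function could be reused for other sources, such as lines read from a text file, as long as they follow the same entry format.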

SashiDareddy, Feb 20 '22 19:02

Hi @TeamHG-Memex, any progress on this? This PR has been languishing for a few months now. Thanks, Andrew.

datawookie, Feb 21 '22 05:02