colly
Can't use proxy tools like Scrapoxy
Scrapoxy accepts requests over HTTP, but the HTTPS URL must be in the Location header (source: http://docs.scrapoxy.io/en/master/advanced/understand/index.html#can-scrapoxy-relay-https-requests).
go-colly doesn't seem to support this: if the URL is HTTPS and the only proxy available is HTTP, go-colly seems to skip the proxy entirely and connect directly. I've tested it, and it works perfectly when the website is plain HTTP.
Would there be any way to get around this?
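For reference, here is a rough sketch of what the documented relay mode seems to require, using only Go's standard library rather than colly: the request goes to the proxy over plain HTTP and the real HTTPS target is carried in the Location header. The helper name `buildRelayRequest`, the proxy address `localhost:8888`, and the target URL are placeholders for illustration, and whether the request path should mirror the target path is an assumption on my part.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// buildRelayRequest is a hypothetical helper. It addresses the request to the
// proxy itself over plain HTTP and puts the real HTTPS target in the Location
// header, which is how the linked documentation describes the relay mode.
func buildRelayRequest(proxyAddr, target string) (*http.Request, error) {
	u, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "https" {
		return nil, fmt.Errorf("expected an https target, got %q", u.Scheme)
	}
	// Assumption: the path and query of the target are reused on the proxy URL.
	req, err := http.NewRequest(http.MethodGet, "http://"+proxyAddr+u.RequestURI(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Location", target)
	return req, nil
}

func main() {
	// Placeholder proxy address and target; nothing is sent over the network here.
	req, err := buildRelayRequest("localhost:8888", "https://example.com/index.html")
	if err != nil {
		panic(err)
	}
	fmt.Println("request URL:    ", req.URL.String())
	fmt.Println("Location header:", req.Header.Get("Location"))
}
```

This only builds the request so the shape is visible. Getting colly to emit such requests would probably need a custom `http.RoundTripper` that rewrites outgoing requests this way, which I have not tried.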
Have you tried SOCKS4/5 to solve this problem?
If you are interested, Scrapoxy 4 is out:
Scrapoxy is an open source proxy aggregator that lets you manage all your proxies in one place, rather than spreading them across multiple scrapers.
Designed for efficient traffic routing, Scrapoxy minimizes bans and boosts success rates.
The tech stack is built on the latest Node.js and TypeScript, using the NestJS and Angular frameworks.
Here are the key features:
- Cloud Providers with easy installation: Scrapoxy supports many cloud providers like AWS, Azure, or GCP.
- Proxy Services: Scrapoxy supports many proxy services like Rayobyte, IPRoyal or Zyte.
- Hardware: Scrapoxy supports many types of 4G proxy farm hardware, like Proxidize or XProxy.io.
- Free Proxy Lists: Scrapoxy supports lists of HTTP/HTTPS proxies and SOCKS4/SOCKS5 proxies.
- Timeout free: Scrapoxy only routes traffic to online proxies to avoid inactive connections.
- Auto-Rotate proxies: Scrapoxy automatically changes IP addresses at regular intervals.
- Auto-Scale proxies: Scrapoxy monitors incoming traffic and automatically scales the number of proxies according to your needs.
- Sticky sessions on Browser: Scrapoxy keeps the same IP address for a scraping session, even for browsers.
- Ban management: Scrapoxy injects the name of the proxy into the HTTP responses.
- Traffic interception: Scrapoxy intercepts HTTP requests/responses to modify headers, keeping consistency in your scraping stack. It can add session cookies or specific headers like user-agent.
- Traffic monitoring: Scrapoxy measures incoming and outgoing traffic to provide an overview of your scraping session.
- Coverage monitoring: Scrapoxy displays the geographic coverage of your proxies so you can see how they are distributed globally.
- Easy-to-use and production-ready: Scrapoxy is suitable for both beginners and experts (Kubernetes / Helm).
- Open Source: And of course, Scrapoxy is open source, under the MIT license.
Check out https://scrapoxy.io/