Can't use proxy tools like Scrapoxy

Open • Monkleys opened this issue 5 years ago • 2 comments

Scrapoxy accepts requests over plain HTTP, but the HTTPS URL must be sent in the Location header. Source: http://docs.scrapoxy.io/en/master/advanced/understand/index.html#can-scrapoxy-relay-https-requests

go-colly doesn't seem to support this. If the URL is HTTPS and the only proxy available is HTTP, go-colly seems to skip the proxy entirely and connect without it. I've tested it, and it works perfectly when the website is plain HTTP.

Would there be any way to get around this?

Monkleys, May 26 '19 19:05

Have you tried SOCKS4/5 to solve this problem?

alaaelgndy, Aug 30 '21 19:08

If you are interested, Scrapoxy 4 is out:

Scrapoxy is an open-source proxy aggregator that lets you manage all your proxies in one place 🎯, rather than spreading them across multiple scrapers 🕸️.

Smartly designed for efficient traffic routing 🔀, Scrapoxy minimizes bans and boosts success rates 🚀.

The tech stack is built on the latest Node.js and TypeScript, using the NestJS and Angular frameworks.

Here are the key features:

  • ☁️ Cloud Providers with easy installation: Scrapoxy supports many cloud providers like AWS, Azure, or GCP.
  • 🌐 Proxy Services: Scrapoxy supports many proxy services like Rayobyte, IPRoyal or Zyte.
  • 💻 Hardware materials: Scrapoxy supports many 4G proxy farm hardware types, like Proxidize or XProxy.io.
  • 📜 Free Proxy Lists: Scrapoxy supports lists of HTTP/HTTPS proxies and SOCKS4/SOCKS5 proxies.
  • ⏰ Timeout free: Scrapoxy only routes traffic to online proxies to avoid inactive connections.
  • 🔄 Auto-Rotate proxies: Scrapoxy automatically changes IP addresses at regular intervals.
  • 🏃 Auto-Scale proxies: Scrapoxy monitors incoming traffic and automatically scales the number of proxies according to your needs.
  • 🍪 Sticky sessions on Browser: Scrapoxy keeps the same IP address for a scraping session, even for browsers.
  • 🚨 Ban management: Scrapoxy injects the name of the proxy into the HTTP responses.
  • 📡 Traffic interception: Scrapoxy intercepts HTTP requests/responses to modify headers, keeping consistency in your scraping stack. It can add session cookies or specific headers like user-agent.
  • 📊 Traffic monitoring: Scrapoxy measures incoming and outgoing traffic to provide an overview of your scraping session.
  • 🌍 Coverage monitoring: Scrapoxy displays the geographic coverage of your proxies to better understand their global distribution.
  • 🚀 Easy-to-use and production-ready: Scrapoxy is suitable for both beginners and experts (Kubernetes / Helm).
  • 🔓 Open Source: And of course, Scrapoxy is open source, under the MIT license.

Check out https://scrapoxy.io/!

fabienvauchelles, Jan 31 '24 10:01