scrapoxy
How can I find the proxy?
I have it working using a Docker container and AWS instances. However, it is unclear how to know which proxy instance to stop when using HTTPS. I understand that in that case the proxy cannot read request headers or change response headers. The docs therefore suggest connecting to the proxy over HTTP. How do you do that?
I tried python/requests, then Python httpclient. I also tried raw messages using ncat with the code from the documentation, shown below. However, I just get a request error from the target: "socket hangup".
GET /index.html
Host: localhost:8888
Location: https://www.google.com/index.html
Accept: text/html
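For comparison, here is a minimal Python sketch of the raw message above, sent over a plain socket the way ncat would send it. It assumes the proxy listens on localhost:8888 as in the snippet; `build_no_tunnel_request` and `send_raw` are names made up for illustration. Unlike the snippet as pasted, it writes the HTTP version on the request line and ends the headers with a blank line, both of which HTTP/1.1 requires. Nothing in this thread confirms that this form of the request avoids the "socket hangup" error.

```python
import socket
from urllib.parse import urlsplit


def build_no_tunnel_request(target_url: str,
                            proxy_host: str = "localhost",
                            proxy_port: int = 8888) -> bytes:
    """Build a plain-HTTP request to the proxy with the HTTPS URL in Location."""
    parts = urlsplit(target_url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",              # request line needs the HTTP version
        f"Host: {proxy_host}:{proxy_port}",  # Host is the proxy, as in the docs snippet
        f"Location: {target_url}",           # the real HTTPS target
        "Accept: text/html",
        "Connection: close",                 # ask the server to close when done
        "",                                  # blank line terminates the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_raw(request: bytes, host: str = "localhost", port: int = 8888,
             timeout: float = 10.0) -> bytes:
    """Send the raw bytes over TCP and read until the peer closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling `send_raw(build_no_tunnel_request("https://www.google.com/index.html"))` against a running proxy returns the raw response, headers included, so any proxy-name header the proxy adds would be visible there.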
I'm also encountering similar issues. I understand that the documentation outlines the different modes that can be used to have Scrapoxy work as an HTTPS proxy. I've been able to get it to work fine using the HTTP CONNECT method (modes A and B), but not with mode C (the "no tunnel" mode).
The issue with using HTTP CONNECT is that:
- Using MITM (mode A) triggers SSL alerts. This effectively makes that mode useless for a large number of websites.
- Using CONNECT without MITM (mode B) makes it impossible to override the response headers from the target server to include the proxy name. This makes managing the proxies difficult: for example, if I get a response from the target server that leads me to believe the proxy's IP has been banned, I want to kill that particular node in the cluster; however, without MITM, it's impossible for me to know which node needs to be killed.
This means that mode A fails for a large number of websites but makes cluster management possible, and mode B works everywhere but makes cluster management impossible. It seems to me, then, that the only practical solution is to use a solution without tunnelling, but I can't seem to get this to work with python requests and there's little documentation for mode C. I'm open to rolling my own custom solution where I manage the sockets myself, but I'd rather not reinvent the wheel unless absolutely necessary.
I did do some digging into scrapy to see where that ?noconnect URL param is being processed, and it seems like it eventually feeds into twisted. Is there something necessarily asynchronous about mode C that requires something like twisted?
I'll continue digging, but ideally I'd be able to get this working out of the box with python requests.
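To make the cluster-management step concrete, here is a rough Python sketch of "read the proxy name from the response, then ask Scrapoxy to stop that node". Everything specific in it is an assumption to check against the docs for your version: the header name `x-cache-proxyname`, the commander endpoint `/api/instances/stop` on port 8889, the base64-encoded password in `Authorization`, and the choice of 403/429 as "banned" signals. The helper names are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Mapping, Optional

PROXY_NAME_HEADER = "x-cache-proxyname"  # assumption: header added by the proxy
BAN_STATUSES = {403, 429}                # assumption: statuses treated as a ban


def banned_proxy_name(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the proxy name if the response looks like a ban, else None."""
    if status not in BAN_STATUSES:
        return None
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get(PROXY_NAME_HEADER)


def build_stop_request(name: str,
                       commander_url: str = "http://localhost:8889/api",
                       password: str = "CHANGE_ME") -> urllib.request.Request:
    """Build (but do not send) the request asking the commander to stop a node."""
    token = base64.b64encode(password.encode("utf-8")).decode("ascii")
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{commander_url}/instances/stop",  # assumption: stop endpoint
        data=body,
        method="POST",
        headers={"Authorization": token, "Content-Type": "application/json"},
    )
```

In use, you would call `banned_proxy_name(response_status, response_headers)` after each scrape and, when it returns a name, pass `build_stop_request(name)` to `urllib.request.urlopen`. This only works in modes where the proxy can add the header, which is exactly the limitation described above for mode B.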
The simplest answer is to speak plain HTTP to the proxy manager, which then handles the HTTPS request and response.
It looks like Scrapoxy is no longer maintained. I ended up writing my own: https://github.com/simonm3/mproxy.
On Sun, 26 Apr 2020 at 16:03, Josh Baiad wrote:
I'm also encountering similar issues. I understand that the documentation (https://scrapoxy.readthedocs.io/en/master/advanced/understand/index.html#can-scrapoxy-relay-https-requests) outlines the different modes that can be used to have Scrapoxy work as an HTTPS proxy. I've been able to get it to work using the HTTP CONNECT method fine (modes A and B), but not with mode C (the "no tunnel" mode).
@jbaiad How did you get mode A to work?
All modes are supported in Scrapoxy 4.0.0:
- HTTP over HTTP
- HTTPS over HTTP
- HTTPS over HTTP (without MITM)
- HTTPS over HTTP (with MITM)
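As a client-side illustration of the tunnelled "HTTPS over HTTP" case, here is a minimal sketch using only the Python standard library. The proxy address is a placeholder, `build_proxy_opener` is a hypothetical helper, and any credentials or settings a Scrapoxy 4 project requires are not shown. For `https://` URLs, `urllib` sends a CONNECT to the proxy and then runs TLS through the tunnel, so without MITM the proxy only relays encrypted bytes.

```python
import urllib.request


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes both http and https URLs via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder address: replace with wherever your proxy actually listens.
opener = build_proxy_opener("http://localhost:8888")
# opener.open("https://example.com/") would then go through the proxy.
```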
Hey there! Exciting news! Scrapoxy 4 is ready to rock. Check it out at Scrapoxy.io (explore the "get started" guide, deployment documentation, and more). I can't wait to hear your feedback on this new version! Send me your coolest screenshots with as many proxies as possible! Join the Discord community if you have any questions or just want to chat. You can also open a GitHub issue for any bug or feature request. See you soon! Fabien