HTTrack Ignores Proxies
Operating system: Windows 7 x64
HTTrack architecture: Win32
HTTrack version: 3.48-21
RAM: 4 GB
CPU: Pentium Dual Core T4200 @ 2 GHz
I am trying to download a large forum, hosted only over HTTPS, that holds a great deal of useful information and is about to be shut down in a few days, so I am using HTTrack to crawl it onto my computer. However, I have run into problems getting HTTrack to crawl through a proxy server. I need the proxy to get around the server's rate limit of 0.08 requests per second (about 4.8 requests per minute); if you go higher than this you get banned for 6 hours (!!).

To work around the limit I configured MultiProxy (desktop software that manages multiple HTTP/HTTPS and SOCKS proxies and picks one at random for each request) with over 50 working, HTTPS-capable private proxies (not the freely available kind) that I had tested beforehand. I set MultiProxy to handle 32 simultaneous connections (not that I planned to use them all at once, since HTTrack already has a hard limit of 8 connections), checked the box "choose random proxies", and checked the box "use any working proxies from the list regardless of anonymity level". I left the default IP and port alone since I found no compelling reason to change them.

I then tested my configuration by pointing Firefox at 127.0.0.1:8088 as its proxy and going to whatismyipaddress.com. Sure enough, I saw my proxy's IP address. The MultiProxy status window also showed, in real time, the proxy's IP address and the transfer rate until the page had fully loaded, and then the status cleared itself. If I opened three tabs and quickly went to random websites, MultiProxy showed three separate lines of activity, each with a different proxy chosen at random from my pool of 50 proxy servers.
If you keep refreshing whatismyipaddress.com, each refresh shows a new IP address, which again indicates that MultiProxy is working and that there is nothing wrong with my proxy configuration.
Why did I describe my MultiProxy setup and show that it works? Because before I start talking about HTTrack ignoring its proxy settings, I wanted to give as much information as possible to show that the problem isn't being caused by MultiProxy or my configuration.
Now, the problem:
If you configure HTTrack to use a proxy server (in my case 127.0.0.1:8088, which is MultiProxy) and start your project, HTTrack still connects using your real IP address and the proxy server isn't used. It is as if you had never entered it in the first place. MultiProxy's real-time status window shows zero activity, which is further proof that the proxy is not being used at all.
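One way to double-check this independently of MultiProxy's status window would be to put a tiny probe on the proxy port and see whether the client connects to it at all. The following is only a sketch under assumptions: `start_probe` is a hypothetical helper, not part of HTTrack or MultiProxy, and it does not forward any traffic. It just records the first request line of each incoming connection and answers with an error.

```python
import socket
import threading


def start_probe(host="127.0.0.1", port=0, max_connections=1):
    """Listen on host:port and record the first line each client sends.

    Hypothetical diagnostic helper. Returns (bound_port, seen, thread);
    the `seen` list fills in as clients connect.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))  # port=0 lets the OS pick a free port
    server.listen()
    bound_port = server.getsockname()[1]
    seen = []

    def serve():
        with server:
            for _ in range(max_connections):
                conn, _addr = server.accept()
                with conn:
                    conn.settimeout(5.0)
                    try:
                        data = conn.recv(4096)
                    except socket.timeout:
                        data = b""
                    # Keep only the request line, e.g. "CONNECT host:443 HTTP/1.1".
                    seen.append(data.split(b"\r\n", 1)[0].decode("latin-1"))
                    try:
                        # This probe forwards nothing, so refuse the request.
                        conn.sendall(
                            b"HTTP/1.1 502 Bad Gateway\r\n"
                            b"Content-Length: 0\r\n"
                            b"Connection: close\r\n\r\n"
                        )
                    except OSError:
                        pass  # client already hung up

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return bound_port, seen, thread
```

With MultiProxy stopped, calling `start_probe(port=8088)` before starting the HTTrack project should leave `seen` empty if HTTrack really bypasses the proxy, and should record a request line if it does try to use it.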
Yesterday I spent from 8 PM through 10 AM trying different things and googling the problem, to no avail, so I believe it must be a bug in HTTrack.
My theory: I think HTTrack's proxy setting only works for HTTP websites, and for HTTPS websites it bypasses your proxy server and uses your real IP address.
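For context on why HTTP and HTTPS can behave differently here: a client asks an HTTP proxy for a plain HTTP page with an absolute-URL `GET`, but for an HTTPS page it first has to ask the proxy for a tunnel with `CONNECT host:443`. A client that implements only the first form has no way to send HTTPS through the proxy. The sketch below illustrates the two request lines; `proxy_request_line` is a hypothetical helper written for this illustration, and it says nothing about how HTTrack is implemented internally.

```python
from urllib.parse import urlsplit


def proxy_request_line(url):
    """Return the first line a client would send to an HTTP proxy for `url`."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        # HTTPS through a proxy: ask for a raw tunnel, then do TLS inside it.
        port = parts.port or 443
        return f"CONNECT {parts.hostname}:{port} HTTP/1.1"
    if parts.scheme == "http":
        # Plain HTTP through a proxy: send the full URL in the request line.
        return f"GET {url} HTTP/1.1"
    raise ValueError(f"unsupported scheme: {parts.scheme!r}")


print(proxy_request_line("http://www.website.org/"))
print(proxy_request_line("https://www.website.org/"))
```

If the theory is right, the proxy would see the first kind of request from HTTrack but never the second.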
I have tried everything, and HTTrack absolutely refuses to use my proxy server when downloading from an HTTPS website.
I hope this bug gets fixed so that I can use my proxy server to download from HTTPS websites much more quickly.
Right now I have HTTrack limited to the abysmal 0.08 rate with 1 connection at a time. It has been running for over 20 hours at this speed and I have saved only a little over 4,500 links (800 MB), and I don't know how much longer I have to go, since this is a very large website and it may go down any day. I also gave wget a try. My proxies do work with it, and I saw the activity in the MultiProxy status window, but wget has no built-in multithreading, so you can only run 1 thread at a time. After 24 hours it had downloaded only 241 MB, roughly a quarter of the rate HTTrack managed in just 20 hours, so I cancelled wget and left HTTrack running at the 0.08 rate with 1 connection at a time.
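The figures above can be put side by side with a little arithmetic. This is only a back-of-the-envelope sketch using the numbers quoted in the post; treating the 0.08 limit as requests per second is an assumption, and the function names are made up for the illustration.

```python
# Rate limit quoted in the post; the per-second unit is an assumption.
RATE_LIMIT_PER_SECOND = 0.08


def requests_per_minute(rate_per_second):
    return rate_per_second * 60


def megabytes_per_hour(megabytes, hours):
    return megabytes / hours


def hours_to_fetch(links, rate_per_second):
    """Lower bound on crawl time for `links` requests at a fixed request rate."""
    return links / rate_per_second / 3600


httrack_rate = megabytes_per_hour(800, 20)  # HTTrack: 800 MB in 20 hours
wget_rate = megabytes_per_hour(241, 24)     # wget: 241 MB in 24 hours

print("limit, requests/minute:", round(requests_per_minute(RATE_LIMIT_PER_SECOND), 1))
print("HTTrack MB/hour:", httrack_rate)
print("wget MB/hour:", round(wget_rate, 1))
print("wget relative to HTTrack:", round(wget_rate / httrack_rate, 2))
print("minimum hours for 4,500 links:", hours_to_fetch(4500, RATE_LIMIT_PER_SECOND))
```

Under these assumptions HTTrack averaged 40 MB per hour against roughly 10 MB per hour for wget, and 4,500 requests at the rate limit need a bit over 15 and a half hours at the very least, which is in line with the 20 hours observed.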
Edit: Another thing I did in my desperate attempt to get HTTrack to actually USE my proxy server for the HTTPS website was to add HTTrack to FreeCap and configure FreeCap to use MultiProxy. However, HTTrack will not load under FreeCap; it crashes, so that rules out the FreeCap solution.
SO IN SHORT: YOU CAN CONFIGURE HTTRACK TO USE ANY PROXY SERVER, AND HTTRACK WILL IGNORE THE PROXY SERVER AND JUST USE YOUR COMPUTER'S REAL IP ADDRESS, WHICH WILL GET YOUR IP ADDRESS BANNED FROM THE WEB SERVER YOU ARE ATTEMPTING TO CRAWL IF YOU GO TOO FAST THINKING THAT YOU WERE USING YOUR PROXIES.
+1
@NatureCoin - did you ever find a workaround? Or perhaps was it your configuration and not a problem with HTTrack itself?
Might these be of use?
- https://www.httrack.com/html/step9_opt7.html
- https://forum.httrack.com/readmsg/16717/16681/index.html
Haven't read them myself yet, but will do soon.
I can confirm that this issue still exists in version 3.48.22. I downloaded the source and cross-compiled it for a Linux variant.
I see weird behaviour with https:// websites on the 3.49 Linux version as well.
If I run
    httrack https://www.website.org --proxy localhost:3128 -O "/path/to/websites/" -%v
httrack simply ignores the proxy setting (there is a working HTTP proxy on localhost:3128, tested with a browser).
If I run
    httrack www.website.org --proxy localhost:3128 -O "/path/to/websites/" -%v
something apparently gets downloaded, but when I open index.html at the download location, it just shows "Click here..." at the top and never does anything, whether I click it or not.
I am using version 3.49-2 and the issue still exists. Just as @NatureCoin said, the proxy works only for HTTP and not for HTTPS. Issue #179 may be the same problem.