
DO NOT USE unless you have a means of rate limiting yourself

Open jdimpson opened this issue 1 year ago • 12 comments

The Wayback Machine is (rightfully) blocking bulk downloads that use too much bandwidth or make too many requests per second. As far as I can tell, this tool does no rate limiting of its own, at least not by default and not in any of the examples in the README. As a result, the Internet Archive will soft-ban your IP address if you use this script on a website of any significant size.

It's irresponsible to leave this repository up without at least a warning in the documentation.

jdimpson avatar Mar 07 '24 04:03 jdimpson
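
For readers wondering what "rate limiting yourself" could look like in practice, here is a minimal sketch of a client-side throttle that enforces a minimum gap between requests. The class name MinIntervalThrottle and its parameters are hypothetical and are not part of wayback_machine_downloader or any fork mentioned in this thread. The right interval is also an assumption, since this thread does not state the Internet Archive's actual limits.

```ruby
# Hypothetical helper, not part of wayback_machine_downloader.
# Enforces a minimum number of seconds between consecutive requests.
class MinIntervalThrottle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock       # injectable so the timing logic can be tested
    @sleeper = sleeper   # injectable for the same reason
    @last_request_at = nil
  end

  # Waits until at least min_interval seconds have passed since the
  # previous call, then runs the block and returns its value.
  def throttle
    unless @last_request_at.nil?
      wait = @min_interval - (@clock.call - @last_request_at)
      @sleeper.call(wait) if wait > 0
    end
    @last_request_at = @clock.call
    yield
  end
end

# Usage sketch (the 1.0 second interval is an assumed value):
#   limiter = MinIntervalThrottle.new(1.0)
#   urls.each { |url| limiter.throttle { fetch(url) } }
```

The first call goes through immediately; every later call sleeps only for whatever part of the interval has not already elapsed.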

See ShiftaDeband's fork (which contains the fixes mentioned in his PR) as well as issues #273 and #275.

tinyapps avatar Mar 07 '24 23:03 tinyapps

Sorry to bother, I'm pretty new to this. How can I actually use this fork instead of the master branch?

Elmagenta avatar Apr 05 '24 20:04 Elmagenta

@Elmagenta: You'll need to have Ruby installed. Then you can download ShiftaDeband's fork as a ZIP file, unzip it, and run wayback_machine_downloader, which you'll find in the bin subdirectory.

tinyapps avatar Apr 06 '24 04:04 tinyapps

@tinyapps I'm also pretty new to this, and I couldn't follow your instructions. I have Ruby installed, and I had also installed the "original" wayback_machine_downloader via the macOS Terminal. Following your instructions, I downloaded the ZIP file and simply tried to run the file in the bin directory, but I get this error message:

/Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/bin/wayback_machine_downloader:3:in `require_relative': cannot load such file -- /Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/lib/wayback_machine_downloader (LoadError)
	from /Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/bin/wayback_machine_downloader:3:in `<main>'

Could you give more details on how to proceed?

flag-br avatar Apr 07 '24 21:04 flag-br

@flag-br: Sounds like you might've deleted (or not extracted) the included lib directory or its contents. After unzipping wayback-machine-downloader-feature-httpGet.zip, just cd into the bin subdirectory and run wayback_machine_downloader without deleting any of the other included files or folders. The directory structure should look like this:

.
├── Dockerfile
├── Gemfile
├── MIT-LICENSE.txt
├── README.md
├── Rakefile
├── bin
│   └── wayback_machine_downloader
├── lib
│   ├── wayback_machine_downloader
│   │   ├── archive_api.rb
│   │   ├── tidy_bytes.rb
│   │   └── to_regex.rb
│   └── wayback_machine_downloader.rb
├── test
│   └── test_wayback_machine_downloader.rb
└── wayback_machine_downloader.gemspec

tinyapps avatar Apr 07 '24 21:04 tinyapps
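
The LoadError quoted earlier comes from line 3 of bin/wayback_machine_downloader, where require_relative looks for lib/wayback_machine_downloader relative to the script's own location. A quick way to confirm that nothing went missing while unzipping is to check for the files in the listing above. This is only a sketch: the constant and function names are hypothetical, and the file list is copied from the listing rather than from the fork's source.

```ruby
# Hypothetical check, not part of wayback_machine_downloader.
# Paths are relative to the unzipped project root and are taken from the
# directory listing in this thread.
REQUIRED_FILES = [
  'bin/wayback_machine_downloader',
  'lib/wayback_machine_downloader.rb',
  'lib/wayback_machine_downloader/archive_api.rb',
  'lib/wayback_machine_downloader/tidy_bytes.rb',
  'lib/wayback_machine_downloader/to_regex.rb'
].freeze

# Returns the required paths that do not exist as regular files
# under project_root (an empty array means the layout looks complete).
def missing_files(project_root)
  REQUIRED_FILES.reject { |relative| File.file?(File.join(project_root, relative)) }
end

# Usage sketch:
#   puts missing_files('/path/to/unzipped/folder')
```

If the result includes anything under lib, the launcher in bin will fail with the same "cannot load such file" error.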

@tinyapps Thank you very much, it worked! It ran normally, but the final product is practically the same as what I was getting before with the master branch version. The folder structure apparently reproduced correctly on my machine, but only 15 htm files were downloaded. To check, I ran wayback_machine_downloader with the --list option, and the answer is that there are 1116 htm files.

The command I'm using is (after cd to bin folder): wayback_machine_downloader https://jazzdiscogcorner.pagesperso-orange.fr/

This site is quite simple, just text and practically no images.

Am I doing something wrong?

flag-br avatar Apr 08 '24 01:04 flag-br

@flag-br: Glad to hear it worked out. As for issues with a specific site, I'd recommend checking out the documentation and searching through the open and closed issues before posting a new issue.

tinyapps avatar Apr 08 '24 18:04 tinyapps

I'm probably being stupid here, but trying to run wayback_machine_downloader (file type shown as "File") in the bin directory on a fresh Ruby install gave me "not recognized as an internal or external command, operable program or batch file".

I had to run gem build wayback_machine_downloader.gemspec, then gem install the wayback_machine_downloader-2.3.2.gem that was generated, and only then could I run wayback_machine_downloader from cmd successfully. Any advice on what I was doing wrong?

eggplantedd avatar Apr 29 '24 12:04 eggplantedd
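
One likely explanation for the cmd error above: the file in bin has no extension, and cmd.exe will not run an extensionless script directly, so it has to be handed to the Ruby interpreter explicitly. Installing the built gem probably worked because RubyGems sets up a launcher on the PATH. As a sketch only, the same thing can be done from Ruby itself. The function name launcher_command is hypothetical, and the URL in the usage comment is a placeholder.

```ruby
require 'rbconfig'

# Hypothetical helper, not part of wayback_machine_downloader.
# Builds the argument list for running a Ruby script through the
# interpreter that is executing this code, instead of relying on the
# shell to recognise the script as a program.
def launcher_command(script_path, *args)
  [RbConfig.ruby, script_path, *args]
end

# Usage sketch (run from the unzipped project root):
#   system(*launcher_command('bin/wayback_machine_downloader', 'https://example.com/'))
```

From a command prompt, the equivalent is to type ruby followed by the script path and its arguments.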

It would be great to have rate limiting added to this software. Without it, archive.org is (rightfully) returning "Connection refused" errors.

P.S. It is good that there is a fork with fixes. Just wishing that the main repo of this software had those fixes too.

CaptSolo avatar May 12 '24 21:05 CaptSolo

This patched version worked beautifully ...

For those who are on Windows and are not sure how to do it, first install the original gem:

gem install wayback_machine_downloader

Then replace the bin and lib folders in C:\Ruby33-x64\lib\ruby\gems\3.3.0\gems\wayback_machine_downloader-2.3.1 with those from the compressed file: https://github.com/ShiftaDeband/wayback-machine-downloader/archive/refs/heads/feature/httpGet.zip

nico9julio avatar Jul 05 '24 14:07 nico9julio
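
The folder swap described above can also be scripted. The following is a rough sketch under stated assumptions: replace_dirs is a hypothetical helper, it deletes the existing target folders before copying, and the install path on your machine may differ from the one quoted above.

```ruby
require 'fileutils'

# Hypothetical helper, not part of wayback_machine_downloader.
# Replaces each named folder under target_root with the folder of the
# same name from source_root. Existing target folders are deleted first.
def replace_dirs(source_root, target_root, names = %w[bin lib])
  # Check every source folder before deleting anything.
  names.each do |name|
    source = File.join(source_root, name)
    raise ArgumentError, "missing source folder: #{source}" unless File.directory?(source)
  end

  names.each do |name|
    target = File.join(target_root, name)
    FileUtils.rm_rf(target)
    FileUtils.cp_r(File.join(source_root, name), target)
  end
end

# Usage sketch: look up the installed gem's folder instead of typing it.
#   gem_dir = Gem::Specification.find_by_name('wayback_machine_downloader').gem_dir
#   replace_dirs('C:/path/to/unzipped/fork', gem_dir)
```

Looking the folder up through Gem::Specification avoids hard-coding the Ruby and gem version numbers that appear in the path.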

Doesn't seem to work anymore... it gives Net::ReadTimeout with #<TCPSocket:(closed)> (Net::ReadTimeout)

irrdkwhattoput avatar Jul 25 '24 21:07 irrdkwhattoput
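
Both failures reported in this thread ("Connection refused" and Net::ReadTimeout) are the kind of error a client can wait on and retry rather than abort. Below is a minimal sketch of retrying with exponential backoff. The function name with_backoff, the list of retryable errors, and the default attempt count and delay are all assumptions. Nothing here is taken from wayback_machine_downloader or the fork, and retrying is no substitute for keeping the request rate low in the first place.

```ruby
require 'net/http'

# Assumed set of errors worth retrying; adjust as needed.
RETRYABLE_ERRORS = [Errno::ECONNREFUSED, Net::ReadTimeout, Net::OpenTimeout].freeze

# Hypothetical helper, not part of wayback_machine_downloader.
# Runs the block and returns its value. On a retryable error it waits
# base_delay, then 2x, 4x, ... seconds and tries again, re-raising the
# error once max_attempts attempts have failed.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue *RETRYABLE_ERRORS
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage sketch (the URL is a placeholder):
#   body = with_backoff { Net::HTTP.get(URI('https://example.com/')) }
```

Doubling the wait after each failure gives the server progressively more room before the next attempt.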