
Download fails

ingvarr777 opened this issue • 7 comments

Can't download anything lately. Here's an example:

wayback_machine_downloader example.com
Downloading example.com to websites/example.com/ from Wayback Machine archives.

Getting snapshot pages................... found 25580 snaphots to consider.

5 files to download:
https://www.example.com/ # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/index.html was empty and was removed.
https://www.example.com/ -> websites/example.com/index.html (1/5)
http://www.example.com/? # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/?/index.html was empty and was removed.
http://www.example.com/? -> websites/example.com/?/index.html (2/5)
http://example.com/%2F/ # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com//index.html was empty and was removed.
http://example.com/%2F/ -> websites/example.com//index.html (3/5)
http://example.com/#main # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/#main/index.html was empty and was removed.
http://example.com/#main -> websites/example.com/#main/index.html (4/5)
http://example.com/#/login # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/#/login/index.html was empty and was removed.
http://example.com/#/login -> websites/example.com/#/login/index.html (5/5)

What I get as a result is a bunch of empty folders. Does anyone have a solution?

ingvarr777 avatar Nov 12 '23 18:11 ingvarr777

Same here. I'm guessing that Wayback is breaking the connection after a small handful of requests; mine worked for the first 19 pages, then it began to fail.

jomo06 avatar Nov 14 '23 15:11 jomo06

The delay fix does work. It's a bit slow now, of course, but the files get downloaded.

ingvarr777 avatar Nov 19 '23 03:11 ingvarr777

archive.org has implemented rate limiting, which is why the delay fixes things. It is unfortunate, and probably breaks multithreaded downloading as well, but it is a free resource after all. https://archive.org/details/toomanyrequests_20191110
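
For anyone who just needs downloads to go through before a patched release lands, the workaround being discussed amounts to pausing between requests. A minimal sketch of that idea, not the gem's actual code, with a placeholder URL and a guessed pause length:

require "net/http"
require "uri"

# Hypothetical throttled fetch loop: space requests out so archive.org's
# rate limiter is never tripped. Tune the pause to whatever the server
# tolerates; 4 seconds is only a guess.
urls = [
  "https://web.archive.org/web/2023id_/http://example.com/",
]

urls.each do |url|
  response = Net::HTTP.get_response(URI(url))
  puts "#{url} -> HTTP #{response.code}"
  sleep 4 # crude per-request throttle
end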

sww1235 avatar Nov 20 '23 03:11 sww1235

Can we get this fix approved and a new release created?

technomaz avatar Dec 14 '23 22:12 technomaz

As far as I can tell archive.org is limiting the number of connections you can make in a short period of time.

As mentioned in #264, browsers and wget (which use persistent connections) are not affected by this issue.

It should be fixed by using a persistent connection for all downloads (one per thread) instead of opening a new connection for each file.
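
For context before the patch: `Net::HTTP.start` performs the TCP and TLS handshake once and then reuses that session for every `get` issued inside its block. A standalone sketch of the idea, with placeholder paths:

require "net/http"

paths = [
  "/web/2023id_/http://example.com/",
  "/web/2023id_/http://example.com/about",
]

# One TCP + TLS handshake for the whole batch instead of one per file,
# which is what the connection limiter appears to be counting.
Net::HTTP.start("web.archive.org", 443, use_ssl: true) do |http|
  paths.each do |path|
    response = http.get(path) # reuses the already-open connection
    puts "#{path} -> HTTP #{response.code}"
  end
end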

diff --git a/lib/wayback_machine_downloader.rb b/lib/wayback_machine_downloader.rb
index 730714a..199b9dd 100644
--- a/lib/wayback_machine_downloader.rb
+++ b/lib/wayback_machine_downloader.rb
@@ -206,11 +206,15 @@ class WaybackMachineDownloader
     @processed_file_count = 0
     @threads_count = 1 unless @threads_count != 0
     @threads_count.times do
+      http = Net::HTTP.new("web.archive.org", 443)
+      http.use_ssl = true
+      http.start()
       threads << Thread.new do
         until file_queue.empty?
           file_remote_info = file_queue.pop(true) rescue nil
-          download_file(file_remote_info) if file_remote_info
+          download_file(file_remote_info, http) if file_remote_info
         end
+        http.finish()
       end
     end

@@ -243,7 +247,7 @@ class WaybackMachineDownloader
     end
   end

-  def download_file file_remote_info
+  def download_file (file_remote_info, http)
     current_encoding = "".encoding
     file_url = file_remote_info[:file_url].encode(current_encoding)
     file_id = file_remote_info[:file_id]
@@ -268,8 +272,8 @@ class WaybackMachineDownloader
         structure_dir_path dir_path
         open(file_path, "wb") do |file|
           begin
-            URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
-              file.write(uri.read)
+            http.get(URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}")) do |body|
+              file.write(body)
             end
           rescue OpenURI::HTTPError => e
             puts "#{file_url} # #{e}"

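One caveat worth double-checking: with open-uri out of the request path, `OpenURI::HTTPError` will no longer be raised, and `Net::HTTP#get` does not raise on 4xx/5xx at all, so the rescue above probably never fires. A sketch of checking the status explicitly instead (placeholder path):

require "net/http"

Net::HTTP.start("web.archive.org", 443, use_ssl: true) do |http|
  response = http.get("/web/2023id_/http://example.com/missing")
  # Net::HTTP returns a response object for error statuses rather than
  # raising, so failures have to be detected by class or code.
  unless response.is_a?(Net::HTTPSuccess)
    puts "request failed # HTTP #{response.code} #{response.message}"
  end
end
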
ee3e avatar Dec 22 '23 21:12 ee3e

The patch above is an elegant (and working) solution. Nice one!

JXGA avatar Jan 09 '24 20:01 JXGA

Thank you @ee3e!

Similarly, this should be implemented for get_all_snapshots_to_consider:

In wayback_machine_downloader.rb:

  def get_all_snapshots_to_consider
    # Note: Passing a page index parameter allows us to get more snapshots,
    # but from a less fresh index
    http = Net::HTTP.new("web.archive.org", 443)
    http.use_ssl = true
    http.start()
    print "Getting snapshot pages"
    snapshot_list_to_consider = []
    snapshot_list_to_consider += get_raw_list_from_api(@base_url, nil, http)
    print "."
    unless @exact_url
      @maximum_pages.times do |page_index|
        snapshot_list = get_raw_list_from_api(@base_url + '/*', page_index, http)
        break if snapshot_list.empty?
        snapshot_list_to_consider += snapshot_list
        print "."
      end
    end
    http.finish()
    puts " found #{snapshot_list_to_consider.length} snaphots to consider."
    puts
    snapshot_list_to_consider
  end

...and in archive_api.rb:

  def get_raw_list_from_api url, page_index, http
    request_url = URI("https://web.archive.org/cdx/search/xd")
    params = [["output", "json"], ["url", url]]
    params += parameters_for_api page_index
    request_url.query = URI.encode_www_form(params)

    begin
      json = JSON.parse(http.get(URI(request_url)).body)
      if (json[0] <=> ["timestamp","original"]) == 0
        json.shift
      end
      json
    rescue JSON::ParserError
      []
    end
  end
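
If I'm reading the flow correctly, sharing one connection here is safe because get_all_snapshots_to_consider runs serially on the main thread before any download workers start, while the earlier patch still gives each download thread its own connection.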

(Please check my code, but it worked for me to download a very large archive that I'd been struggling with for a while.)

ShiftaDeband avatar Feb 08 '24 05:02 ShiftaDeband