wayback-machine-downloader
Download fails
Can't download anything lately. Here's an example:
wayback_machine_downloader example.com
Downloading example.com to websites/example.com/ from Wayback Machine archives.
Getting snapshot pages................... found 25580 snaphots to consider.
5 files to download:
https://www.example.com/ # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/index.html was empty and was removed.
https://www.example.com/ -> websites/example.com/index.html (1/5)
http://www.example.com/? # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/?/index.html was empty and was removed.
http://www.example.com/? -> websites/example.com/?/index.html (2/5)
http://example.com/%2F/ # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com//index.html was empty and was removed.
http://example.com/%2F/ -> websites/example.com//index.html (3/5)
http://example.com/#main # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/#main/index.html was empty and was removed.
http://example.com/#main -> websites/example.com/#main/index.html (4/5)
http://example.com/#/login # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for 207.241.237.3:443)
websites/example.com/#/login/index.html was empty and was removed.
http://example.com/#/login -> websites/example.com/#/login/index.html (5/5)
What I get as a result is a bunch of empty folders. Does anyone have a solution?
Same here - I'm guessing that Wayback is breaking the connection after a small handful of requests... Mine worked for the first 19 pages, then it began to fail...
This fix does work. It's a bit slow now, of course, but the files get downloaded.
archive.org has implemented rate limiting, which is why the delay fixes things. It is unfortunate, and probably breaks multithreaded downloading as well, but it is a free resource after all. https://archive.org/details/toomanyrequests_20191110
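The actual delay patch isn't quoted in this thread, but here is a rough standalone sketch of the idea (the URLs and the one-second pause are just illustrative assumptions, not the proposed fix itself):

require 'net/http'
require 'uri'

# Hypothetical illustration: space requests out so archive.org's rate
# limiting doesn't start refusing connections.
urls = [
  "https://web.archive.org/web/2019/https://example.com/",
  "https://web.archive.org/web/2019/https://example.com/about"
]

urls.each do |url|
  response = Net::HTTP.get_response(URI(url))
  puts "#{url} -> #{response.code}"
  sleep 1 # crude throttle; the right delay is a guess, tune as needed
end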
Can we get this fix approved and a new release created?
As far as I can tell archive.org is limiting the number of connections you can make in a short period of time.
As mentioned in #264, browsers and wget (which use persistent connections) are not affected by this issue.
It should be fixed by using a single persistent connection for all downloads instead of creating a new connection for each download.
diff --git a/lib/wayback_machine_downloader.rb b/lib/wayback_machine_downloader.rb
index 730714a..199b9dd 100644
--- a/lib/wayback_machine_downloader.rb
+++ b/lib/wayback_machine_downloader.rb
@@ -206,11 +206,15 @@ class WaybackMachineDownloader
     @processed_file_count = 0
     @threads_count = 1 unless @threads_count != 0
     @threads_count.times do
+      http = Net::HTTP.new("web.archive.org", 443)
+      http.use_ssl = true
+      http.start()
       threads << Thread.new do
         until file_queue.empty?
           file_remote_info = file_queue.pop(true) rescue nil
-          download_file(file_remote_info) if file_remote_info
+          download_file(file_remote_info, http) if file_remote_info
         end
+        http.finish()
       end
     end
@@ -243,7 +247,7 @@ class WaybackMachineDownloader
     end
   end
-  def download_file file_remote_info
+  def download_file (file_remote_info, http)
     current_encoding = "".encoding
     file_url = file_remote_info[:file_url].encode(current_encoding)
     file_id = file_remote_info[:file_id]
@@ -268,8 +272,8 @@ class WaybackMachineDownloader
       structure_dir_path dir_path
       open(file_path, "wb") do |file|
         begin
-          URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
-            file.write(uri.read)
+          http.get(URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}")) do |body|
+            file.write(body)
           end
         rescue OpenURI::HTTPError => e
           puts "#{file_url} # #{e}"
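In plain terms, the patch swaps the per-file open-uri calls for one Net::HTTP session per thread, started once and reused for every request. A standalone sketch of that persistent-connection pattern, separate from the patch above (the paths here are just examples):

require 'net/http'

# One TCP/TLS connection is opened once and reused for every request,
# instead of a fresh connection per file.
paths = [
  "/web/2019/https://example.com/",
  "/web/2019/https://example.com/about"
]

Net::HTTP.start("web.archive.org", 443, use_ssl: true) do |http|
  paths.each do |path|
    response = http.get(path) # reuses the same keep-alive connection
    puts "#{path} -> #{response.code}"
  end
end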
This is an elegant (and working) solution. Nice one!
Thank you @ee3e!
Similarly, this should be implemented for get_all_snapshots_to_consider:
In wayback_machine_downloader.rb:
def get_all_snapshots_to_consider
  # Note: Passing a page index parameter allow us to get more snapshots,
  # but from a less fresh index
  http = Net::HTTP.new("web.archive.org", 443)
  http.use_ssl = true
  http.start()
  print "Getting snapshot pages"
  snapshot_list_to_consider = []
  snapshot_list_to_consider += get_raw_list_from_api(@base_url, nil, http)
  print "."
  unless @exact_url
    @maximum_pages.times do |page_index|
      snapshot_list = get_raw_list_from_api(@base_url + '/*', page_index, http)
      break if snapshot_list.empty?
      snapshot_list_to_consider += snapshot_list
      print "."
    end
  end
  http.finish()
  puts " found #{snapshot_list_to_consider.length} snaphots to consider."
  puts
  snapshot_list_to_consider
end
...and in archive_api.rb:
def get_raw_list_from_api url, page_index, http
  request_url = URI("https://web.archive.org/cdx/search/xd")
  params = [["output", "json"], ["url", url]]
  params += parameters_for_api page_index
  request_url.query = URI.encode_www_form(params)
  begin
    json = JSON.parse(http.get(URI(request_url)).body)
    if (json[0] <=> ["timestamp","original"]) == 0
      json.shift
    end
    json
  rescue JSON::ParserError
    []
  end
end
(Please check my code, but it worked for me to download a very large archive that I've been struggling with for a bit.)