wayback_archiver
                        Ruby gem to send URLs to Wayback Machine
WaybackArchiver
Post URLs to Wayback Machine (Internet Archive), using a crawler, from Sitemap(s), or a list of URLs.
The Wayback Machine is a digital archive of the World Wide Web [...] The service enables users to see archived versions of web pages across time ...
- Wikipedia
Index
- Installation
- Usage
  - Ruby
  - CLI
- Configuration
- RubyDoc
- Contributing
- MIT License
- References
Installation
Install the gem:
$ gem install wayback_archiver
Or add this line to your application's Gemfile:
gem 'wayback_archiver'
And then execute:
$ bundle
Usage
- Ruby
- CLI
Strategies:
- auto (the default) - Will try to
  - Find Sitemap(s) defined in /robots.txt
  - Then in common sitemap locations /sitemap-index.xml, /sitemap.xml etc.
  - Fallback to crawling (using the excellent spidr gem)
- sitemap - Parse Sitemap(s), supports index files (and gzip)
- urls - Post URL(s)
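The fallback order of the auto strategy can be pictured roughly as follows. This is a minimal sketch and not the gem's implementation: `sitemaps_from_robots`, `pick_strategy`, and the `sitemap_exists` callable are hypothetical names used only for illustration, and no network requests are made.

```ruby
# Illustrative only: mirrors the order described above, not the gem's internals.
COMMON_SITEMAP_PATHS = %w[/sitemap-index.xml /sitemap.xml].freeze

# Extract sitemap URLs from the body of a robots.txt file.
def sitemaps_from_robots(robots_txt)
  robots_txt.each_line.filter_map do |line|
    match = line.match(/\A\s*sitemap:\s*(\S+)/i)
    match && match[1]
  end
end

# Decide what to do for a host. `sitemap_exists` is a hypothetical callable
# that answers whether a given path exists on the host.
def pick_strategy(robots_txt, sitemap_exists)
  from_robots = sitemaps_from_robots(robots_txt)
  return [:sitemap, from_robots] unless from_robots.empty?

  common = COMMON_SITEMAP_PATHS.select { |path| sitemap_exists.call(path) }
  return [:sitemap, common] unless common.empty?

  [:crawl, []]
end

robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
p pick_strategy(robots, ->(_path) { false })
p pick_strategy('', ->(path) { path == '/sitemap.xml' })
p pick_strategy('', ->(_path) { false })
```

Passing the existence check in as a callable keeps the sketch testable without touching the network.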
Ruby
First require the gem
require 'wayback_archiver'
Examples:
Auto
# auto is the default
WaybackArchiver.archive('example.com')
# or explicitly
WaybackArchiver.archive('example.com', strategy: :auto)
Crawl
WaybackArchiver.archive('example.com', strategy: :crawl)
Only send one single URL
WaybackArchiver.archive('example.com', strategy: :url)
Send multiple URLs
WaybackArchiver.archive(%w[example.com www.example.com], strategy: :urls)
Send all URL(s) found in Sitemap
WaybackArchiver.archive('example.com/sitemap.xml', strategy: :sitemap)
# works with Sitemap index files too
WaybackArchiver.archive('example.com/sitemap-index.xml.gz', strategy: :sitemap)
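To make the "index files (and gzip)" support more concrete, here is a small dependency-free sketch of reading a possibly gzipped sitemap and listing its `<loc>` entries. The helper names (`sitemap_xml`, `sitemap_locations`, `sitemap_index?`) are hypothetical and the regex-based extraction is a simplification; the gem's own parser may well work differently.

```ruby
require 'stringio'
require 'zlib'

# First two bytes of every gzip stream.
GZIP_MAGIC = "\x1f\x8b".b.freeze

# Return the XML text, transparently decompressing gzip data.
def sitemap_xml(raw)
  bytes = raw.b
  return raw unless bytes.start_with?(GZIP_MAGIC)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Pull every <loc> value out of a sitemap or sitemap index.
# A regex keeps the sketch short; a real XML parser is more robust.
def sitemap_locations(raw)
  sitemap_xml(raw).scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten
end

# True when the document is a sitemap index, i.e. its <loc> entries
# point at further sitemaps rather than at pages.
def sitemap_index?(raw)
  sitemap_xml(raw).include?('<sitemapindex')
end

plain = '<urlset><url><loc>https://example.com/</loc></url></urlset>'
index = '<sitemapindex><sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap></sitemapindex>'
gzipped = Zlib.gzip(index)

p sitemap_locations(plain)
p sitemap_locations(gzipped)
p sitemap_index?(gzipped)
```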
Specify concurrency
WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
Specify max number of URLs to be archived
WaybackArchiver.archive('example.com', strategy: :auto, limit: 10)
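As a rough mental model of what `concurrency` and `limit` control, the sketch below processes at most `limit` URLs using `concurrency` worker threads. It is an illustration under stated assumptions, not the gem's code: `process_urls` is a hypothetical helper, and treating a negative limit as "unlimited" is borrowed from the `max_limit = -1` default shown in the Configuration section.

```ruby
# Illustrative only: cap the work at `limit` URLs and spread it over
# `concurrency` threads. A negative limit means "no cap" (assumption).
def process_urls(urls, concurrency: 1, limit: -1, &handler)
  selected = limit.negative? ? urls : urls.first(limit)

  queue = Queue.new
  selected.each { |url| queue << url }
  queue.close # workers receive nil from #pop once the queue is drained

  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (url = queue.pop)
        results << handler.call(url)
      end
    end
  end
  workers.each(&:join)

  # Completion order depends on thread scheduling.
  Array.new(results.size) { results.pop }
end

urls = %w[example.com/a example.com/b example.com/c example.com/d]
done = process_urls(urls, concurrency: 2, limit: 3) { |url| "archived #{url}" }
puts done.sort
```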
Each archive strategy can receive a block that will be called for each URL
WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
end
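If you want a summary rather than one line per URL, the block can feed a small collector. The sketch below assumes only the three methods used in the example above (`success?`, `code`, `archived_url`); `ArchiveReport` and `FakeResult` are hypothetical names, and `FakeResult` merely stands in for the yielded object so the snippet runs without the gem.

```ruby
# Stand-in for the object yielded to the block; only the methods used
# in the example above are modelled.
FakeResult = Struct.new(:archived_url, :code, :ok) do
  def success?
    ok
  end
end

# Collects outcomes so they can be summarised after the run.
class ArchiveReport
  attr_reader :archived, :failed

  def initialize
    @archived = []
    @failed = {}
  end

  # Intended to be called from the archive block, once per result.
  def record(result)
    if result.success?
      @archived << result.archived_url
    else
      @failed[result.archived_url] = result.code
    end
  end

  def summary
    "#{archived.size} archived, #{failed.size} failed"
  end
end

report = ArchiveReport.new
# With the gem loaded this would be:
#   WaybackArchiver.archive('example.com') { |result| report.record(result) }
report.record(FakeResult.new('example.com', 200, true))
report.record(FakeResult.new('example.com/missing', 404, false))
puts report.summary
```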
Use your own adapter for posting found URLs
WaybackArchiver.adapter = ->(url) { puts url } # anything that responds to #call
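An adapter does not have to be a lambda; any object with a `#call(url)` method should do. Below is a minimal sketch of a dry-run adapter that records URLs instead of posting them. `DryRunAdapter` is a hypothetical name, and since this README does not say what `#call` is expected to return, returning the URL is an assumption.

```ruby
# A custom adapter: any object that responds to #call(url).
class DryRunAdapter
  attr_reader :urls

  def initialize(io = $stdout)
    @io = io
    @urls = []
  end

  # Record and print the URL instead of posting it anywhere.
  # The return value is an assumption; the README does not specify one.
  def call(url)
    @urls << url
    @io.puts("would archive: #{url}")
    url
  end
end

adapter = DryRunAdapter.new
# Only assign when the gem has actually been loaded.
WaybackArchiver.adapter = adapter if defined?(WaybackArchiver)

adapter.call('example.com')
```

A dry-run adapter like this can be handy for checking which URLs a strategy finds before anything is sent to the Wayback Machine.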
CLI
Usage:
wayback_archiver [<url>] [options]
Print full usage instructions
wayback_archiver --help
Examples:
Auto
# auto is the default
wayback_archiver example.com
# or explicitly
wayback_archiver example.com --auto
Crawl
wayback_archiver example.com --crawl
Only send one single URL
wayback_archiver example.com --url
Send multiple URLs
wayback_archiver example.com www.example.com --urls
Crawl multiple URLs
wayback_archiver example.com www.example.com --crawl
Send all URL(s) found in Sitemap
wayback_archiver example.com/sitemap.xml
# works with Sitemap index files too
wayback_archiver example.com/sitemap-index.xml.gz
Most options
wayback_archiver example.com www.example.com --auto --concurrency=10 --limit=100 --log=output.log --verbose
View archive: https://web.archive.org/web/*/http://example.com (replace http://example.com with your desired domain).
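The listing URL above follows a simple pattern, so it is easy to build for any site. A minimal sketch follows; `wayback_listing_url` is a hypothetical helper, and defaulting to `http://` when no scheme is given is an assumption.

```ruby
# Build the "view archive" URL shown above for any site.
def wayback_listing_url(site)
  # Assumption: default to http:// when the caller gives a bare domain.
  site = "http://#{site}" unless site.match?(%r{\Ahttps?://})
  "https://web.archive.org/web/*/#{site}"
end

puts wayback_listing_url('example.com')
puts wayback_listing_url('https://example.com/about')
```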
Configuration
Note: By default, wayback_archiver doesn't respect robots.txt files; see this Internet Archive blog post for more information.
Configuration (the values below are the defaults)
WaybackArchiver.concurrency = 1
WaybackArchiver.user_agent = WaybackArchiver::USER_AGENT
WaybackArchiver.respect_robots_txt = WaybackArchiver::DEFAULT_RESPECT_ROBOTS_TXT
WaybackArchiver.logger = Logger.new(STDOUT)
WaybackArchiver.max_limit = -1 # unlimited
WaybackArchiver.adapter = WaybackArchiver::WaybackMachine # must implement #call(url)
For a more verbose log you can configure WaybackArchiver as such:
WaybackArchiver.logger = Logger.new(STDOUT).tap do |logger|
  logger.progname = 'WaybackArchiver'
  logger.level = Logger::DEBUG
end
Pro tip: If you're using the gem in a Rails app you can set WaybackArchiver.logger = Rails.logger.
Docs
You can find the docs online on RubyDoc.
This gem is documented using yard. To generate the docs, run it from the root of this repository:
yard # Generates documentation to doc/
Contributing
Contributions, feedback and suggestions are very welcome.
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
License
MIT License
References
- Don't know what the Wayback Machine (Internet Archive) is? Wayback Machine
- Don't know what a Sitemap is? sitemaps.org
- Don't know what robots.txt is? www.robotstxt.org
