browsertrix
Browser-Based On-Demand Web Archiving Automation
Note: This repository is obsolete and represents an original attempt at browser automation. Please see the new Browsertrix at webrecorder/browsertrix.
Browsertrix 0.1.1
Browsertrix is a web archiving automation system, designed to create high-fidelity web archives by automating real browsers running in containers (Docker) using Selenium and other automation tools. The system does not currently do any archiving of its own; instead, it automates page loading through existing archiving and recording tools.
By loading pages directly through a browser, it will be possible to fully recreate a page as the user experiences it, including all dynamic content and interaction.
Browsertrix is named after Heritrix, the venerable web crawler technology which has become a standard for web archiving.
What Browsertrix Does
The first iteration of Browsertrix supports archiving a single web page through an existing archiving back-end.
URLs can be submitted to Browsertrix via HTTP, and it will attempt to load them in an available browser right away.
Browsertrix can operate synchronously or asynchronously. If the operation does not complete within the specified timeout (default 30 seconds), a queued response is returned and the user may retry the operation to get the result at a later time.
The results of the archiving operation are cached (for 10 minutes if successful, for 30 seconds otherwise) so that subsequent requests return the cached result.
Redis is used to queue URLs for archiving and to cache the results of archiving operations. Configurable options are currently available in the `config.py` module.
Additional automated browser "crawling" and multi-URL features are planned for the next iteration.
Installation
Docker and Docker Compose are the only requirements for running Browsertrix.

- Install Docker as recommended at https://docs.docker.com/installation/
- Install Docker Compose with `pip install docker-compose`
- After cloning this repository, run `docker-compose up`
Web Interface
In this version, a basic 'Archive This Website' UI is available on the home page and provides a form to submit URLs to be archived through Chrome or Firefox. The interface wraps the Archiving API explained below.
The supported back-ends are https://webrecorder.io/ and the Internet Archive's Save Page Now feature.
The UI is available at `http://$DOCKER_HOST/`, where `DOCKER_HOST` is the host where Docker is running.
Scaling Workers
By default, Browsertrix starts with one Chrome and one Firefox worker. `docker-compose scale` can be used to set the number of workers as needed.
The `set-scale.sh` script is provided as a convenience that resizes the Chrome and Firefox workers together. For example, to run 4 of each browser:
`./set-scale.sh 4`
Archiving API /archivepage
This first iteration of Browsertrix provides an API endpoint, `/archivepage`, for archiving a single page.
To archive a URL, make a GET request to `http://<DOCKER_HOST>/archivepage?url=URL&archive=ARCHIVE[&browser=BROWSER]` with the following parameters:
- `url` - The URL to be archived.
- `archive` - One of the available archives specified in `config.py`. Current archives are `ia-save` and `webrecorder`.
- `browser` - (Optional) Currently either `chrome` or `firefox`. Chrome is the default if omitted.
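For example, a request could be made from Python as in the following sketch. `DOCKER_HOST` and the target URL here are placeholder values, not part of the API:

```python
import requests

DOCKER_HOST = "localhost"  # placeholder: wherever Docker is running

resp = requests.get(
    f"http://{DOCKER_HOST}/archivepage",
    params={
        "url": "http://example.com/",  # page to archive (placeholder)
        "archive": "webrecorder",      # or "ia-save"
        "browser": "chrome",           # optional; chrome is the default
    },
)
print(resp.json())
```

Since this is a plain GET request, the same call also works from curl or directly in a browser's address bar.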
Results
The result of the archiving operation is a JSON block. The block contains one of the following:

- `error: true` is set and the `msg` field contains more details about the error. The `type` field indicates a specific type of error, e.g. `type: blocked` currently indicates that the archiving service can not archive this page.
- `queued: true` is set if the timeout for archiving the page (currently 30 secs) has been exceeded. If this is the case, the URL has been put on a queue and the query should be retried until the page is archived (see the polling sketch after this list). The `queue-pos` field indicates the position in the queue, with `queue-pos: 1` meaning the URL is up next, and `queue-pos: 0` meaning the URL is currently being loaded in the browser.
- `archived: true` is set if the archiving of the page has fully finished. The following additional properties may be set in the JSON result:
  - `replay_url` - if the archived page is immediately available for replay, this is the URL to access the archived content.
  - `download_url` - if the archived content is available for download as a WARC file, this is the link to the WARC.
  - `actual_url` - if the original URL caused a redirect, this will contain the actual URL that was archived (only present if different from the original).
  - `browser_url` - the actual URL loaded by the browser to "seed" the archive.
  - `time` - timestamp of when the page was archived.
  - `ttl` - time remaining (in seconds) for this entry to be stored in the cache. After the entry expires, a subsequent query will re-archive the page. The default is 10 min (600 secs) and can be configured in `config.py`.
  - `log` - HTTP response log from the browser, available only in Chrome. The format is `{<URL>: <STATUS>}` for each URL loaded to archive the current page.
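Putting these pieces together, a client can poll `/archivepage` until a final result arrives. A minimal sketch, reusing the placeholder host from above and only the response fields documented in this section:

```python
import time
import requests

DOCKER_HOST = "localhost"  # placeholder: wherever Docker is running

def archive_page(url, archive="webrecorder", browser="chrome"):
    """Poll /archivepage until the result is no longer queued."""
    params = {"url": url, "archive": archive, "browser": browser}
    while True:
        result = requests.get(
            f"http://{DOCKER_HOST}/archivepage", params=params
        ).json()
        if result.get("queued"):
            # Still waiting; queue-pos 0 means the URL is loading now.
            print("queued, position:", result.get("queue-pos"))
            time.sleep(5)
            continue
        return result

result = archive_page("http://example.com/")
if result.get("error"):
    print("failed:", result.get("type"), result.get("msg"))
else:
    print("archived at", result.get("time"))
    print("replay:", result.get("replay_url"))
```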
Support
Initial work on this project was sponsored by the Hypothes.is Annotation Fund