Crystal-Web-Archiver
Can run on remote server using X11 forwarding
Some large sites, like YRE and KC, can require downloading 2+ TB of content. That is troublesome when my effective bandwidth cap is about 500 GB (0.5 TB) per month: a single 2 TB site would use up at least four months of that cap. For sites like these, it may make sense to download them from a datacenter location rather than my usual location.
Sketch of how to use Crystal in a datacenter location:
- Create EC2 instance
- Create detachable EBS volume with hopefully enough space reserved to download the target site
  - This EBS volume can be increased in size later if needed, with some time cost
- Install Crystal on EC2 instance. Set up X11 forwarding to my laptop so that I can see Crystal's UI locally.
- Launch Crystal. Start downloading site to EBS volume.
- If pause needed, stop the EC2 instance, retaining the EBS volume
- To view downloaded pages:
  - Ensure can manually connect to HTTP server hosted by Crystal on remote EC2 instance, opening firewall ports as needed.
    - [ ] Crystal will need to run its server on 0.0.0.0 rather than 127.0.0.1
      - May need a preferences option to enable this behavior
  - Ensure can easily view downloaded page using the usual View button:
    - [ ] Crystal will need to generate URLs pointing to the correct remote domain
      - May need a preferences option to configure what the remote domain is
    - [ ] Crystal should not try to open a web browser on the remote server
      - May need the View button to display a clickable blue link instead of opening a web browser directly
- Initiate upload of fully downloaded site to Glacier Deep Archive, using the usual s3cmd
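
The server-related checklist items above could be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not Crystal's actual code: `ServerPrefs`, `make_server`, `view_url`, and `show_page` are hypothetical names, and the port number is arbitrary.

```python
import webbrowser
from dataclasses import dataclass
from http.server import HTTPServer, SimpleHTTPRequestHandler
from typing import Callable, Optional
from urllib.parse import quote


@dataclass
class ServerPrefs:
    # Hypothetical preferences; these are NOT Crystal's real option names.
    bind_host: str = "127.0.0.1"       # set to "0.0.0.0" to accept remote connections
    port: int = 8080                   # arbitrary port chosen for this sketch
    public_host: Optional[str] = None  # e.g. the EC2 instance's public DNS name


def make_server(prefs: ServerPrefs) -> HTTPServer:
    """Create an HTTP server bound to the configured interface."""
    return HTTPServer((prefs.bind_host, prefs.port), SimpleHTTPRequestHandler)


def view_url(prefs: ServerPrefs, archive_path: str) -> str:
    """Build the URL a user should visit, preferring the remote domain if set."""
    host = prefs.public_host or "127.0.0.1"
    return f"http://{host}:{prefs.port}/{quote(archive_path.lstrip('/'))}"


def show_page(
    prefs: ServerPrefs,
    archive_path: str,
    open_browser: Callable[[str], object] = webbrowser.open,
) -> str:
    """Open a local browser only when no remote domain is configured.

    The URL is always returned so the UI can display it as a clickable link.
    """
    url = view_url(prefs, archive_path)
    if prefs.public_host is None:
        open_browser(url)
    return url
```

With `bind_host="0.0.0.0"` and `public_host` set to the instance's hostname, `show_page` would return a link for the UI to display rather than launching a browser on the remote server.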
In the future it may be desirable to add a TUI (Terminal UI) to Crystal so that it can be fully controlled over an SSH connection (without X11). However, that would add significant maintenance overhead to keep future changes to the GUI and TUI in sync.
Even with a TUI, special care would still be needed to actually view any downloaded pages.
EC2 Instance types that seem promising, with on-demand pricing, for 1-2¢/hr:
If I want to economically support long-running crawl processes in the future, EC2 Spot Instances have even better pricing, at the cost of requiring Crystal to understand and react to Spot Instance Interruption Notices and to follow other Spot Instance Best Practices.
If I wanted to support distributed crawling economically with EC2 Spot Instances, reacting to Instance Rebalance Recommendations would also be a good idea.
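
As a rough sketch of what reacting to these signals might involve: both interruption notices and rebalance recommendations are exposed through the EC2 instance metadata service, which returns HTTP 404 when no notice is pending. The code below polls both endpoints using IMDSv2 and the standard library only. The function names are hypothetical, and what Crystal would actually do on a notice (for example, flush project state to the EBS volume) is left out.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS_BASE = "http://169.254.169.254/latest"


def fetch_token(ttl_seconds: int = 21600, timeout: float = 2.0) -> str:
    """Request an IMDSv2 session token (only works from inside an EC2 instance)."""
    req = urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def fetch_notice(path: str, token: str, timeout: float = 2.0) -> Optional[dict]:
    """Return the JSON document at a metadata path, or None if none is pending (404)."""
    req = urllib.request.Request(
        f"{IMDS_BASE}/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None
        raise


def check_spot_signals(
    token: str,
    fetch: Callable[[str, str], Optional[dict]] = fetch_notice,
) -> dict:
    """Poll both Spot-related endpoints once and report what was found."""
    return {
        # When present: {"action": "stop" | "terminate" | "hibernate", "time": "..."}
        "interruption": fetch("meta-data/spot/instance-action", token),
        # When present: {"noticeTime": "..."}
        "rebalance": fetch("meta-data/events/recommendations/rebalance", token),
    }
```

A crawler could call `check_spot_signals(fetch_token())` every few seconds from a background thread and begin a graceful pause as soon as either value is not `None`.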