Crystal-Web-Archiver

Can run on remote server using X11 forwarding

Open davidfstr opened this issue 5 months ago • 2 comments

Some large sites like YRE and KC can require the download of 2+ TB of content. That can be troublesome when my effective bandwidth cap per month is about 500 GB (0.5 TB). For sites like these, it may make sense to download them from a datacenter location rather than my usual location.
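As a rough back-of-the-envelope check of the figures above (assuming, purely for illustration, that 2 TB is taken as 2000 GB and that the whole monthly cap could be spent on the crawl), the download would span several months from the usual location. The function name here is hypothetical and not part of Crystal:

```python
import math

def months_to_download(site_size_gb: float, monthly_cap_gb: float) -> int:
    """Whole months needed to fetch a site without exceeding a monthly cap."""
    return math.ceil(site_size_gb / monthly_cap_gb)

# A 2 TB site (taken as 2000 GB) against a 500 GB/month cap
print(months_to_download(2000, 500))  # → 4
```

Since the estimate is "2+ TB", four months is a lower bound.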

Sketch of how to use Crystal in a datacenter location:

  • Create EC2 instance
  • Create detachable EBS volume with hopefully enough space reserved to download the target site
    • This EBS volume can be increased in size later if needed, with some time cost
  • Install Crystal on EC2 instance. Set up X11 forwarding to my laptop so that I can see Crystal's UI locally.
  • Launch Crystal. Start downloading site to EBS volume.
    • If pause needed, stop the EC2 instance, retaining the EBS volume
    • To view downloaded pages:
      • Ensure I can manually connect to the HTTP server hosted by Crystal on the remote EC2 instance, opening firewall ports as needed.
        • [ ] Crystal will need to run its server on 0.0.0.0 rather than 127.0.0.1
          • May need a preferences option to enable this behavior
      • Ensure I can easily view a downloaded page using the usual View button:
        • [ ] Crystal will need to generate URLs pointing to the correct remote domain
          • May need a preferences option to configure what the remote domain is
        • [ ] Crystal should not try to open a web browser on the remote server
          • May need the View button to display a clickable blue link instead of opening a web browser directly
  • Initiate upload of fully downloaded site to Glacier Deep Archive, using the usual s3cmd
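The three unchecked items above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Crystal's actual server code: the names `make_server`, `view_url`, and `on_view`, the `allow_remote` and `remote_domain` preferences, and the example port are all assumptions made for the sketch.

```python
import webbrowser
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Optional
from urllib.parse import quote

def make_server(port: int, allow_remote: bool = False) -> ThreadingHTTPServer:
    """Bind to all interfaces only when the (hypothetical) preference is on."""
    host = "0.0.0.0" if allow_remote else "127.0.0.1"
    return ThreadingHTTPServer((host, port), SimpleHTTPRequestHandler)

def view_url(path: str, port: int, remote_domain: Optional[str] = None) -> str:
    """Build the URL for a page, pointing at the remote domain if configured."""
    host = remote_domain or "127.0.0.1"
    return f"http://{host}:{port}/{quote(path.lstrip('/'))}"

def on_view(
    path: str,
    port: int,
    remote_domain: Optional[str] = None,
    open_browser: Callable[[str], object] = webbrowser.open,
    show_link: Callable[[str], object] = print,
) -> str:
    """Open a local browser, or just show a link when running remotely."""
    url = view_url(path, port, remote_domain)
    if remote_domain is None:
        open_browser(url)   # local run: behave as usual
    else:
        show_link(url)      # remote run: user clicks the link on their own machine
    return url
```

For example, `on_view("index.html", 8000, remote_domain="ec2-host.example.com")` would print a link rather than launching a browser on the EC2 instance. Binding to `0.0.0.0` exposes the server to anything that can reach the port, so the firewall rules mentioned above would still need to restrict access.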

davidfstr, Jan 21 '24 15:01

In the future it may be desirable to add a TUI (Terminal UI) to Crystal so that it can be fully controlled over an SSH connection (without X11). However, that would add significant maintenance overhead to keep future changes to the GUI and TUI in sync.

Even with a TUI, special consideration will still need to be taken to actually view any downloaded pages.

davidfstr, Jan 21 '24 16:01

EC2 instance types that seem promising, with on-demand pricing of 1-2¢/hr: (screenshot "Screen Shot 2024-03-11 at 8 39 36 AM")

If I want to support long-running crawl processes in the future economically, EC2 Spot Instances have even better pricing, at the cost of requiring Crystal to understand & react to Spot Instance Interruption Notices and consider other Spot Instance Best Practices.

If I wanted to support distributed crawling economically with EC2 Spot Instances, reacting to Instance Rebalance Recommendations would also be a good idea.
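Both kinds of notice are published through the EC2 instance metadata service, so one possible approach is for the crawler to poll it periodically. The sketch below assumes IMDSv2 (token-based) access and uses only the standard library. The function names are hypothetical, and what Crystal should actually do on a notice (for example, pausing downloads and flushing project state to the EBS volume) is left open.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"

def imds_get(path: str, timeout: float = 2.0) -> Optional[str]:
    """Fetch a metadata path via IMDSv2; return None if it does not exist (404)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as e:
        if e.code == 404:  # no notice has been issued
            return None
        raise

def check_spot_events(fetch: Callable[[str], Optional[str]] = imds_get) -> dict:
    """Return any pending interruption notice and rebalance recommendation."""
    interruption = fetch("spot/instance-action")
    rebalance = fetch("events/recommendations/rebalance")
    return {
        "interruption": json.loads(interruption) if interruption else None,
        "rebalance": json.loads(rebalance) if rebalance else None,
    }
```

Passing `fetch` as a parameter keeps the polling logic testable off EC2. An interruption notice arrives roughly two minutes before the instance is reclaimed, so a poll interval of a few seconds seems a reasonable starting point.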

davidfstr, Mar 11 '24 12:03