Enhancement: Use the same URL layout as Archive.org for viewing ArchiveBox Snapshots `https://archive.org/web/<URL>`

Open pirate opened this issue 2 years ago • 3 comments

To visit an archived version of a website (or archive it automatically) on Archive.org, one can just visit http://web.archive.org/web/https://example.com/ and it will redirect to http://web.archive.org/web/20230116145642/https://example.com/ (or whatever the most recent snapshot timestamp is).

To really embody the tagline "ArchiveBox is a self-hosted version of archive.org", we should properly support their URL scheme too.

e.g.

  • https://demo.archivebox.io/web/https://example.com should redirect to the most recent snapshot https://demo.archivebox.io/web/20230116145642/https://example.com

    • note: support both the ArchiveBox-style Unix timestamp format (e.g. 1673919713) and the Archive.org-style 20230116145642 format, including truncated forms such as 2023, 202301, 20230116
    • note: also support visiting snapshots using a ULID-style UUID instead of a timestamp as the slug, e.g. https://demo.archivebox.io/01ARZ3NDEKTSV4RRFFQ69G5FAV/...
    • note: support automatic prefix-matching of slugs, so that 2023 matches 202301, 20230116, and 20230116145642, and 01AN4Z07BY matches 01AN4Z07BY79KA1307SR9X4MV3

    Full spec:

https://demo.archivebox.io/web/<SLUG> where SLUG can be:

  • an original URL, with or without scheme, e.g. https://example.com/index.html or example.com/index.html
    ➡️ redirect to the most recent snapshot of that URL, e.g. https://demo.archivebox.io/web/20230116145642/https://example.com/index.html
  • an ArchiveBox snapshot UUID in ULID spec format, e.g. 01AN4Z07BY79KA1307SR9X4MV3/index.html, or just its timestamp prefix, e.g. 01AN4Z07BY/index.html
    ➡️ redirect to that exact snapshot, e.g. https://demo.archivebox.io/web/20230116145642/https://example.com/index.html
  • an ArchiveBox snapshot timestamp in YYYYMMDDHHMMSS format, a shortened form like YYYYMM, or Unix timestamp format, e.g. 20230116145642/index.html, 202301161456/index.html, 202301/index.html, or 1673919713/index.html
    ➡️ redirect to the most recent snapshot matching that prefix, e.g. https://demo.archivebox.io/web/20230116145642/https://example.com/index.html
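
The spec above implies that the server has to guess which kind of identifier a SLUG starts with. A minimal sketch of one way to do that follows. It is not ArchiveBox's actual routing code: the function name `classify_slug`, the regexes, and the rule that 10-digit numbers starting with 19 or 20 are treated as dates rather than Unix timestamps are all assumptions made for illustration.

```python
import re

# Assumed heuristics, not taken from ArchiveBox itself:
DATE_PREFIX = re.compile(r"^(19|20)\d{2,12}$")            # YYYY up to YYYYMMDDHHMMSS
UNIX_TS = re.compile(r"^[1-9]\d{9}$")                     # 10-digit Unix seconds
ULID_PREFIX = re.compile(r"^0[0-9A-HJKMNP-TV-Z]{1,25}$")  # Crockford base32, leading 0


def classify_slug(slug: str) -> str:
    """Guess whether a /web/<SLUG> path starts with a URL, date, Unix time, or ULID."""
    if re.match(r"^https?://", slug, re.IGNORECASE):
        return "url"
    # Only the first path segment can be a timestamp or ULID.
    head = slug.partition("/")[0]
    if DATE_PREFIX.match(head):
        return "date"
    if UNIX_TS.match(head):
        return "unix"
    if ULID_PREFIX.match(head.upper()):
        return "ulid"
    # Anything else is treated as a scheme-less URL such as example.com/index.html.
    return "url"


for example in (
    "https://example.com/index.html",
    "example.com/index.html",
    "01AN4Z07BY/index.html",
    "202301/index.html",
    "1673919713/index.html",
):
    print(example, "->", classify_slug(example))
```

Checking the date pattern before the Unix pattern means a 10-digit Unix timestamp that happens to start with 19 or 20 would be misread as a date; a real implementation would need a rule for that overlap.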

Subtasks:

  • [x] adds derived ulid field + migration to coalesce old uuid and timestamp fields into new ulid format (+asserts all snapshot timestamps are valid and are between 1900 and 2100 AD) (done in v0.8.5)
  • [x] update admin and index UI to show ULID of old UUID4 xxxx-xxxx-xxxxxxx format, add ULID diagram in docs breaking it down into timestamp and randomness
  • [x] create disambiguation page to show all the matching results for a given SLUG if it's the prefix for multiple possible snapshots
  • [ ] reject Snapshot UUIDs being created that begin with 0, 1, 2, or htt to make prefix-matching faster and less error-prone (avoids clashing with 199x*/20** year, 1* unix timestamp, 01* ULID, or http(s?) URL slug prefixes)
  • [ ] add docs examples on how to truly "self-host your own archive.org", add screenshot side-by-side of URL bar examples for visiting snapshots on Archive.org and demo.Archivebox.io
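
For reference, the timestamp/randomness breakdown of a ULID mentioned in the subtasks can be sketched as below. This assumes the standard ULID layout (26 Crockford base32 characters: 10 encoding a 48-bit millisecond timestamp, then 16 of randomness); whether ArchiveBox's derived ulid field follows that layout exactly is not stated in this issue, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# Crockford base32 alphabet used by the ULID spec (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def split_ulid(ulid: str) -> tuple[str, str]:
    """Split a 26-character ULID into (timestamp part, randomness part)."""
    ulid = ulid.upper()
    if len(ulid) != 26 or any(char not in CROCKFORD for char in ulid):
        raise ValueError(f"not a valid ULID: {ulid!r}")
    return ulid[:10], ulid[10:]


def ulid_timestamp_ms(ulid: str) -> int:
    """Decode the first 10 characters as milliseconds since the Unix epoch."""
    timestamp_part, _randomness = split_ulid(ulid)
    milliseconds = 0
    for char in timestamp_part:
        milliseconds = milliseconds * 32 + CROCKFORD.index(char)
    return milliseconds


def ulid_datetime(ulid: str) -> datetime:
    """Return the creation time encoded in the ULID as a UTC datetime."""
    return datetime.fromtimestamp(ulid_timestamp_ms(ulid) / 1000, tz=timezone.utc)


timestamp_part, randomness_part = split_ulid("01AN4Z07BY79KA1307SR9X4MV3")
print(timestamp_part, randomness_part)
print(ulid_datetime("01AN4Z07BY79KA1307SR9X4MV3"))
```

Because the timestamp occupies the leading characters, ULIDs created close together share a prefix, which is what makes prefix-matching on a form like 01AN4Z07BY meaningful.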

pirate avatar Jan 17 '23 02:01 pirate

At least one project interested in using ArchiveBox (Kicksecure) would also be interested in this functionality, or any functionality that allows turning a URL into an archived URL via a simple transformation (i.e., prepend https://archivebox.example.org/whatever/goes/here/ to a URL to get an archived URL). The use case for this is:

  • We have two very large MediaWiki instances, with wikis that contain many links to external websites.
  • For each of those links, we want to link to an archived version of the page we link to.
  • If the corresponding url for https://example.com/my-page is https://archivebox.example.com/BN833Z or something like that, there's no easy way to convert a link to an ArchiveBox link. Thus adding the archive links requires running a large "archive job" that archives all unarchived links, then gets the corresponding URLs and mass-edits them into the Wiki. This is a pain.
  • If the corresponding url for https://example.com/my-page is https://archivebox.example.com/web/https://example.com/my-page, no mass-editing is required. A MediaWiki plugin can be used to put a button after each link that offers an archived version of the webpage to the user. (This is what we already do with archive.org.)

Worthy of note, the format doesn't have to be exactly like archive.org for this to work. If ArchiveBox supported the MementoWeb API similar to how archive.today does, we would end up turning https://example.com/my-page into https://archivebox.example.com/timegate/https://example.com/my-page, which works just as well.

Is help wanted here? Depending on how suitable ArchiveBox is for Kicksecure's use case, this might be a feature we'd be willing to implement and work on upstreaming.

ArrayBolt3 avatar Nov 20 '24 00:11 ArrayBolt3

This is actually already supported 😃 It's just not well documented yet. You can visit:
https://archivebox.example.com/archive/https://example.com/archived/url e.g.:
https://demo.archivebox.io/archive/https://arstechnica.com/tech-policy/2024/10/the-internet-archive-and-its-916-billion-saved-webpages-are-back-online/
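
For a wiki plugin or similar tool, the transformation is then a plain string prefix. A minimal sketch, assuming the original URL is appended verbatim as in the examples above; the function name `to_archive_url` is hypothetical, and how URLs containing query strings or fragments should be escaped is not covered in this thread:

```python
def to_archive_url(archivebox_base: str, original_url: str) -> str:
    """Build an ArchiveBox lookup URL by prefixing the original URL with /archive/."""
    # Strip a trailing slash so the result never contains "//archive/".
    return f"{archivebox_base.rstrip('/')}/archive/{original_url}"


print(to_archive_url("https://archivebox.example.com", "https://example.com/my-page"))
```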

Note you can also put any identifier for a snapshot after /archive/ and it will redirect correctly, e.g.:

  • https://demo.archivebox.io/archive/<snapshot_timestamp>
  • https://demo.archivebox.io/archive/<snapshot_URL>
  • https://demo.archivebox.io/archive/<snapshot_UUID>
  • https://demo.archivebox.io/archive/<snapshot_ABID> (a new publicly sharable ID format added in >=v0.8.5 designed to make sharing snapshots between federated/distributed servers easier in future releases)

The REST API and admin pages for editing snapshots also allow fetching by any identifier (in >=v0.8.5):

  • https://archivebox.phantasm.group/admin/core/snapshot/<snapshot UUID> or <timestamp> or <ABID>
  • https://archivebox.phantasm.group/api/v1/core/snapshot/<snapshot UUID> or <timestamp> or <ABID> (using the URL is not supported for these yet because I don't think it's needed as much for admins/API users)

In all cases you can also provide just the first few characters of the identifier to do a prefix search for all matching snapshots, e.g. to see all snapshots for https://arstechnica.com/* you can visit: https://demo.archivebox.io/archive/https://arstechnica.com/
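
Conceptually, the prefix search behaves something like the sketch below. This is an illustration only, not the actual ArchiveBox lookup code: the `resolve_prefix` function, the dict keys, and the sample rows are all made up, and it assumes timestamps are stored as equal-length strings so they sort chronologically.

```python
def resolve_prefix(prefix: str, snapshots: list[dict]) -> tuple[str, list[dict]]:
    """Find snapshots whose timestamp, ULID, or URL starts with the given prefix."""
    matches = [
        snapshot
        for snapshot in snapshots
        if any(snapshot[key].startswith(prefix) for key in ("timestamp", "ulid", "url"))
    ]
    if not matches:
        return "not_found", []
    if len(matches) == 1:
        return "redirect", matches
    # Several candidates: list them newest first so the user can pick one.
    newest_first = sorted(matches, key=lambda snapshot: snapshot["timestamp"], reverse=True)
    return "disambiguate", newest_first


# Made-up sample rows for illustration only.
SAMPLE_SNAPSHOTS = [
    {"timestamp": "20230116145642", "ulid": "01AN4Z07BY79KA1307SR9X4MV3",
     "url": "https://example.com/index.html"},
    {"timestamp": "20241020101500", "ulid": "01ARZ3NDEKTSV4RRFFQ69G5FAV",
     "url": "https://arstechnica.com/tech-policy/"},
    {"timestamp": "20241101120000", "ulid": "01BX5ZZKBKACTAV9WEVGEMMVRZ",
     "url": "https://arstechnica.com/gadgets/"},
]

action, found = resolve_prefix("https://arstechnica.com/", SAMPLE_SNAPSHOTS)
print(action, [snapshot["url"] for snapshot in found])
```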

We use this feature extensively with several of our paying clients whose needs are similar to what you describe.

It's not fully compatible with archive.org / Memento, but I have plans to make it cross-compatible with both in the future, which is what this ticket is meant to track.

pirate avatar Nov 20 '24 01:11 pirate

Oh nice! Thanks for the info!

ArrayBolt3 avatar Nov 20 '24 01:11 ArrayBolt3