ArchiveBox
Enhancement: Use the same URL layout as Archive.org for viewing ArchiveBox Snapshots `https://archive.org/web/<URL>`
To visit an archived version of a website (or archive it automatically) on Archive.org, one can just visit http://web.archive.org/web/https://example.com/ and it will redirect to http://web.archive.org/web/20230116145642/https://example.com/ (or whatever the most recent snapshot timestamp is).
To really embody the tagline "ArchiveBox is a self-hosted version of archive.org" we should properly support their URL scheme too.
e.g.
- `https://demo.archivebox.io/web/https://example.com` should redirect to the most recent snapshot `https://demo.archivebox.io/web/20230116145642/https://example.com`
- note: support both the ArchiveBox-style timestamp in unix timestamp format e.g. `1673919713` or the Archive.org-style `20230116145642` format and truncated forms `2023`, `202301`, `20230116`
- note: also support visiting snapshots using ulid uuid instead of timestamp as slug, e.g. `https://demo.archivebox.io/01ARZ3NDEKTSV4RRFFQ69G5FAV/...`
- note: support auto prefix-matching slugs so that `2023` matches `202301`, `20230116`, `20230116145642` automatically, and `01AN4Z07BY` matches `01AN4Z07BY79KA1307SR9X4MV3` automatically
Full spec:
https://demo.archivebox.io/web/<SLUG> where SLUG can be:
- an original URL, with or without scheme, e.g. https://example.com/index.html, 'example.com/index.html' ➡️ redirect to most recent snapshot for https://demo.archivebox.io/web/20230116145642/https://example.com/index.html
- an ArchiveBox snapshot UUID in ulid/spec format 01AN4Z07BY79KA1307SR9X4MV3/index.html or timestamp prefix 01AN4Z07BY/index.html ➡️ redirect to that exact snapshot https://demo.archivebox.io/web/20230116145642/https://example.com/index.html
- an ArchiveBox snapshot timestamp in YYYYMMDDHHMMSS, shortened forms like YYYYMM, or unix timestamp format, e.g. 20230116145642/index.html, 202301161456/index.html, 202301/index.html, or 1673919713/index.html ➡️ redirect to the most recent snapshot matching that prefix https://demo.archivebox.io/web/20230116145642/https://example.com/index.html
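A rough sketch of how the SLUG kinds listed above could be told apart, using the leading-character heuristics this issue proposes (`199`/`20` for years, `1` for unix timestamps, `01` for ULIDs, anything else treated as a URL). The function name `classify_slug` and the returned labels are made up for illustration and are not taken from the ArchiveBox codebase.

```python
import re

# Crockford base32 alphabet used by ULIDs (no I, L, O, U)
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def classify_slug(slug: str) -> str:
    """Guess what kind of identifier a /web/<SLUG> path starts with.

    Hypothetical helper: returns "url", "datetime", "unix", or "ulid".
    """
    # An explicit scheme always means an original URL
    if re.match(r"https?://", slug, re.IGNORECASE):
        return "url"

    # Only the first path segment can be a snapshot identifier
    head = slug.split("/", 1)[0]

    if head.isdigit():
        # Archive.org-style YYYYMMDDHHMMSS or a truncated form of it
        if head.startswith(("199", "20")) and len(head) <= 14:
            return "datetime"
        # Remaining all-digit slugs starting with 1 are treated as unix timestamps
        if head.startswith("1"):
            return "unix"

    # Full ULID (26 chars) or a prefix of one
    if head.startswith("01") and len(head) <= 26 and all(c in CROCKFORD for c in head.upper()):
        return "ulid"

    # Anything else is assumed to be a URL without a scheme
    return "url"


for example in ("example.com/index.html", "01AN4Z07BY/index.html", "202301/index.html", "1673919713/index.html"):
    print(example, "->", classify_slug(example))
```

The checks are ordered from most to least specific, so a scheme-less URL is the fallback rather than something that has to be positively detected.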
Subtasks:
- [x] adds derived `ulid` field + migration to coalesce old uuid and timestamp fields into new ulid format (+ asserts all snapshot timestamps are valid and are between 1900 and 2100 AD) (done in v0.8.5)
- [x] update admin and index UI to show ULID of old UUID4 `xxxx-xxxx-xxxxxxx` format, add ULID diagram in docs breaking it down into timestamp and randomness
- [x] create disambiguation page to show all the matching results for a given SLUG if it's the prefix for multiple possible snapshots
- [ ] reject Snapshot UUIDs being created that begin with `0`, `1`, `2`, `htt` to make prefix-matching faster and less error prone (avoids clashing with `199x*`/`20**` year, `1*` unix timestamp, `01*` ULIDs, or `http(s?)` URL slug prefixes)
- [ ] add docs examples on how to truly "self-host your own archive.org", add screenshot side-by-side of URL bar examples for visiting snapshots on Archive.org and demo.Archivebox.io
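For readers unfamiliar with the timestamp/randomness breakdown mentioned in the subtasks, here is a minimal sketch of splitting a ULID as laid out in the generic ULID spec: the first 10 Crockford base32 characters encode a 48-bit millisecond timestamp and the remaining 16 characters are randomness. This assumes the standard layout; the derived `ulid` field in ArchiveBox may be composed differently, and `split_ulid` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, U)
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def split_ulid(ulid: str) -> tuple[int, str]:
    """Split a 26-char ULID into (milliseconds since epoch, randomness chars)."""
    ulid = ulid.upper()
    if len(ulid) != 26 or any(c not in CROCKFORD for c in ulid):
        raise ValueError(f"not a valid ULID: {ulid!r}")

    # Decode the 10-character timestamp part as a base32 integer
    millis = 0
    for char in ulid[:10]:
        millis = millis * 32 + CROCKFORD.index(char)

    return millis, ulid[10:]


millis, randomness = split_ulid("01AN4Z07BY79KA1307SR9X4MV3")
created = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(created.isoformat(), randomness)
```

Because the timestamp occupies the leading characters, ULIDs created close together share a prefix, which is what makes prefix-matching on a short form like `01AN4Z07BY` meaningful.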
At least one project interested in using ArchiveBox (Kicksecure) would also be interested in this functionality, or any functionality that allows turning a URL into an archived URL via a simple transformation (i.e., prepend https://archivebox.example.org/whatever/goes/here/ to a URL to get an archived URL). The use case for this is:
- We have two very large MediaWiki instances, with wikis that contain many links to external websites.
- For each of those links, we want to link to an archived version of the page we link to.
- If the corresponding url for `https://example.com/my-page` is `https://archivebox.example.com/BN833Z` or something like that, there's no easy way to convert a link to an ArchiveBox link. Thus adding the archive links requires running a large "archive job" that archives all unarchived links, then gets the corresponding URLs and mass-edits them into the Wiki. This is a pain.
- If the corresponding url for `https://example.com/my-page` is `https://archivebox.example.com/web/https://example.com/my-page`, no mass-editing is required. A MediaWiki plugin can be used to put a button after each link that offers an archived version of the webpage to the user. (This is what we already do with archive.org.)
Worthy of note, the format doesn't have to be exactly like archive.org for this to work. If ArchiveBox supported the MementoWeb API similar to how archive.today does, we would end up turning https://example.com/my-page into https://archivebox.example.com/timegate/https://example.com/my-page, which works just as well.
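To make the "simple transformation" concrete, a minimal sketch of what a wiki plugin would need to do: prepend a fixed base and path prefix to the original URL. The helper name `archive_url` is hypothetical, and `archivebox.example.com` is the placeholder host used in this comment, not a real instance.

```python
def archive_url(base: str, original_url: str, prefix: str = "web") -> str:
    """Build an archived-copy link by prepending base + prefix to the original URL."""
    # Normalize slashes so the base and prefix join cleanly
    return f"{base.rstrip('/')}/{prefix.strip('/')}/{original_url}"


# Archive.org-style layout
print(archive_url("https://archivebox.example.com", "https://example.com/my-page"))

# Memento-style timegate layout
print(archive_url("https://archivebox.example.com", "https://example.com/my-page", prefix="timegate"))
```

Either layout works for this use case because the link can be computed from the original URL alone, with no lookup against the archive.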
Is help wanted here? Depending on how suitable ArchiveBox is for Kicksecure's use case, this might be a feature we'd be willing to implement and work on upstreaming.
This is actually already supported 😃 It's just not well documented yet. You can visit:
https://archivebox.example.com/archive/https://example.com/archived/url e.g.:
https://demo.archivebox.io/archive/https://arstechnica.com/tech-policy/2024/10/the-internet-archive-and-its-916-billion-saved-webpages-are-back-online/
Note you can also put any identifier for a snapshot after /archive/ and it will redirect correctly, e.g.:
- `https://demo.archivebox.io/archive/<snapshot_timestamp>`
- `https://demo.archivebox.io/archive/<snapshot_URL>`
- `https://demo.archivebox.io/archive/<snapshot_UUID>`
- `https://demo.archivebox.io/archive/<snapshot_ABID>` (a new publicly sharable ID format added in >=v0.8.5 designed to make sharing snapshots between federated/distributed servers easier in future releases)
The REST API and admin pages for editing snapshots also allow fetching by any identifier (in >=v0.8.5):
- `https://archivebox.phantasm.group/admin/core/snapshot/<snapshot UUID>` or `<timestamp>` or `<ABID>`
- `https://archivebox.phantasm.group/api/v1/core/snapshot/<snapshot UUID>` or `<timestamp>` or `<ABID>`

(using the URL is not supported for these yet because I don't think it's needed as much for admins/API users)
In all cases you can also provide just the first few characters of the identifier to do a prefix search for all matching snapshots, e.g. to see all snapshots for https://arstechnica.com/* you can visit: https://demo.archivebox.io/archive/https://arstechnica.com/
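As an illustration only, the prefix-search behavior described here could be modeled like this over an in-memory list of snapshots. The `Snapshot` fields, the `resolve` function, and the three outcome labels are assumptions made for the sketch and do not reflect how ArchiveBox implements the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical minimal record; real snapshots carry many more fields
    timestamp: str
    ulid: str
    url: str


def resolve(snapshots: list[Snapshot], identifier: str) -> tuple[str, list[Snapshot]]:
    """Prefix-match an identifier against every snapshot's timestamp, ULID, and URL."""
    matches = [
        snap for snap in snapshots
        if any(value.startswith(identifier) for value in (snap.timestamp, snap.ulid, snap.url))
    ]
    if not matches:
        return "not_found", []
    if len(matches) == 1:
        # A unique match can be redirected to directly
        return "redirect", matches
    # Several matches: list them all, newest first
    return "disambiguate", sorted(matches, key=lambda snap: snap.timestamp, reverse=True)


snapshots = [
    Snapshot("20230116145642", "01AN4Z07BY79KA1307SR9X4MV3", "https://example.com/index.html"),
    Snapshot("20230301000000", "01BX5ZZKBKACTAV9WEVGEMMVRZ", "https://arstechnica.com/a"),
]
print(resolve(snapshots, "2023")[0])
print(resolve(snapshots, "https://arstechnica.com/")[0])
```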
We use this feature extensively with several of our paying clients who have similar needs as what you describe.
It's not fully compatible with archive.org / memento, but I have plans to make it cross-compatible with both in the future, which is what this ticket is meant to track.
Oh nice! Thanks for the info!