Adding Web items (Wayback)

Open mitra42 opened this issue 6 years ago • 8 comments

General idea is to add Wayback items / web archives to dweb.archive.org

Non-trivial, as Wayback uses different formats etc. from the rest of archive.org

See https://github.com/mitra42/dweb-universal/issues/2

Notes follow ... See also Notebook pg 21

mitra42 avatar Aug 30 '18 22:08 mitra42

Overview (from call with Mark 2018-08-31) of IA internals.

  • Archives are in WARC files - essentially zips of zips
  • Live in Collections e.g. https://archive.org/details/liveweb
  • Multiple indexes, but essentially accessed as one called “cdx”, approx 30TB in size, with a RAM version in Redis for fast additions, a batch-updated master, and a master master updated approx every 4 weeks.

Process starts with a URL, looks it up in CDX to get what captures we have, and displays them in the UI. User selects dates, then we get the content from the WARC.

On command line ([ ] need to find a place I can run this, it doesn't work on dweb.me):

    cdx http://www.google.com/ -p from=20180113 -p to=20180113
    cdx http://www.google.com/ -p from=20180130 -p to=20180130 --fl timestamp

First finds where on Petabox, then which WARC file, then the offset and range (into the compressed zip).
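The "offset and range into the compressed zip" step can be sketched in a few lines. This is a minimal sketch and not IA's actual code. It assumes the usual .warc.gz convention where every record is gzipped on its own and the members are concatenated, so one record can be pulled out by byte range and decompressed in isolation. `read_record` and the in-memory stand-in archive are hypothetical names made up for illustration.

```python
import gzip
import io


def read_record(fileobj, offset, length):
    """Return the decompressed bytes of the single gzip member stored at
    [offset, offset + length) in fileobj (hypothetical helper)."""
    fileobj.seek(offset)
    compressed = fileobj.read(length)
    return gzip.decompress(compressed)


# Stand-in for a .warc.gz: two records, each gzipped separately, then concatenated.
# The record bodies are made up for illustration; they are not real captures.
records = [
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\nfirst record",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\nsecond record",
]
members = [gzip.compress(r) for r in records]
archive = io.BytesIO(b"".join(members))

# An index entry would supply these two numbers; here they are computed directly.
offset = len(members[0])
length = len(members[1])
second = read_record(archive, offset, length)
```

Against a real file the same seek-and-read could presumably be done as an HTTP Range request, which is what makes a per-capture index of offsets useful.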

mitra42 avatar Aug 30 '18 22:08 mitra42

Documentation … (from call with Mark 2018-08-31) of IA internals.

  • web.archive.org - bottom left tools > availability
  • https://archive.org/help/wayback_api.php covers availability / memento / CDX
  • http://ws-dl.blogspot.com/2013/07/2013-07-15-wayback-machine-upgrades.html super useful on memento
  • https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server for CDX

Also try

  • web.archive.org url, playback, look at “About this capture” can see individual objects.

mitra42 avatar Aug 30 '18 22:08 mitra42

Notes on CDX - Can be expensive for a full date range … results may be large for popular sites like www.google.com or www.cnn.com, but OK given the volume of traffic anyway

mitra42 avatar Aug 30 '18 22:08 mitra42

Possible solutions …

Will need to take a page at a date and push that into Dweb. Could either do IPFS of the whole thing, or IPFS of a new WARC made of the files needed. Latter is probably harder as it would add duplication. Solution might be to feed each file into IPFS urlstore, remember the IPFS hashes in the gateway Redis for now, and return the IPFS hash of the HTML.

mitra42 avatar Aug 30 '18 22:08 mitra42

Notes on Memento

Memento Web is federated search on top of CDX. There is a service with an API, http://timetravel.mementoweb.org, which searches IA, British Library and a few others (federated, not decentralized).

API guide: http://timetravel.mementoweb.org/guide/api/

https://www.cs.odu.edu/~mln/ is the expert (Mark can intro)

mitra42 avatar Aug 30 '18 22:08 mitra42

Notes as I try to get my head around this beast! From: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

http://web.archive.org/cdx/search/cdx?url=mitra.biz&fastLatest=true&limit=-1&filter=statuscode:200

Gets the most recent successful capture of mitra.biz
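For scripting the same query, a rough sketch: build the CDX URL from a dict of parameters and turn a JSON response into dicts. It assumes the server is asked for `output=json` (described in the wayback-cdx-server README), in which case the first row is a header naming the fields. `cdx_url` and `parse_cdx_json` are hypothetical helper names, and nothing here touches the network.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_url(url, params=None):
    """Build a CDX query URL; params is a dict such as {"limit": -1}."""
    query = [("url", url)] + list((params or {}).items())
    # safe=":" keeps filter=statuscode:200 readable instead of %3A
    return CDX_ENDPOINT + "?" + urlencode(query, safe=":")


def parse_cdx_json(text):
    """Turn an output=json response (header row, then data rows) into dicts."""
    if not text.strip():
        return []
    rows = json.loads(text)
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


latest_ok = cdx_url(
    "mitra.biz",
    {"fastLatest": "true", "limit": -1, "filter": "statuscode:200"},
)
```

`latest_ok` is the same URL as above; adding `"output": "json"` to the dict gives a response that `parse_cdx_json` can read.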

mitra42 avatar Jul 29 '19 20:07 mitra42

Notes from meeting with Kenji today ....

https://archive.org/wayback/available?url=www.mitra.biz&statuslist=200,302

Gets URL of the most recent 302/200 capture. Today "url" is "http://web.archive.org/web/20190624190417/https://www.mitra.biz/"

Via curl (not via browser), which shows the headers:

    curl -v -o- 'http://web.archive.org/web/20190624190417/https://www.mitra.biz/'

Gets a 302 to the index.html capture:

    curl -v -o- http://web.archive.org/web/20190624190417/https://www.mitra.biz/index.html

Gets HTML with links munged
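A small sketch of handling the availability lookup from a script. It assumes the JSON shape documented at https://archive.org/help/wayback_api.php (`archived_snapshots` > `closest` with `available`, `url`, `timestamp`). The `statuslist` parameter is taken from the notes above, not from that documentation. All three function names are hypothetical, and no request is made here.

```python
import json
import re
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(url, statuslist=()):
    """Build the availability query; statuslist is e.g. [200, 302]."""
    query = [("url", url)]
    if statuslist:
        query.append(("statuslist", ",".join(str(s) for s in statuslist)))
    # safe="," keeps the comma-separated status list unescaped
    return AVAILABILITY_ENDPOINT + "?" + urlencode(query, safe=",")


def closest_snapshot(response_text):
    """Return the 'closest' snapshot dict, or None if nothing is available."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest


def split_replay_url(replay_url):
    """Split a plain replay URL into (timestamp, original_url), else None."""
    m = re.match(r"https?://web\.archive\.org/web/(\d{1,14})/(.+)$", replay_url)
    return (m.group(1), m.group(2)) if m else None


lookup = availability_url("www.mitra.biz", [200, 302])
parts = split_replay_url(
    "http://web.archive.org/web/20190624190417/https://www.mitra.biz/"
)
```

`split_replay_url` only handles the plain `/web/<timestamp>/<url>` form seen in these notes; any other replay URL shape returns None.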

Kenji has an experimental headless browser service that returns the DOM once these are "played"; he will send me a URL

mitra42 avatar Aug 09 '19 21:08 mitra42

And here are some old notes from a slack convo in Nov

  • https://waybackrebuilder.com
  • http://waybackdownloader.com
  • http://www.waybackmachinedownloader.com/en/
  • https://www.waybackdownloads.com

https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

mitra42 avatar Aug 15 '19 00:08 mitra42