dweb-archive
Mediatype: texts - should use dweb instead of direct calls
The system currently supports text through Richard's player; at some point it needs to work with that player so that it can be decentralized.
Some previous notes on this: see Richard Caceres's recent Slack chat about the new version, ~/git/ia_bookreader : https://github.com/internetarchive/bookreader
Brewster mentioned you were using bookreader, and I should let you know there's a new version that's much easier to use outside of IA
---- OLD notes ----
Richard sent some links in Skype. Need to either a) use Mek's IIIF reader, or b) use the bookreader: get the JSON file (could decentralize), and then inside JSIA is a way to get the page.
view source here: https://archive.org/stream/10_PRINT_121114#page/n0/mode/2up
datafile: https://ia902603.us.archive.org/BookReader/BookReaderJSIA.php?id=10_PRINT_121114&itemPath=/4/items/10_PRINT_121114&server=ia902603.us.archive.org&format=jsonp&subPrefix=10_PRINT_121114&version=aHe9koCh&callback=jQuery11020015266682030218304_1515798833273&_=1515798833274
bookreader initialization library: https://archive.org/bookreader/BookReaderJSIA.js?v=aHe9koCh
image example: https://ia902603.us.archive.org/BookReader/BookReaderImages.php?zip=/4/items/10_PRINT_121114/10_PRINT_121114_jp2.zip&file=10_PRINT_121114_jp2/10_PRINT_121114_0000.jp2&scale=4&rotate=0
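To see what a mirror or gateway would need to fetch for one page, the image URL above can be split into its query parameters. This is only a sketch: `parseBookReaderImageUrl` is a hypothetical helper, and it assumes the `zip`, `file`, `scale` and `rotate` parameters seen in the example are the ones that matter.

```javascript
// Hypothetical helper: split a BookReaderImages.php URL into its parts.
function parseBookReaderImageUrl(imageUrl) {
  const url = new URL(imageUrl);
  const params = url.searchParams;
  return {
    server: url.hostname,         // datanode serving the image
    zip: params.get("zip"),       // path of the jp2 zip on that datanode
    file: params.get("file"),     // member inside the zip for this page
    scale: params.get("scale"),   // null when the parameter is absent
    rotate: params.get("rotate"), // null when the parameter is absent
  };
}

const example = parseBookReaderImageUrl(
  "https://ia902603.us.archive.org/BookReader/BookReaderImages.php" +
    "?zip=/4/items/10_PRINT_121114/10_PRINT_121114_jp2.zip" +
    "&file=10_PRINT_121114_jp2/10_PRINT_121114_0000.jp2&scale=4&rotate=0"
);
console.log(example.zip, example.file);
```

The `zip` and `file` values are what a local server would need in order to serve the same page out of a cached copy of the zip.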
Notes from revisiting this?
Open questions:
[ ] How to view PDFs, and/or how to make the .jpgs
Research steps:
[ ] Look at https://github.com/internetarchive/bookreader BookReaderDemo/demo-simple.html and BookReaderJSSimple.js
What I've found: the main JSON control file is [https://openlibrary.org/query.json?type=/type/edition&*=&ocaid=zandvoort.newspapers.1992.zandvoorts.nieuwsblad&callback=jQuery110207786013323137531_1545886524531&_=1545886524532], which says it's application/javascript but is actually application/json. [Question posed to Richard] It's not clear to me how to pass this to bookreader.
It contains URLs like [https://ia802605.us.archive.org/BookReader/BookReaderImages.php?zip=/9/items/zandvoort.newspapers.1992.zandvoorts.nieuwsblad/1992.Zandvoorts.Nieuwsblad_jp2.zip&file=1992.Zandvoorts.Nieuwsblad_jp2/1992.Zandvoorts.Nieuwsblad_0000.jp2] for page 0. It's not clear to me whether these are formulaic, but it probably doesn't matter: dweb-mirror should be able to pull the zip and then edit the URLs in the control file before passing it to bookreader. dweb-archive would also have to intercept where BookReader fetches these files.
There is a strange URL [https://openlibrary.org/query.json?type=/type/edition&*=&ocaid=WillieLynchLetter1712&callback=jQuery11020018995238347655485_1545885427175&_=1545885427176] which says it's application/json but actually returns application/javascript.
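Both query URLs carry a `callback=` parameter, which is the usual way of asking for JSONP, so the body may be JSON wrapped in a function call; that could be part of the content-type confusion, though this is a guess. A minimal sketch of a tolerant parser that accepts either form, with `unwrapJsonp` as a hypothetical helper name:

```javascript
// Hypothetical helper: accept either plain JSON or a JSONP body such as
// jQuery1102_1545885427175({...}); and return the parsed data.
function unwrapJsonp(body) {
  const text = body.trim();
  // Plain JSON starts with { or [, so parse it directly.
  if (text.startsWith("{") || text.startsWith("[")) {
    return JSON.parse(text);
  }
  // Otherwise expect callbackName( ... ) with an optional trailing semicolon.
  const match = text.match(/^[\w$.]+\s*\(([\s\S]*)\)\s*;?$/);
  if (!match) {
    throw new Error("Response is neither JSON nor JSONP");
  }
  return JSON.parse(match[1]);
}

// The payloads here are made up for illustration, not real query.json output.
console.log(unwrapJsonp('jQuery1102_1545885427175([{"ocaid": "example"}]);'));
console.log(unwrapJsonp('[{"ocaid": "example"}]'));
```

Parsing this way means the caller does not have to trust the Content-Type header in either direction.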
Options
- fetch PDF etc and view in an IFRAME - need to figure out supported formats
- get images as files - need to figure out how to find image urls like above and how to sync those and then pass to Bookreader
- get zipfile, and json, edit JSON to use local URLs and pass to bookreader and/or intercept where bookreader pulls the files. (The latter would be hard or impossible, as it is running in an IFRAME.)
I’m trying to figure out a strategy to do this in both the Dweb and offline cases; it's tricky in both.
For dweb.archive.org I think I have to ….
- Pull the metadata (via dweb as usual)
- Pull the JSON (via dweb)
- Have the gateway server push the images into IPFS etc., and modify the JSON returned to point at those locations. (Non-trivial)
- Find the place in the book reader where it fetches files and have it go to DwebTransports with those dweb URLs
For dweb-mirror (offline), where there is a local server:
- Pull metadata and cache it
- Pull JSON (unmodified) from Archive, but modify URLs just to strip the hostname before caching and passing to browser.
- Pull the zip file and cache it on local server
- Either unzip the file, or find an npm module that can unzip one file at a time.
- Book reader will then access the local server with a URL it can interpret, and the server returns each file
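The "strip the hostname" step for dweb-mirror could look roughly like the sketch below. `stripArchiveHost` is a hypothetical helper; it assumes the control file is ordinary JSON whose string values may be absolute URLs on archive.org hosts, and that the local server answers on the same paths.

```javascript
// Hypothetical helper: walk a parsed JSON value and turn absolute
// archive.org URLs into host-relative ones, so the browser asks the
// local server instead.
function stripArchiveHost(value) {
  if (typeof value === "string") {
    // Assumption: only URLs on archive.org or its subdomains are rewritten.
    return value.replace(/^(?:https?:)?\/\/(?:[a-z0-9-]+\.)*archive\.org(?=\/)/i, "");
  }
  if (Array.isArray(value)) {
    return value.map(stripArchiveHost);
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [key, stripArchiveHost(child)])
    );
  }
  return value; // numbers, booleans and null pass through unchanged
}

// Field names below are invented for illustration only.
const rewritten = stripArchiveHost({
  pages: [
    {
      uri: "https://ia802605.us.archive.org/BookReader/BookReaderImages.php?zip=/9/items/example/example_jp2.zip&file=example_jp2/example_0000.jp2",
      width: 800,
    },
  ],
});
console.log(rewritten.pages[0].uri);
```

Rewriting by walking the parsed structure, rather than by string replacement on the raw body, avoids touching anything that merely looks like a URL inside other text.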
Done: ./crawl.js --level all zandvoort.newspapers.1992.zandvoorts.nieuwsblad but it missed the big files (>700MB for the zip)
(Note to self - see EN/Dweb - Archive - Text)
For an example of a text item with multiple "books", try https://archive.org/details/ialerequestsummary (the books are one page each).
EDITED: Background info: multipage books thetaleofpeterra14838gut and alicesadventures19033gut are reasonably small but are displaying as a slide carousel. [https://archive.org/search.php?query=mediatype:texts%20AND%20imagecount:8] shows small ones, and unitednov65unit is an example.
[ ] Figure out what switches slide carousel or bookreader
From Jeff Kaplan: typically if an item is mediatype=texts and there is an abbyy and pdf file then it will result in a bookreader presentation. Loose images would not result in a pdf or bookreader presentation. And an item with abbyy and pdf that is not mediatype=texts would have no bookreader presentation; it would need to be mediatype=texts.
See - #109 for failure case (Peter Rabbit) that should use slide carousel
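Jeff's rule of thumb can be written down as a small predicate. This is only a sketch of the rule as stated, not the logic archive.org actually uses: `choosePresentation` is a hypothetical name, detecting the abbyy and pdf files by file name is an assumption, and so is falling back to the carousel for everything else.

```javascript
// Hypothetical sketch of the rule: mediatype=texts plus an abbyy file
// plus a pdf file gives bookreader; anything else falls back to the carousel.
function choosePresentation(mediatype, fileNames) {
  const names = fileNames.map((name) => name.toLowerCase());
  // Assumption: the abbyy file can be recognised by "abbyy" in its name.
  const hasAbbyy = names.some((name) => name.includes("abbyy"));
  // Assumption: the pdf can be recognised by its extension.
  const hasPdf = names.some((name) => name.endsWith(".pdf"));
  if (mediatype === "texts" && hasAbbyy && hasPdf) {
    return "bookreader";
  }
  return "carousel";
}

// File names are invented for illustration.
console.log(choosePresentation("texts", ["book_abbyy.gz", "book.pdf"]));
console.log(choosePresentation("texts", ["page1.jpg", "page2.jpg"]));
```

The failure case in #109 suggests the real switch may depend on more than these two files, so treat this as a starting point for the open checkbox above.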