ZIM mirror at download.wikipedia-on-ipfs.org

Open lidel opened this issue 5 years ago • 1 comments

This is just a quick brain dump of an initial idea. If there are gaps in the plan, things to consider, or known unknowns, let me know in the comments.

Setting the scene

The Kiwix project provides HTTP and BitTorrent links to ZIM files:

  • http://wiki.kiwix.org/wiki/Content_in_all_languages

They also provide an rsync server for anyone interested in mirroring the data:

  • https://download.kiwix.org/README

The resulting mirror looks like this:

  • https://download.kiwix.org/

go-ipfs has two special datastore types:

  • filestore: allows files to be added without duplicating the space they take up on disk.
  • urlstore: allows IPFS to retrieve block contents via a URL instead of storing them in the datastore.
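
To make the two datastore options concrete, here is a minimal sketch of the `ipfs` CLI invocations they would involve, built as argument lists in Python. It assumes the experimental flags `Experimental.FilestoreEnabled` and `Experimental.UrlstoreEnabled`, `ipfs add --nocopy`, and `ipfs urlstore add` as they existed in go-ipfs around the time of this issue; later releases may differ. The directory `/data/kiwix` is a hypothetical local rsync target, not something defined in this issue.

```python
import shlex
from typing import List


def enable_experiment_cmd(name: str) -> List[str]:
    """Command that switches on an experimental go-ipfs feature by name."""
    return ["ipfs", "config", "--json", f"Experimental.{name}", "true"]


def filestore_add_cmd(directory: str) -> List[str]:
    """Command that adds a directory via filestore (blocks reference files on disk)."""
    return ["ipfs", "add", "--recursive", "--nocopy", directory]


def urlstore_add_cmd(url: str) -> List[str]:
    """Command that adds a single HTTP URL via urlstore (blocks are fetched from the URL)."""
    return ["ipfs", "urlstore", "add", url]


def render(cmd: List[str]) -> str:
    """Quote an argument list so it can be pasted into a shell."""
    return " ".join(shlex.quote(part) for part in cmd)


# Print the commands instead of running them; the paths below are placeholders.
for cmd in (
    enable_experiment_cmd("FilestoreEnabled"),
    filestore_add_cmd("/data/kiwix"),  # hypothetical rsync target directory
    enable_experiment_cmd("UrlstoreEnabled"),
    urlstore_add_cmd("https://download.kiwix.org/README"),
):
    print(render(cmd))
```

Printing rather than executing keeps the sketch safe to run on a machine without an IPFS repo; passing the same lists to `subprocess.run` would be the obvious next step.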

Idea

Here is the idea: set up an IPFS-backed mirror at download.wikipedia-on-ipfs.org

  • [ ] import Kiwix data to IPFS
    • we could rsync the data to a local machine, and then add it to IPFS
      • use filestore to keep a single copy on disk, making the rsync directory the source of truth
    • or we could use urlstore to bootstrap from existing HTTP mirrors
    • or we could use badgerds and figure out the best ipfs add parameters / chunker settings to maximize deduplication (https://github.com/ipfs/distributed-wikipedia-mirror/issues/71)
  • [ ] update the DNSLink record for download.wikipedia-on-ipfs.org
  • [ ] (optional) update the collaborative cluster (#68)
  • [ ] update the script responsible for generating /wiki/Content_in_all_languages to add IPFS CIDs
    • each CID should be a link to a content-addressed URL at a gateway (so IPFS-aware tools can upgrade the transport to IPFS)
    • PATH → CID can be resolved via the https://ipfs.io/api/v0/resolve API
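
The last two checklist items could be sketched roughly as follows. This assumes the resolve endpoint takes the path as an `arg` query parameter and answers with JSON of the form `{"Path": "/ipfs/<cid>"}`; the use of POST, the `https://ipfs.io` gateway base, and all helper names are assumptions for illustration, not something settled in this issue.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

RESOLVE_API = "https://ipfs.io/api/v0/resolve"  # endpoint named in the issue
GATEWAY = "https://ipfs.io"  # assumption: any public gateway would do


def resolve_request_url(path: str, api: str = RESOLVE_API) -> str:
    """Build the resolve URL for a path such as /ipns/<domain>/<file>."""
    return f"{api}?{urlencode({'arg': path})}"


def cid_from_response(body: str) -> str:
    """Extract the root CID from a resolve response like {"Path": "/ipfs/<cid>"}."""
    resolved = json.loads(body)["Path"]
    prefix = "/ipfs/"
    if not resolved.startswith(prefix):
        raise ValueError(f"unexpected resolved path: {resolved!r}")
    return resolved[len(prefix):].split("/", 1)[0]


def gateway_url(cid: str, gateway: str = GATEWAY) -> str:
    """Content-addressed link to publish next to the HTTP and BitTorrent links."""
    return f"{gateway}/ipfs/{cid}"


def dnslink_txt(cid: str) -> str:
    """Value of the DNSLink TXT record (set on _dnslink.<domain>)."""
    return f"dnslink=/ipfs/{cid}"


def resolve_cid(path: str) -> str:
    """Network call; POST is an assumption, some deployments may accept GET."""
    request = Request(resolve_request_url(path), method="POST")
    with urlopen(request) as response:
        return cid_from_response(response.read().decode("utf-8"))
```

Keeping URL construction and response parsing separate from the network call means the generator script can be tested offline; only `resolve_cid` touches the network.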

Related

  • Use ZIMs directly (https://github.com/ipfs/distributed-wikipedia-mirror/issues/42)

lidel, Jan 21 '20 20:01