distributed-wikipedia-mirror
ZIM mirror at download.wikipedia-on-ipfs.org
This is just a quick brain dump of an initial idea. If there are gaps in the plan, things to consider, or known unknowns, let me know in the comments.
## Setting the scene
The Kiwix project provides HTTP and BitTorrent links to ZIM files:
- http://wiki.kiwix.org/wiki/Content_in_all_languages
They provide an rsync server for anyone interested in mirroring the data:
- https://download.kiwix.org/README
The produced mirror looks like this:
- https://download.kiwix.org/
go-ipfs has two special datastore types:
- filestore: allows files to be added without duplicating the space they take up on disk
- urlstore: allows IPFS to retrieve block contents via a URL instead of storing them in the datastore
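As an illustration, both datastores sit behind experimental flags in go-ipfs. A rough sketch of how each would be used (the local directory and the ZIM URL below are placeholders, and flag names may vary between go-ipfs versions):

```shell
# Both datastores are experimental and must be enabled in the node's config first.
ipfs config --json Experimental.FilestoreEnabled true
ipfs config --json Experimental.UrlstoreEnabled true

# filestore: add ZIM files in place, without copying blocks into the IPFS repo;
# the directory on disk remains the source of truth.
ipfs add -r --nocopy ./zim

# urlstore: register content backed by an existing HTTP mirror instead of local storage
# (placeholder URL shown; any stable mirror URL would do).
ipfs urlstore add https://download.kiwix.org/zim/wikipedia_en_all.zim
```

Note that with `--nocopy`, moving or modifying the underlying files invalidates the corresponding blocks, which is why the rsynced directory would have to stay intact.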
## Idea
Here is the idea: set up an IPFS-backed mirror at download.wikipedia-on-ipfs.org
- [ ] import Kiwix data to IPFS
  - we could rsync the data to a local machine, and then add it to IPFS
    - use filestore to keep a single copy on disk, making the rsync directory the source of truth
  - or we could use urlstore to bootstrap from existing HTTP mirrors
  - or we could use badgerds and figure out the best `ipfs add` parameters / chunker settings to maximize deduplication (https://github.com/ipfs/distributed-wikipedia-mirror/issues/71)
- [ ] update the DNSLink for download.wikipedia-on-ipfs.org
- [ ] (optional) update the collaborative cluster (#68)
- [ ] update the script responsible for generating /wiki/Content_in_all_languages to add IPFS CIDs
  - each CID should be a link to a content-addressed URL at a gateway (so IPFS-aware tools can upgrade the transport to IPFS)
  - PATH → CID can be resolved via the https://ipfs.io/api/v0/resolve API
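The DNSLink step above boils down to publishing a TXT record. A minimal sketch, where the CID is a made-up placeholder (the real value would come out of adding the mirror root to IPFS):

```shell
# Hypothetical root CID; substitute the actual CID of the mirror directory.
CID="bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"
DOMAIN="download.wikipedia-on-ipfs.org"

# DNSLink is a TXT record on the _dnslink. subdomain pointing at the content:
printf '_dnslink.%s. IN TXT "dnslink=/ipfs/%s"\n' "$DOMAIN" "$CID"

# PATH → CID resolution can later be checked through a gateway, e.g.:
#   curl "https://ipfs.io/api/v0/resolve?arg=/ipns/$DOMAIN"
```

Because DNSLink is just DNS, updating the mirror means re-adding the data and rewriting this one TXT record with the new root CID.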
## Related
- Use ZIMs directly (https://github.com/ipfs/distributed-wikipedia-mirror/issues/42)