
Question: Use ZIM files directly?

Open kelson42 opened this issue 7 years ago • 10 comments

Would that be possible without having to unpack it into millions of small single files? Would software like kiwix-serve (or any other ZIM reader) be able to serve content through IPFS?

More information about kiwix-serve: http://wiki.kiwix.org/wiki/Kiwix-serve

kelson42 · Jun 11 '17

This could be very useful, because (with an appropriate client) local ZIM files fetched from IPFS could be searched, etc.

singpolyma · Oct 02 '17

cross-post from https://github.com/derhuerst/build-wikipedia-feed/issues/2#issue-261487197 :

Have you looked into getting the images from articles as well? From what I understand, Kiwix has distributions where they bundle images in (for example, wikipedia_en_all_2016-12.zim is 62GB)... I actually wrote a parser for their format recently, https://www.npmjs.com/package/zimmer, but haven't finished doing the actual import-to-dat part :P

derhuerst · Oct 02 '17

@lidel I would be really interested in refreshing this ticket. Let me know if you want to discuss it. Pretty motivated to help this project move forward.

kelson42 · Sep 09 '19

Potential path to using ZIM directly

Publishing ZIM on IPFS

Something we could start doing today is publishing .zim files on IPFS. The IPFS CID could be listed along with the HTTP and BitTorrent ones. It would act as a distributed CDN: files could be accessed via a local node (ipfs-desktop / go-ipfs) or one of the public HTTP gateways.

@kelson42 Is this something Kiwix would be interested in trying? FYI, there are two experimental ways of adding data to IPFS without duplicating it on disk: ipfs-filestore and ipfs-urlstore – these may be useful when introducing IPFS to existing infrastructure.

Web-based reader

This one is a long shot, but it opens exciting possibilities: if a ZIM reader in pure JS existed, a web browser would be the only software required to browse the distributed mirror. The reader would be just a set of static HTML+CSS+JS files published on IPFS along with the ZIM archives, making it a self-contained proposition.

While it would be possible to read ZIM from an HTTP Gateway via range requests, a more decentralized option would be to run an embedded JS IPFS node on the page and request specific byte ranges via something like

ipfs.cat('/ipfs/QmHash/wiki.zim', { offset: x, length: y })
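Fleshed out a bit, a minimal sketch of such a range read might look like the following (assuming js-ipfs via ipfs-core – the exact API surface varies between versions – and reusing the placeholder CID path from the snippet above):

```js
// Minimal sketch, assuming js-ipfs (ipfs-core); QmHash is a placeholder CID.
// Fetches only the first 4 bytes of the archive – the ZIM magic number –
// instead of downloading the whole file.
import { create } from 'ipfs-core'

const ipfs = await create()

const bytes = []
for await (const chunk of ipfs.cat('/ipfs/QmHash/wiki.zim', { offset: 0, length: 4 })) {
  bytes.push(...chunk)
}
// A ZIM archive starts with the magic bytes 0x5A 0x49 0x4D 0x04 ("ZIM", 4)
console.log(new Uint8Array(bytes))
```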

Question: how feasible is this from the perspective of existing ZIM readers? Is there any prior art for a JS one, apart from zimmer?

Concerns / Unknowns

I am new to this endeavor, but from quick eyeballing it looks like ZIM is a flat file optimized for random seeks within it. At some level it is similar to files put on IPFS: they get chunked, and the resulting blocks are assembled into balanced trees (Merkle DAGs) optimized for random access. Not sure what performance would look like if we put a ZIM on IPFS and try to fetch it over the network, but we can experiment with this.
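As an illustration of that chunked layout, here is a hedged sketch (assuming js-ipfs via ipfs-core; the file name is illustrative) that adds a file and inspects the links of its root DAG node:

```js
// Hedged sketch, assuming js-ipfs (ipfs-core); 'wiki.zim' is an illustrative path.
// Adds a file, then lists the links of its root UnixFS node to see how it was
// split into blocks.
import { create } from 'ipfs-core'
import fs from 'fs'

const ipfs = await create()
const { cid } = await ipfs.add(fs.createReadStream('wiki.zim'))

const { value: root } = await ipfs.dag.get(cid)   // dag-pb root node
console.log(`${root.Links.length} child blocks under the root`)
for (const link of root.Links.slice(0, 5)) {
  console.log(link.Hash.toString(), link.Tsize)   // per-block CID and size
}
```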

Data deduplication (dedicated chunker for ZIM?)

A potential problem with using ZIM directly is deduplication. When we put the unpacked mirror on IPFS, a lot of data does not change between snapshots. All media assets such as images, audio files, etc. get deduplicated across the entire IPFS swarm (all snapshots, and all websites using the same image, are cohosting it).

IIUC, ZIM does internal compression of Clusters (>1MB) of data, which means each ZIM file is a different stream of bytes, defeating the deduplication provided by IPFS.

My understanding is that good deduplication is not possible unless ZIM Cluster compression is deterministic across snapshots (it always compresses the same assets, and compressing the same assets produces exactly the same array of bytes) AND/OR we add ZIM to IPFS using a custom chunker that is aware of its internal structure, enabling deduplication of the same content across snapshots. This could also be a neat demo of what is possible with https://ipld.io

Update: created https://github.com/ipfs/distributed-wikipedia-mirror/issues/71 to benchmark the level of deduplication we can get with a regular ipfs add plus some custom parameters.
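As a rough idea of how such a benchmark could work, here is a hedged sketch (assuming js-ipfs via ipfs-core; the snapshot file names and chunker string are illustrative, not a recommendation) that adds two snapshots with the same content-defined chunker and counts the blocks they share:

```js
// Hedged sketch, assuming js-ipfs (ipfs-core); file names and chunker settings
// are illustrative.
import { create } from 'ipfs-core'
import fs from 'fs'

const ipfs = await create()

async function addAndListBlocks (path) {
  const { cid } = await ipfs.add(fs.createReadStream(path), {
    chunker: 'rabin-262144-524288-1048576', // content-defined chunking
    rawLeaves: true
  })
  const blocks = new Set()
  for await (const { ref } of ipfs.refs(cid, { recursive: true })) {
    blocks.add(ref)
  }
  return blocks
}

const a = await addAndListBlocks('wikipedia_en_2021-01.zim')
const b = await addAndListBlocks('wikipedia_en_2021-02.zim')
const shared = [...a].filter(c => b.has(c)).length
console.log(`blocks shared between snapshots: ${shared} / ${a.size}`)
```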

Please let me know if I missed something here.

lidel · Sep 09 '19

@lidel Distributing ZIM files via IPFS would be interesting, and I would volunteer to do it if the process is not too complex.

We have two ZIM readers in JavaScript:

  • A binding library: https://github.com/openzim/libzim
  • A pure JavaScript reader (mostly distributed as an extension for Chrome and Firefox): https://github.com/kiwix/kiwix-js

But so far Kiwix-JS is not able to read a ZIM file online; see https://github.com/kiwix/kiwix-js/issues/356

But what I had in mind originally was to provide a server-side service (so not just HTML files) able to read the ZIM files on demand and serve the content via IPFS. This would simplify publication by avoiding the data-extraction step from the ZIM files. Not sure this is technically possible.

kelson42 · Sep 09 '19

Another possibility would be to compile https://github.com/dignifiedquire/zim to WebAssembly for use in a browser.

eminence · Sep 09 '19

@eminence This is basically possible with libzim/libkiwix as well... but it looks like the result is not able to handle files over 4GB :(

kelson42 · Sep 09 '19

Hi friends, I've published a draft of a devgrant for adding IPFS support to kiwix-js: https://github.com/ipfs/devgrants/pull/49 (readable version: https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md)

It tries to define the steps needed for kiwix-js to read Wikipedia .zim archives from IPFS. Right now we are looking for people with bandwidth and interest in creating a PoC to test the feasibility of that approach. Feel free to comment on https://github.com/ipfs/devgrants/pull/49

lidel · Apr 29 '20

(quick update on the state of things for the drive-by reader)

The current process of unpacking ZIM and tweaking HTML on a per-case basis is very, very wasteful, impossible to automate across different languages, and not sustainable. Every time something breaks, we need to sink a lot of time into fixing the build scripts – if we had allocated that time to a web-based ZIM reader, we would already be there.

I believe our effort should go into putting ZIMs on IPFS and then reading them from IPFS in a web browser (as elaborated in the original idea draft above, plus the latest research tracked in the past in https://github.com/kiwix/kiwix-js/issues/595 and continued now in https://github.com/kiwix/kiwix-js/issues/659).

lidel · Jan 25 '21

Related: a low-hanging optimization for future builds may be adding the ZIM file to IPFS using a trickle-dag (ipfs add --trickle), which is optimized for random seeking (the default layout is optimized for reducing link count).

IIUC this should improve the use case where a ZIM is read over HTTP range requests (if we talk to a public gateway or use preload servers). Nah, not really. We just need to use ZIMs directly.

tl;dr we need a web-based reader for ZIM archives: https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md

NOTE: the link above is the old devgrant; we can now do it better with

lidel · Apr 12 '21

A lot has changed since we last looked into this. Many new opportunities and protocols exist now that did not before. Direct use of ZIMs continues in a fresh issue: https://github.com/ipfs/distributed-wikipedia-mirror/issues/140

lidel · Oct 12 '23