distributed-wikipedia-mirror
Automate snapshot updates
This is a placeholder issue. Will be updated with more details when we gain better understanding of what is needed here.
In the long run, we want to introduce CI/CD automation that does something along these lines:
- read snapshot-hashes.yml to get a list of supported languages and latest snapshot dates
- detect new snapshot for a language (eg. at https://wiki.kiwix.org/wiki/Wikipedia_in_all_languages)
- build the IPFS mirror and pin it to a dedicated IPFS cluster, so the initial source is available
- open a PR against master with the new CID
A maintainer would then review the PR and merge it.
Updating the manifest in master would trigger an update of the DNSLink under <lang>.wikipedia-on-ipfs.org, propagating the change to the collaborative cluster, etc.
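The detection step in the list above could be sketched roughly as follows. This is only an illustration: the layout of snapshot-hashes.yml is not described in this issue, so the sketch assumes it has already been parsed into a plain mapping of language code to snapshot date, and `languages_needing_update` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def parse_snapshot_date(text: str) -> Tuple[int, ...]:
    """Turn 'YYYY-MM' or 'YYYY-MM-DD' into a tuple of ints that compares correctly."""
    return tuple(int(part) for part in text.split("-"))


def languages_needing_update(known: Dict[str, str],
                             available: Dict[str, str]) -> Dict[str, str]:
    """Return {lang: newer_date} for every supported language with a newer snapshot.

    `known` is assumed to come from snapshot-hashes.yml (hypothetical shape),
    `available` from whatever upstream listing is used to detect snapshots.
    Languages that are not in `known` are ignored, since they are not supported yet.
    """
    outdated = {}
    for lang, known_date in known.items():
        latest = available.get(lang)
        if latest is None:
            continue  # upstream listing has nothing for this language
        if parse_snapshot_date(latest) > parse_snapshot_date(known_date):
            outdated[lang] = latest
    return outdated


if __name__ == "__main__":
    # Illustrative values only; "2019-08" is made up for the example.
    known = {"en": "2018-10", "tr": "2019-06"}
    available = {"en": "2019-08", "tr": "2019-06"}
    print(languages_needing_update(known, available))
```

Each entry returned here would correspond to one mirror build and one PR in the flow described above.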
@lidel For the updates, we have started to advertise and use our OPDS feed (which works like an Atom feed). I would recommend using that in the future. See https://wiki.kiwix.org/wiki/OPDS (still in beta).
@kelson42 that sounds very useful! What would be a valid query to return the latest snapshot of the English or Turkish wiki?
Tried https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia but it points at old snapshot: wikipedia_en_wp1-0.8_orig_2010-12.zim
@lidel This feed delivers the most recent ZIM files... but a few of them are simply not newly generated. Let me know if you find a recent file which is not in it.
@kelson42 I think things like https://github.com/kiwix/kiwix-tools/issues/231 and https://github.com/kiwix/kiwix-tools/issues/316 need to land before we can use OPDS feed.
Right now, I was unable to come up with filters to get the latest English wikipedia with pictures and without video (wikipedia_en_all_novid)
- Example 1:
- https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_novid_2018-10.zim exists
- https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia&count=100 returns one result:
wikipedia_en_wp1-0.8_orig_2010-12
- Example 2:
- https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_novid_2019-06.zim exists
- https://library.kiwix.org/catalog/search?lang=tr&tag=wikipedia&count=100 returns no results
Looking at https://download.kiwix.org/zim/wikipedia/ directly sounds like a more robust solution at the moment.
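As a rough illustration of what reading that directory listing could look like, the sketch below picks the newest snapshot for one language from the listing's HTML. It assumes file names follow the `wikipedia_<lang>_<flavour>_<YYYY-MM>.zim` pattern seen in the examples above; `latest_snapshot` is a hypothetical helper, and fetching the page is left out on purpose.

```python
import re
from typing import Optional


def latest_snapshot(listing_html: str, lang: str,
                    flavour: str = "all_novid") -> Optional[str]:
    """Return the newest matching ZIM file name found in a directory listing.

    Assumes names like wikipedia_en_all_novid_2018-10.zim. The YYYY-MM suffix
    sorts correctly as a plain string, so max() finds the newest one.
    """
    pattern = re.compile(
        r"wikipedia_%s_%s_(\d{4}-\d{2})\.zim" % (re.escape(lang), re.escape(flavour))
    )
    # A name usually appears twice per row (href and link text), so deduplicate.
    dates = set(pattern.findall(listing_html))
    if not dates:
        return None
    return "wikipedia_%s_%s_%s.zim" % (lang, flavour, max(dates))


if __name__ == "__main__":
    # Tiny stand-in for the real listing; the 2018-08 entry is made up.
    sample = (
        '<a href="wikipedia_en_all_novid_2018-08.zim">wikipedia_en_all_novid_2018-08.zim</a>\n'
        '<a href="wikipedia_en_all_novid_2018-10.zim">wikipedia_en_all_novid_2018-10.zim</a>\n'
    )
    print(latest_snapshot(sample, "en"))
```

Matching on the file name rather than on the HTML structure keeps the sketch independent of how the listing page is rendered.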
> Right now, I was unable to come up with filters to get the latest English wikipedia with pictures and without video (wikipedia_en_all_novid)
In my solution I'm using a dynamic parser, which should solve that
https://github.com/ipfs/distributed-wikipedia-mirror/pull/40/files#diff-31235a619c2d46324cca9e5429d49b3cR106-R132
@lidel Looks like you have pretty well identified what needs to be done. An alternative would be to rely on https://download.kiwix.org/library/library_zim.xml (it is not dynamic like the OPDS feed, but easier to parse than HTML)... and more robust.
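A minimal sketch of that alternative, for illustration only. The schema of library_zim.xml is not shown in this thread, so the code assumes the file contains `<book>` elements that each carry a `url` attribute ending in a ZIM file name; `latest_from_library` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional


def latest_from_library(xml_text: str, prefix: str) -> Optional[str]:
    """Return the url of the newest book whose file name starts with `prefix`.

    Assumes <book url="..."> elements (an assumption about the file layout)
    and file names of the form <prefix>_<YYYY-MM>.zim.
    """
    pattern = re.compile(re.escape(prefix) + r"_(\d{4}-\d{2})\.zim")
    best_date, best_url = None, None
    for book in ET.fromstring(xml_text).iter("book"):
        url = book.get("url", "")
        match = pattern.search(url)
        # YYYY-MM compares correctly as a string, so keep the largest one.
        if match and (best_date is None or match.group(1) > best_date):
            best_date, best_url = match.group(1), url
    return best_url


if __name__ == "__main__":
    # Made-up two-entry library; real files may carry many more attributes.
    sample = """<library>
      <book url="https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_novid_2019-06.zim"/>
      <book url="https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_novid_2018-10.zim"/>
    </library>"""
    print(latest_from_library(sample, "wikipedia_tr_all_novid"))
```

Compared with scraping the HTML listing, this relies on an XML parser for structure and only uses a regular expression for the date suffix.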
> @kelson42 that sounds very useful! What would be a valid query to return the latest snapshot of the English or Turkish wiki?
> Tried https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia but it points at old snapshot: wikipedia_en_wp1-0.8_orig_2010-12.zim
We need to be working off MWDumper.pl and the XML bz2 dataset from Wikipedia ... I will do an export to static HTML and collect the required code again; it's "known working".
I'd like to see more functionality here, we need "search and editing". Afaik there is not yet a good marriage of git or wiki and IPFS and it should be core to ... us.