distributed-wikipedia-mirror

Automate snapshot updates

Open lidel opened this issue 6 years ago • 7 comments

This is a placeholder issue. It will be updated with more details once we gain a better understanding of what is needed here.

In the long run, we want to introduce CI/CD automation that does something along these lines:

  • read snapshot-hashes.yml to get a list of supported languages and latest snapshot dates
  • detect a new snapshot for a language (e.g. at https://wiki.kiwix.org/wiki/Wikipedia_in_all_languages)
  • build an IPFS mirror and pin it to a dedicated IPFS cluster, so that an initial source is available
  • open a PR against master with the new CID

A maintainer would then review the PR and merge it. Updating the manifest in master would trigger an update of the DNSLink under <lang>.wikipedia-on-ipfs.org, propagating the change to the collaborative cluster, etc.
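
The detection step in the list above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the project's implementation: it assumes the manifest has already been loaded into a mapping from language code to the date of the currently mirrored snapshot (the real layout of snapshot-hashes.yml is not shown in this issue), and that upstream files are named like wikipedia_<lang>_all_novid_<YYYY-MM>.zim. The helper name `find_newer_snapshots` is hypothetical.

```python
import re

# Assumed file name pattern, e.g. wikipedia_tr_all_novid_2019-06.zim
ZIM_NAME = re.compile(
    r"^wikipedia_(?P<lang>[a-z-]+)_all_novid_(?P<date>\d{4}-\d{2})\.zim$"
)


def find_newer_snapshots(mirrored, available_files):
    """Return {lang: file_name} for languages with a snapshot newer than the mirror.

    mirrored: dict mapping language code -> 'YYYY-MM' date of the current mirror
              (hypothetical shape; the real snapshot-hashes.yml may differ).
    available_files: iterable of ZIM file names found upstream.
    """
    newest = {}
    for name in available_files:
        match = ZIM_NAME.match(name)
        if not match:
            continue  # not a snapshot flavour this sketch tracks
        lang, date = match["lang"], match["date"]
        if lang not in mirrored:
            continue  # language is not listed in the manifest
        # 'YYYY-MM' strings sort chronologically, so plain comparison works.
        if date > mirrored[lang] and date > newest.get(lang, ("", ""))[0]:
            newest[lang] = (date, name)
    return {lang: name for lang, (date, name) in newest.items()}
```

Each file returned here would be the input to the build, pin and PR steps, which are left out of the sketch.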

lidel, Sep 09 '19 11:09

@lidel For the updates, we have started to advertise and use our OPDS feed (which works like an Atom feed). I would recommend using that in the future. See https://wiki.kiwix.org/wiki/OPDS (still in beta).
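
Since the feed is described as working like an Atom feed, reading it could look something like the sketch below. It is only an illustration: the sample document is hypothetical and shaped like a generic Atom feed, not real Kiwix output, and the actual OPDS entries may carry different or additional fields. `latest_updates` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample shaped like a generic Atom feed; not real Kiwix output.
SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example catalog</title>
  <entry>
    <title>wikipedia_tr_all_novid</title>
    <updated>2019-06-01T00:00:00Z</updated>
  </entry>
  <entry>
    <title>wikipedia_tr_all_novid</title>
    <updated>2019-03-01T00:00:00Z</updated>
  </entry>
  <entry>
    <title>wikipedia_en_all_novid</title>
    <updated>2018-10-01T00:00:00Z</updated>
  </entry>
</feed>
"""


def latest_updates(feed_xml):
    """Map each entry title to its most recent <updated> timestamp."""
    root = ET.fromstring(feed_xml)
    latest = {}
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        # Timestamps in the same ISO 8601 format sort chronologically as strings.
        if updated > latest.get(title, ""):
            latest[title] = updated
    return latest
```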

kelson42, Sep 09 '19 11:09

@kelson42 that sounds very useful! What would be a valid query to return the latest snapshot of the English or Turkish wiki?

I tried https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia but it points at an old snapshot: wikipedia_en_wp1-0.8_orig_2010-12.zim

lidel, Sep 09 '19 11:09

@lidel This feed delivers the most recent ZIM files... but a few of them have simply not been regenerated recently. Let me know if you find a recent file which is not in it.

kelson42, Sep 09 '19 11:09

@kelson42 I think things like https://github.com/kiwix/kiwix-tools/issues/231 and https://github.com/kiwix/kiwix-tools/issues/316 need to land before we can use the OPDS feed.

Right now, I was unable to come up with filters to get the latest English wikipedia with pictures and without video (wikipedia_en_all_novid)

  • Example 1:
    • https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_novid_2018-10.zim exists
    • https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia&count=100 returns one result: wikipedia_en_wp1-0.8_orig_2010-12
  • Example 2:
    • https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_novid_2019-06.zim exists
    • https://library.kiwix.org/catalog/search?lang=tr&tag=wikipedia&count=100 returns no results

Looking at https://download.kiwix.org/zim/wikipedia/ directly sounds like a more robust solution at the moment.
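
Reading the directory listing directly could be sketched like this. It is a rough sketch that assumes the listing is a plain HTML index whose links are the bare file names; the sample page below is hypothetical, and `latest_zim` is a made-up helper name.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def latest_zim(listing_html, prefix):
    """Return the newest '<prefix>_<YYYY-MM>.zim' link in the listing, or None."""
    collector = LinkCollector()
    collector.feed(listing_html)
    pattern = re.compile(re.escape(prefix) + r"_\d{4}-\d{2}\.zim")
    candidates = [h for h in collector.hrefs if pattern.fullmatch(h)]
    # Same prefix plus a 'YYYY-MM' suffix, so the lexicographic max is the newest.
    return max(candidates, default=None)


# Hypothetical listing; the real index page may be laid out differently.
SAMPLE_LISTING = """
<html><body>
<a href="../">../</a>
<a href="wikipedia_en_all_novid_2018-08.zim">wikipedia_en_all_novid_2018-08.zim</a>
<a href="wikipedia_en_all_novid_2018-10.zim">wikipedia_en_all_novid_2018-10.zim</a>
<a href="wikipedia_en_wp1-0.8_orig_2010-12.zim">wikipedia_en_wp1-0.8_orig_2010-12.zim</a>
<a href="wikipedia_tr_all_novid_2019-06.zim">wikipedia_tr_all_novid_2019-06.zim</a>
</body></html>
"""
```

With the sample above, `latest_zim(SAMPLE_LISTING, "wikipedia_en_all_novid")` picks the 2018-10 file and ignores the old wp1 snapshot.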

lidel, Sep 10 '19 12:09

> Right now, I was unable to come up with filters to get the latest English wikipedia with pictures and without video (wikipedia_en_all_novid)

In my solution I'm using a dynamic parser, which should solve that

https://github.com/ipfs/distributed-wikipedia-mirror/pull/40/files#diff-31235a619c2d46324cca9e5429d49b3cR106-R132

mkg20001, Sep 10 '19 17:09

@lidel Looks like you have pretty well identified what needs to be done. An alternative would be to rely on https://download.kiwix.org/library/library_zim.xml (it is not dynamic like the OPDS feed, but it is easier to parse than HTML)... and more robust.
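
Parsing that XML file might look roughly like the sketch below. Treat it as an assumption-laden illustration: it assumes the file is a list of <book> elements carrying `url` and `date` attributes, which is not confirmed in this thread, and the older entry in the sample is invented. `latest_book_url` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from posixpath import basename
from urllib.parse import urlparse

# Hypothetical sample; the `url` and `date` attributes are assumed, not confirmed here.
SAMPLE_LIBRARY = """\
<library>
  <book url="https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_novid_2019-03.zim"
        date="2019-03-01" />
  <book url="https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_novid_2019-06.zim"
        date="2019-06-01" />
  <book url="https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_novid_2018-10.zim"
        date="2018-10-01" />
</library>
"""


def latest_book_url(library_xml, name_prefix):
    """Return the URL of the newest book whose file name starts with name_prefix."""
    root = ET.fromstring(library_xml)
    best = None
    for book in root.iter("book"):
        url = book.get("url", "")
        date = book.get("date", "")
        file_name = basename(urlparse(url).path)
        if not file_name.startswith(name_prefix + "_"):
            continue
        # 'YYYY-MM-DD' strings sort chronologically as plain strings.
        if best is None or date > best[0]:
            best = (date, url)
    return best[1] if best else None
```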

kelson42, Sep 10 '19 17:09

> @kelson42 that sounds very useful! What would be a valid query to return the latest snapshot of the English or Turkish wiki?
>
> I tried https://library.kiwix.org/catalog/search?lang=en&tag=wikipedia but it points at an old snapshot: wikipedia_en_wp1-0.8_orig_2010-12.zim

We need to be working off MWDumper.pl and the XML bz2 dataset from Wikipedia ... I will do an export to static HTML and collect the required code again; it's "known working".

I'd like to see more functionality here; we need "search and editing". AFAIK there is not yet a good marriage of git or wiki and IPFS, and it should be core to ... us.

alzinging, Dec 15 '22 07:12