New ZIM: Manioc

Open barbayellow opened this issue 5 years ago • 7 comments

  • Website URL: http://www.manioc.org/
  • License: Not specified (open-access consultation and download of several tens of thousands of documents), so CC BY-NC-ND?
  • Desired ZIM Title: Manioc
  • Desired ZIM Description: The collaborative digital library Manioc offers open-access consultation and download of several tens of thousands of historical and contemporary documents (text, audio, images and video) concerning the territories and societies of the Caribbean, the Amazon, the Guiana Shield and related regions and areas of interest.
  • Desired ZIM Icon –png (URL or attach one): image
  • Language (ISO 639-3): fra
  • Desired Main Page (homepage): n/a
  • Is this a MediaWiki?: no
  • Articles List URL (mediawiki): n/a

barbayellow avatar May 11 '20 15:05 barbayellow

@barbayellow Might be doable with Zimit

kelson42 avatar Jul 05 '20 08:07 kelson42

We are impacted by https://github.com/openzim/warc2zim/issues/71, blocking this ticket.

kelson42 avatar Dec 01 '20 11:12 kelson42

@kelson42 Is this request still blocked?

JulienMoraliBSF avatar Apr 19 '22 07:04 JulienMoraliBSF

@JulienMoraliBSF We should have another look, but I'm pretty pessimistic. I don't remember specifically why we failed to scrape this website with zimit... but it was a hard problem.

kelson42 avatar Apr 19 '22 08:04 kelson42

@JulienMoraliBSF Found it! https://github.com/openzim/warc2zim/issues/71

kelson42 avatar Apr 19 '22 08:04 kelson42

@kelson42 Thanks for the update. Just to make sure I understand the conclusion: we can't create the ZIM, right?

JulienMoraliBSF avatar Apr 20 '22 14:04 JulienMoraliBSF

@JulienMoraliBSF It is not a definitive no, but there is a serious technical burden.

kelson42 avatar Apr 20 '22 18:04 kelson42

This seems OK to proceed with zimit2 now, except that we need to develop a custom behavior to load all pages of each resource.

See e.g. https://www.manioc.org/recherch/HASH256ee3a10e5a5515e58b9e: one needs to code a custom behavior that clicks every "next page" button so that all pages are loaded into the ZIM.
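To make the idea concrete, here is a minimal sketch of the click-through loop such a behavior would need. It is not tied to the actual crawler behavior API: `PaginatedViewer`, `loadAllPages`, `fakeViewer` and the `maxPages` cap are all hypothetical names introduced for illustration, and how the "next page" button is located on manioc.org is left out entirely.

```typescript
// Hypothetical abstraction over a paginated document viewer.
// This is NOT a real crawler API; a real behavior would implement
// these two methods with DOM queries and clicks.
interface PaginatedViewer {
  // Resolves to true if a "next page" control is present and usable.
  hasNext(): Promise<boolean>;
  // Clicks "next page" and resolves once the new page has loaded.
  clickNext(): Promise<void>;
}

// Clicks "next page" until no such button is left or maxPages is reached.
// Returns the number of pages visited, counting the one shown on arrival.
async function loadAllPages(
  viewer: PaginatedViewer,
  maxPages: number = 1000 // assumed safety cap against endless pagination
): Promise<number> {
  let visited = 1;
  while (visited < maxPages && (await viewer.hasNext())) {
    await viewer.clickNext();
    visited += 1;
  }
  return visited;
}

// In-memory stand-in for a document with a fixed number of pages,
// only there so the loop can be exercised without a browser.
function fakeViewer(totalPages: number): PaginatedViewer {
  let current = 1;
  return {
    hasNext: async () => current < totalPages,
    clickNext: async () => {
      current += 1;
    },
  };
}
```

The cap is there because a viewer whose "next" button never disappears would otherwise keep the crawler on one resource forever.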

This is probably the first thing to do: develop a custom behavior and try to ZIM only this single resource. Based on that, we will have a better idea of the feasibility, both technically and in terms of the time needed to crawl all pages. I'm a bit concerned that there are 10k+ resources, many of them books of hundreds of pages; I'm not sure web crawling this is really doable.
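For a rough sense of scale, a back-of-envelope calculation helps. Only the 10k resource count comes from the numbers above; the 200 pages per resource, 2 seconds per page and single worker are pure guesses to be replaced by measurements from the single-resource test, and `estimateCrawlDays` is a hypothetical helper.

```typescript
// Rough wall-clock estimate for a full crawl, in days.
// Assumes pages are crawled sequentially within each worker.
function estimateCrawlDays(
  resources: number,
  pagesPerResource: number,
  secondsPerPage: number,
  workers: number
): number {
  const totalSeconds = (resources * pagesPerResource * secondsPerPage) / workers;
  return totalSeconds / 86400; // seconds in a day
}

// 10k resources (from the thread); the other three values are assumptions.
const days = estimateCrawlDays(10000, 200, 2, 1);
console.log(days.toFixed(1)); // prints "46.3"
```

With these guessed inputs a single worker would need about a month and a half, which is why measuring the real per-page time on one resource comes first.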

benoit74 avatar Nov 02 '24 20:11 benoit74

Just because this is not documented here: we have created a ZIM, see https://farm.openzim.org/recipes/manioc.org. But it seems that it is still not perfect, if I read @benoit74 correctly.

kelson42 avatar Jul 06 '25 11:07 kelson42

ZIM is gone ...

benoit74 avatar Jul 06 '25 11:07 benoit74

@Popolechien @benoit74 What happened? I don't find any trace here in the repo about that deletion?!

kelson42 avatar Jul 06 '25 12:07 kelson42

All storage was lost when Hetzner deleted our machine, and we do not back up non-prod ZIMs.

benoit74 avatar Jul 06 '25 13:07 benoit74

So we could redo it?

kelson42 avatar Jul 06 '25 19:07 kelson42

> So we could redo it?

See https://github.com/openzim/zim-requests/issues/260#issuecomment-2453118292, probably yes, but not straightforward

benoit74 avatar Jul 07 '25 08:07 benoit74