Make Gentoo wiki zim
From @Popolechien on August 27, 2018 7:14
https://wiki.gentoo.org/wiki/Main_Page
Licensed under CC-by-SA 3.0 (request from OTRS)
Copied from original issue: openzim/mwoffliner#365
We have it already http://library.kiwix.org/installgentoo_en_all_2018-07/
From @Popolechien on September 3, 2018 6:43
It's a different one, apparently. The user says: "It's a different wiki. installgentoo is a wiki that covers almost all GNU/Linux distributions and is based on the 'Install Gentoo' meme.
The Gentoo wiki (https://wiki.gentoo.org/wiki/Main_Page) is exclusively for Gentoo and is essential to installing Gentoo."
@ISNIT0 Fails like following
mwoffliner --mwUrl="https://wiki.gentoo.org/" --mwApiPath="/api.php" --adminEmail="[email protected]" --localParsoid --verbose
...
Getting redirects for article Overlay:Fkmclane...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Overlay%3AFkmclane&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AObject_libsandbox.so_from_LD_PRELOAD_cannot_be_preloaded&rawcontinue= (response code: 503).
Getting redirects for article GLEP:2...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A2&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3ACron_fails_to_load_in_root_crontab_with_message_ENTRYPOINT_FAILED&rawcontinue= (response code: 503).
Getting redirects for article GLEP:1...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A1&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AChrooting_returns_exec_format_error&rawcontinue= (response code: 503).
Getting redirects for article GLEP:48...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A48&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AOverriding_environment_variables_per_package&rawcontinue= (response code: 503).
Getting redirects for article GLEP:4...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A4&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3ANo_space_left_on_device_while_there_is_plenty_of_space_available&rawcontinue= (response code: 503).
Getting redirects for article GLEP:39...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A39&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AInserting_base_module_in_module_store_fails_with_duplicate_declaration&rawcontinue= (response code: 503).
Getting redirects for article GLEP:3...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A3&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3APortage_fails_to_label_files_because_setfiles_does_not_work_anymore&rawcontinue= (response code: 503).
Getting redirects for article GLEP:5...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A5&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AIs_swap_space_really_necessary&rawcontinue= (response code: 503).
Getting redirects for article GLEP:6...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A6&rawcontinue=...
Unable to download content [1] https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=0&format=json&rawcontinue=&gapcontinue=GNU_Emacs (response code: 503).
Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership (response code: 503).
Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AAll_available_memory_is_being_used&rawcontinue= (response code: 503).
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership (response code: 503).
Unable to download article ids: Error by retrieving https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership Error by retrieving https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership
From @ISNIT0 on September 18, 2018 9:22
The server is returning 503s. The URLs themselves work, but it seems the server is trying to avoid being scraped.
Thoughts? The response from the site is empty, with a valid 503 status.
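One common way to cope with intermittent 503s is to retry with exponential backoff. The sketch below only illustrates that idea; it is not how mwoffliner handles retries. The names `fetch_with_backoff`, `fetch`, `max_attempts` and `base_delay` are hypothetical, and `fetch` is assumed to be any callable returning a `(status, body)` pair.

```python
import time
from typing import Callable, Tuple

# Hypothetical helper, not part of mwoffliner: retry a request while the
# server answers 503, waiting longer between each attempt.
def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 503:
            # Anything other than 503 is returned to the caller as-is.
            return status, body
        if attempt < max_attempts - 1:
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
    # Every attempt returned 503: hand back the last response.
    return status, body
```

If the server is deliberately throttling scrapers, lowering the request rate is likely to matter more than retrying, so this is only a partial mitigation.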
I have moved the "gentoo" recipe, which was scraping installgentoo.com, to "installgentoo". This recipe no longer works because installgentoo has stopped providing its API. See https://farm.openzim.org/recipes/installgentoo/
I have created the recipe "gentoo" https://farm.openzim.org/recipes/gentoo for this request.
Done
Will it be updated to the current state?
@benoit74
Reopening, since the ZIM is still not available. It should probably be tried again, now that years have passed, once mwoffliner 1.14 is out (in the coming weeks).
Tried again. Unfortunately, this wiki's REST API returns errors instead of page HTML (it looks like it is the same on all pages): https://wiki.gentoo.org/rest.php/v1/page/Main_Page/html
{"messageTranslations":{"en":"Unable to fetch Parsoid HTML"},"httpCode":500,"httpReason":"Internal Server Error"}
Note that https://github.com/openzim/mwoffliner/issues/2127 (not yet planned) might help at some point.
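For anyone probing other wikis for the same problem, the error body quoted above can be recognised programmatically. This is a minimal sketch that assumes the error has the same JSON shape as the response shown here; the function name `parsoid_unavailable` is hypothetical and not part of mwoffliner.

```python
import json

# Hypothetical check: does a REST API response body look like the
# "Unable to fetch Parsoid HTML" error quoted in this thread?
def parsoid_unavailable(body: str) -> bool:
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (for example, real page HTML).
        return False
    if not isinstance(payload, dict):
        return False
    messages = payload.get("messageTranslations")
    if not isinstance(messages, dict):
        return False
    message = messages.get("en", "")
    return (
        payload.get("httpCode") == 500
        and isinstance(message, str)
        and "Parsoid" in message
    )
```

Passing the body of `https://wiki.gentoo.org/rest.php/v1/page/Main_Page/html` as it was returned at the time of this comment would be expected to yield `True`.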
I made a new attempt with the ActionParse end-point https://farm.openzim.org/pipeline/33769a10-11ba-40a8-b024-d2cfee2f85e8/debug
It seems to me that this MediaWiki is very old, with no support for either the Parsoid output or the Vector skin. @benoit74 Can you confirm?
https://github.com/openzim/mwoffliner/issues/2127 has been solved, but unfortunately this wiki neither uses nor supports the skins currently supported by the scraper (vector-legacy and vector-2022).
I've opened https://github.com/openzim/mwoffliner/issues/2337 for that matter.
The wiki also relies heavily on categories for navigation, which are not yet supported: https://github.com/openzim/mwoffliner/issues/2245. A PR is in progress, so this might be supported soon.
@benoit74 I confirm that the scrape now passes (with dev), see https://farm.openzim.org/recipes/gentoo, but we do indeed have a massive CSS/skin problem.
"massive" is maybe a bit strong, but we are indeed missing some CSS: https://github.com/openzim/mwoffliner/issues/2448
Renaming recipe to https://farm.openzim.org/recipes/wiki.gentoo.org_en_all