
Make Gentoo wiki zim

Open kelson42 opened this issue 7 years ago • 10 comments

From @Popolechien on August 27, 2018 7:14

https://wiki.gentoo.org/wiki/Main_Page

Licensed under CC-by-SA 3.0 (request from OTRS)

Copied from original issue: openzim/mwoffliner#365

kelson42 avatar Sep 18 '18 09:09 kelson42

We have it already http://library.kiwix.org/installgentoo_en_all_2018-07/

kelson42 avatar Sep 18 '18 09:09 kelson42

From @Popolechien on September 3, 2018 6:43

It's a different one, apparently. The user says: "It's a different wiki, installgentoo is a wiki that covers almost all GNU/Linux distributions that's based off of the "Install Gentoo" meme.

The Gentoo wiki (https://wiki.gentoo.org/wiki/Main_Page), is exclusively for Gentoo and is essential to installing Gentoo."

kelson42 avatar Sep 18 '18 09:09 kelson42

@ISNIT0 It fails as follows:

mwoffliner --mwUrl="https://wiki.gentoo.org/" --mwApiPath="/api.php" --adminEmail="[email protected]" --localParsoid --verbose

...

Getting redirects for article Overlay:Fkmclane...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Overlay%3AFkmclane&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AObject_libsandbox.so_from_LD_PRELOAD_cannot_be_preloaded&rawcontinue= (response code: 503).
Getting redirects for article GLEP:2...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A2&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3ACron_fails_to_load_in_root_crontab_with_message_ENTRYPOINT_FAILED&rawcontinue= (response code: 503).
Getting redirects for article GLEP:1...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A1&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AChrooting_returns_exec_format_error&rawcontinue= (response code: 503).
Getting redirects for article GLEP:48...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A48&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AOverriding_environment_variables_per_package&rawcontinue= (response code: 503).
Getting redirects for article GLEP:4...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A4&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3ANo_space_left_on_device_while_there_is_plenty_of_space_available&rawcontinue= (response code: 503).
Getting redirects for article GLEP:39...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A39&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AInserting_base_module_in_module_store_fails_with_duplicate_declaration&rawcontinue= (response code: 503).
Getting redirects for article GLEP:3...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A3&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3APortage_fails_to_label_files_because_setfiles_does_not_work_anymore&rawcontinue= (response code: 503).
Getting redirects for article GLEP:5...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A5&rawcontinue=...
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AIs_swap_space_really_necessary&rawcontinue= (response code: 503).
Getting redirects for article GLEP:6...
Downloading https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=GLEP%3A6&rawcontinue=...
Unable to download content [1] https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=0&format=json&rawcontinue=&gapcontinue=GNU_Emacs (response code: 503).
Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership (response code: 503).
Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&prop=redirects&format=json&rdprop=title&rdlimit=max&titles=Knowledge_Base%3AAll_available_memory_is_being_used&rawcontinue= (response code: 503).
Absolutely unable to retrieve async. URL: Unable to download content [3] https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership (response code: 503).
Unable to download article ids: Error by retrieving https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership Error by retrieving https://wiki.gentoo.org//api.php?action=query&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&colimit=max&prop=revisions|coordinates&gapnamespace=510&format=json&rawcontinue=&gapcontinue=Portage%2FMembership
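
To see at a glance which response codes dominate a log like the one above, the failure lines can be tallied. This is a minimal sketch that assumes the exact log wording shown here; `summarize_failures` is a hypothetical helper, not part of mwoffliner.

```python
import re
from collections import Counter

# Matches failure lines of the form seen in the log above:
#   Unable to download content [3] <url> (response code: 503).
FAILURE_RE = re.compile(
    r"Unable to download content \[(\d+)\] (\S+) \(response code: (\d+)\)"
)


def summarize_failures(log_text):
    """Count failed downloads per HTTP response code (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts
```

Run against the excerpt above, every matched line would land under 503, which is consistent with a server-side refusal rather than broken URLs.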

kelson42 avatar Sep 18 '18 09:09 kelson42

From @ISNIT0 on September 18, 2018 9:22

The server is returning 503s. The URLs themselves work, but it seems like the server is trying to avoid being scraped.

Thoughts? The response from the site is empty and a valid 503.
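
If the 503s come from rate limiting rather than a hard block, spacing out the retries might help. Below is a minimal sketch of exponential backoff. It is not mwoffliner's actual retry logic, and `fetch_with_backoff` and its parameters are hypothetical names. The fetcher is passed in, so any HTTP client can be plugged in.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-503 status or attempts run out.

    `fetch` is any callable returning (status_code, body). It is injected so
    the retry logic can be exercised without touching the network.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status != 503:
            return status, body
        if attempt < max_attempts:
            sleep(delay)  # wait before the next try
            delay *= 2    # double the wait each time: 1s, 2s, 4s, ...
    # Every attempt returned 503; hand the last response back to the caller.
    return status, body
```

If the server is deliberately refusing scrapers, no amount of waiting will change the outcome, so this only helps in the rate-limiting case.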

kelson42 avatar Sep 18 '18 09:09 kelson42

I have moved the "gentoo" recipe, which was scraping installgentoo.com, to "installgentoo". This recipe does not work anymore because installgentoo has stopped providing its API. See https://farm.openzim.org/recipes/installgentoo/

I have created the recipe "gentoo" https://farm.openzim.org/recipes/gentoo for this request.

kelson42 avatar Apr 18 '20 08:04 kelson42

Done

kelson42 avatar Dec 01 '20 18:12 kelson42

Will it be updated to the current state?

vitaly-zdanevich avatar Jun 18 '24 07:06 vitaly-zdanevich

I have created the recipe "gentoo" https://farm.openzim.org/recipes/gentoo for this request.

image

vitaly-zdanevich avatar Jun 18 '24 07:06 vitaly-zdanevich

@benoit74

vitaly-zdanevich avatar Jun 30 '24 17:06 vitaly-zdanevich

Reopening, since the ZIM is still not available. Now that years have passed, it is probably worth trying again once mwoffliner 1.14 is out (in the coming weeks).

benoit74 avatar Jul 01 '24 13:07 benoit74

Tried again. Unfortunately, the REST API of this wiki returns errors instead of page HTML (it looks like it is the same on all pages): https://wiki.gentoo.org/rest.php/v1/page/Main_Page/html

{"messageTranslations":{"en":"Unable to fetch Parsoid HTML"},"httpCode":500,"httpReason":"Internal Server Error"}

Note that https://github.com/openzim/mwoffliner/issues/2127 (not yet planned) might help at some point.
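
To check other pages for the same failure, the REST URL and the error payload quoted above can be handled with a small helper. This is a sketch under the assumption that errors always use the JSON shape shown above; `rest_html_url` and `describe_rest_error` are hypothetical names, and the HTTP request itself is left to the caller.

```python
import json
from urllib.parse import quote


def rest_html_url(base_url, title):
    """Build the REST URL for a page's HTML, following the URL pattern above."""
    return f"{base_url.rstrip('/')}/rest.php/v1/page/{quote(title, safe='')}/html"


def describe_rest_error(body):
    """Return a short description if `body` is a JSON error payload, else None."""
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # not JSON, so presumably real page HTML
    if not isinstance(payload, dict) or "httpCode" not in payload:
        return None
    message = payload.get("messageTranslations", {}).get("en", "")
    return f"{payload['httpCode']} {payload.get('httpReason', '')}: {message}".strip()
```

Fed the payload quoted above, `describe_rest_error` returns `500 Internal Server Error: Unable to fetch Parsoid HTML`. A return value of `None` only means the body is not an error in this shape, not that the HTML is usable.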

benoit74 avatar Jan 13 '25 13:01 benoit74

I made a new attempt with the ActionParse end-point https://farm.openzim.org/pipeline/33769a10-11ba-40a8-b024-d2cfee2f85e8/debug

It seems to me that this MediaWiki is very old, with no support for either the Parsoid output or the Vector skin. @benoit74 Can you confirm?

kelson42 avatar Jun 07 '25 07:06 kelson42

https://github.com/openzim/mwoffliner/issues/2127 has been solved, but unfortunately this wiki neither uses nor supports the skins currently supported by the scraper (vector-legacy and vector-2022).

I've opened https://github.com/openzim/mwoffliner/issues/2337 for that matter.

The wiki also relies heavily on categories for navigation, which are not yet supported: https://github.com/openzim/mwoffliner/issues/2245. We have a PR in progress, so this might be supported soon.

benoit74 avatar Jun 07 '25 07:06 benoit74

@benoit74 I confirm that the scrape passes now (with dev), see https://farm.openzim.org/recipes/gentoo, but we do indeed have a massive CSS/skin problem.

kelson42 avatar Jul 24 '25 19:07 kelson42

"massive" is maybe a bit strong, but we are indeed missing some CSS: https://github.com/openzim/mwoffliner/issues/2448

benoit74 avatar Jul 25 '25 07:07 benoit74

Renaming recipe to https://farm.openzim.org/recipes/wiki.gentoo.org_en_all

kelson42 avatar Jul 27 '25 09:07 kelson42