Category HTML with sub categories and page members.

Open TheNetStriker opened this issue 7 months ago • 3 comments

Category pages are rendered with sub categories and page members. (See https://github.com/openzim/mwoffliner/issues/2245)

This change adds a new command line parameter, getCategoryPages. When it is enabled, category pages get an HTML section listing all the pages that are members of the category.

There are a few things that could be improved:

  • Titles for "Category", "Sub-Categories" and "Pages" could be localized.
  • I have only tested this with a small wiki. There could be problems when categories have a lot of pages or sub categories. Is there a way to paginate this?

TheNetStriker avatar Jun 03 '25 10:06 TheNetStriker

I just tested this. Here is what I found out:

  • "it did not generate category pages, all links are broken"
    The category pages were not generated because of the articleList filter. I've built this into the regular download of pages, so it only downloads what's in the filter. If you set the filter to Panini_Comics,Categoria:Panini_Comics it should also download the category. But the page links were still broken in your example. As far as I've seen the link itself should be correct, but the browser adds /C/ in front of the page link, and this causes the link to fail. Do you have an idea how to fix this? You can look at abstract.renderer.ts. I've just used the function encodeArticleIdForZimHtmlUrl(page.title) to generate the page links. Is there maybe a better function to generate the link?

  • "the collapse of category div on Panini_Comics page is really weird"
    I found out that this only happens when the argument --forceRender ActionParse is set. This seems to change something in the CSS classes so that the section-heading CSS class no longer works. I'm not familiar with this different renderer. Do you have an idea how this could be fixed?

  • "do you have an example where the API is returning both categories and subcategories properties?"
    To get the subcategories of a category you need to use the list=categorymembers API call. Here is an example that loads all categories, subcategories and pages: https://it.wikipedia.org/w/api.php?action=query&list=categorymembers&format=json&cmlimit=max&cmtitle=Categoria:Panini_Comics&cmprop=title|sortkeyprefix|type&prop=redirects|revisions|pageimages|coordinates|categories&rdlimit=max&rdnamespace=0|14&redirects=true&formatversion=2&titles=Panini_Comics&colimit=max&cllimit=max&clshow=!hidden

  • "why do you need two parameters getCategories and getCategoryPages? Why would someone want to retrieve categories but not category pages?"
    I've implemented the new parameter because there are many more pages assigned to categories than there are categories assigned to other categories. This way the pages can be excluded for really big wikis. The biggest category that I found on Wikipedia has 2,325,761 pages assigned to it. Here is the link: https://en.wikipedia.org/w/index.php?title=Category:All_stub_articles
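
For reference, the list=categorymembers call mentioned above can be sketched as follows. This is only an illustration, not code from the PR: buildCategoryMembersUrl and splitMembers are hypothetical helper names, and the sketch assumes the response is requested with cmprop including type, so that each member carries a type of "page", "subcat" or "file".

```typescript
// Hypothetical helpers for illustration only; they are not part of mwoffliner.

type CategoryMember = { title: string; type: "page" | "subcat" | "file" };

// Builds a list=categorymembers query URL for one category.
// The optional cmcontinue token requests the next batch of members.
function buildCategoryMembersUrl(apiUrl: string, categoryTitle: string, cmcontinue?: string): string {
  const params: Record<string, string> = {
    action: "query",
    list: "categorymembers",
    format: "json",
    formatversion: "2",
    cmtitle: categoryTitle,
    cmprop: "title|sortkeyprefix|type",
    cmlimit: "max",
  };
  if (cmcontinue !== undefined) {
    params.cmcontinue = cmcontinue;
  }
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join("&");
  return `${apiUrl}?${query}`;
}

// Separates sub categories from regular pages using the "type" field.
// Members of type "file" are ignored in this sketch.
function splitMembers(members: CategoryMember[]): { subCategories: string[]; pages: string[] } {
  return {
    subCategories: members.filter((m) => m.type === "subcat").map((m) => m.title),
    pages: members.filter((m) => m.type === "page").map((m) => m.title),
  };
}
```

Calling buildCategoryMembersUrl("https://it.wikipedia.org/w/api.php", "Categoria:Panini_Comics") produces a URL equivalent to the categorymembers part of the example link, without the extra prop=... parameters.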

TheNetStriker avatar Jun 03 '25 16:06 TheNetStriker

@TheNetStriker I'm sorry, but I don't have enough bandwidth this week to analyze this PR. I will do my best next week.

benoit74 avatar Jun 13 '25 14:06 benoit74

I'm sorry, but I still don't get what you are trying to achieve with the getCategoryPages setting. From what I read, this parameter is used to call setArticlePageMembers, which seems to be used to retrieve the members of every category. But this information is already retrieved without this parameter (the pages attribute in QueryRet, from my understanding). Can you help me understand the difference?

Command I used:

docker run --rm --name mwoffliner_test -v $PWD/output:/output local-mwoffliner mwoffliner [email protected] --customZimDescription="Test" --customZimTitle=Test --filenamePrefix=tests_en_mwoffliner --format=nopic --mwUrl=https://bm.wikipedia.org --outputDirectory=/output --publisher=openZIM --verbose=log --webp --forceRender ActionParse --getCategories --getCategoryPages

I also don't get your point regarding pagination. To me the calls you've added are already paginated.
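
As background on what "paginated" means here: the MediaWiki Action API returns a continue object whenever more results are available, and the client repeats the request with those values merged in. A minimal sketch of that merge step follows; nextParams is a hypothetical name and is not how mwoffliner itself is structured.

```typescript
// Hypothetical sketch of MediaWiki API continuation; not taken from mwoffliner.

type Params = Record<string, string>;

// Only the part of an API response that matters for continuation.
type ApiResponse = { continue?: Params };

// Returns the parameters for the next request, or undefined when the
// response has no "continue" object, meaning the result set is complete.
function nextParams(base: Params, response: ApiResponse): Params | undefined {
  if (!response.continue) {
    return undefined;
  }
  // Merge every continuation value (for example cmcontinue) into the base query.
  return { ...base, ...response.continue };
}
```

A caller would keep issuing requests with the returned parameters until nextParams yields undefined, so a category with millions of members is fetched in batches rather than in one response.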

The UI issues you are seeing are linked to the fact that these categories must be adapted to the skin used, since we retrieve the CSS linked to this skin.

The reason it did not generate the category pages is because of the articleList filter.

OK, we can maybe live with this limitation for now. I don't know. But this must be documented somewhere. And it is clearly not convenient, since it basically means that we cannot use --getCategories with --articleList: it is probably too complex for most users to find all the categories to add to the articleList. Plus it is a bit deceptive to have a parameter named getCategories which doesn't get categories unless you've listed them in articleList...

Or maybe this is just an indication that the current approach is wrong: in a naive approach, I would not have fetched categories the way we fetch articles, but rather walked the tree up from the categories associated with the articles we have explored, all the way to the top. But this is maybe naive / not working.
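
To make the "walk the tree up" idea concrete, here is a minimal sketch under stated assumptions: collectAncestorCategories and parentsOf are hypothetical names, and parentsOf stands in for whatever lookup would return the parent categories of a given category. A visited set is used because category graphs can contain cycles.

```typescript
// Hypothetical sketch of walking up the category graph; not code from the PR.

// Returns the parent categories of a category title (assumed to be provided
// by the caller, for example from already downloaded category data).
type ParentLookup = (categoryTitle: string) => string[];

// Starting from the categories attached to the scraped articles, collects
// every category reachable by repeatedly following parent links.
function collectAncestorCategories(articleCategories: string[], parentsOf: ParentLookup): Set<string> {
  const seen = new Set<string>();
  const queue = articleCategories.slice();
  while (queue.length > 0) {
    const current = queue.shift() as string;
    if (seen.has(current)) {
      continue; // already handled, which also protects against cycles
    }
    seen.add(current);
    for (const parent of parentsOf(current)) {
      if (!seen.has(parent)) {
        queue.push(parent);
      }
    }
  }
  return seen;
}
```

With this shape, only categories that are actually connected to a selected article end up in the result, so an articleList would not need to name its categories explicitly.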

benoit74 avatar Jun 19 '25 15:06 benoit74

I guess we cannot move this PR forward; should we close it?

kelson42 avatar Jul 20 '25 11:07 kelson42

Let's close it (for now at least). It cannot be merged as is, it probably needs a significant redesign, and we are missing feedback from the contributor.

benoit74 avatar Jul 20 '25 13:07 benoit74