Add support for categories scraping

Open TheNetStriker opened this issue 8 months ago • 11 comments

I just noticed that since version 1.14.0 the --getCategories parameter is not accepted anymore.

My guess is that this was removed accidentally in this commit: https://github.com/openzim/mwoffliner/commit/cad7776adeece69d57da5f44f471cead1b060003

I tested this by adding the line back directly in the parameterList.js file, and then it worked. So the functionality is still in there; only the parameter was removed.

Can you please add this parameter again?

TheNetStriker avatar Apr 16 '25 10:04 TheNetStriker

@kelson42 what is your point of view? Having a WIP feature in a released product is kind of weird. But is it close to being OK?

benoit74 avatar Apr 16 '25 12:04 benoit74

@benoit74 At the moment the getCategories function just converts content inside the category pages to html so they can be found and viewed using the search function.

Sub categories are not displayed at the moment and also the links to the category pages do not work because all the links point to /A/page, but the categories are in /U/page.

I always wanted this to work because I work a lot with category pages in my wiki. Sadly this never got implemented. I could take a look at this, implement those functions myself, and create a pull request for it.

TheNetStriker avatar Apr 16 '25 13:04 TheNetStriker

@benoit74 I support the idea that we should not expose the command-line option while the feature is not working or complete, which is the case here.

kelson42 avatar Apr 16 '25 15:04 kelson42

@TheNetStriker you are welcome to propose a PR that makes this work correctly, which will allow us to add the flag back.

benoit74 avatar Apr 17 '25 08:04 benoit74

@benoit74 @kelson42 I've just tested this with the latest main branch, and for some reason the links to the category pages now work. Using zimdump, I noticed that the format of the ZIM file has changed. Maybe that is why this works now.

The only thing still missing for me is that the subcategories are not displayed. The problem seems to be that the subcategory HTML is not generated by the parse API, but by the MediaWiki skin: https://www.mediawiki.org/wiki/Topic:R831w797j05frr4c

I guess the only way to do this would be to query the category members using a separate API call, generate the HTML in code, and attach it to the category page. I will take a look to see whether I can implement this myself.

What else would be missing for this to be released? I used this feature for years now and never had any problems.

TheNetStriker avatar Apr 17 '25 09:04 TheNetStriker
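
For readers following along, the approach proposed in the comment above (a separate API call for the category members) could be sketched roughly as follows. This is a minimal illustration, not mwoffliner's actual code: it uses the MediaWiki Action API module `list=categorymembers` with its `cmcontinue` continuation token, while the function names `buildCategoryMembersUrl` and `collectCategoryMembers` and the injected `getJson` helper are hypothetical.

```typescript
// Hypothetical sketch, not taken from mwoffliner.
interface CategoryMember {
  pageid: number;
  ns: number;
  title: string;
  type?: "page" | "subcat" | "file";
}

interface CategoryMembersResponse {
  continue?: { cmcontinue: string };
  query?: { categorymembers: CategoryMember[] };
}

// Build the URL for one batch of members of a category.
// `apiUrl` is the wiki's api.php endpoint; `cmcontinue` is the token
// returned by the previous batch, if any.
function buildCategoryMembersUrl(apiUrl: string, categoryTitle: string, cmcontinue?: string): string {
  const params = new URLSearchParams({
    action: "query",
    list: "categorymembers",
    cmtitle: categoryTitle,
    cmtype: "subcat|page", // subcategories and regular pages, no files
    cmprop: "ids|title|type",
    cmlimit: "500",
    format: "json",
  });
  if (cmcontinue) {
    params.set("cmcontinue", cmcontinue);
  }
  return `${apiUrl}?${params.toString()}`;
}

// Follow the continuation token until every member has been collected.
// `getJson` is injected so the loop can be exercised without network access.
async function collectCategoryMembers(
  apiUrl: string,
  categoryTitle: string,
  getJson: (url: string) => Promise<unknown>,
): Promise<CategoryMember[]> {
  const members: CategoryMember[] = [];
  let cmcontinue: string | undefined;
  do {
    const url = buildCategoryMembersUrl(apiUrl, categoryTitle, cmcontinue);
    const data = (await getJson(url)) as CategoryMembersResponse;
    members.push(...(data.query?.categorymembers ?? []));
    cmcontinue = data.continue?.cmcontinue;
  } while (cmcontinue);
  return members;
}

// Possible usage (assumes a runtime with a global fetch, e.g. Node 18+):
// const getJson = (url: string) => fetch(url).then((r) => r.json());
// collectCategoryMembers("https://en.wikipedia.org/w/api.php",
//   "Category:Operating systems", getJson).then(console.log);
```

Injecting `getJson` is only a design choice for this sketch; a real scraper would presumably route the request through its existing download layer.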

@TheNetStriker thank you for testing this! Is it possible you share with us the URL of a wiki where there is significant category usage so that we can test as well?

There has indeed been a significant change in the ZIM format, and I now realize it probably helped solve some issue(s).

benoit74 avatar Apr 17 '25 10:04 benoit74

@benoit74 I'm using mwoffliner to create an offline version of my own private wiki, so I can't send you a link to this. But Wikipedia also has a lot of categories. (e.g. https://en.wikipedia.org/wiki/Category:Operating_systems)

I use the categories for e.g. applications so that when I write a new page for an application I can just assign the category and the application link appears automatically on the category page and I don't have to manually write links to every page.

TheNetStriker avatar Apr 17 '25 12:04 TheNetStriker

Yep, but I wanted to avoid going the wikipedia way since these wikis are really big. I will probably try to find another one.

benoit74 avatar Apr 17 '25 14:04 benoit74

@benoit74 I have now found the time to take a look at this, and I found a way to render the subcategories and page members of categories as HTML in the ZIM file.

For the moment I only uploaded the changes to my fork. You can take a look at the changes here: https://github.com/TheNetStriker/mwoffliner/commit/bc27213505534dbe83e943f7260a913b7f8f4a64

I've completely removed the old categories.ts file and replaced it with API calls that query the subcategories and page members. I also added a new setting, --getCategoryPageMembers, so that page members can optionally be added.

The HTML is rendered with the HTML templates that already existed in the project.

Can you please take a look at this? There are almost certainly things that could be improved or that I did wrong, because it is quite a complex project that has been revised multiple times.

TheNetStriker avatar Jun 02 '25 17:06 TheNetStriker
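
To illustrate the behaviour described in the comment above (subcategories always rendered, page members only when the optional setting is enabled), here is a rough sketch. It does not use the project's actual HTML templates or the code from the linked commit: the function names, the markup, and the relative link format are all assumptions made for illustration.

```typescript
// Hypothetical sketch, not the implementation from the linked commit.
interface CategoryMember {
  title: string;
  type: "page" | "subcat";
}

// Escape text before placing it inside HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render one titled list of links; an empty list renders nothing.
// The "./Title_with_underscores" link format is an assumption.
function renderMemberList(heading: string, members: CategoryMember[]): string {
  if (members.length === 0) {
    return "";
  }
  const items = members
    .map((m) => {
      const href = "./" + encodeURIComponent(m.title.replace(/ /g, "_"));
      return `<li><a href="${href}">${escapeHtml(m.title)}</a></li>`;
    })
    .join("");
  return `<h2>${escapeHtml(heading)}</h2><ul>${items}</ul>`;
}

// `includePageMembers` plays the role of the optional setting:
// subcategories are always listed, pages only when it is true.
function renderCategoryMembers(members: CategoryMember[], includePageMembers: boolean): string {
  const subcats = members.filter((m) => m.type === "subcat");
  const pages = includePageMembers ? members.filter((m) => m.type === "page") : [];
  return renderMemberList("Subcategories", subcats) + renderMemberList("Pages", pages);
}
```

The resulting fragment would then be appended to the category page's HTML before it is written to the ZIM file.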

Thanks a lot, please open a PR, this will make review significantly easier. I will try to review this week and advise.

benoit74 avatar Jun 03 '25 08:06 benoit74

@benoit74 I rebased the code with the latest version from the main branch and created a pull request: https://github.com/openzim/mwoffliner/pull/2334

TheNetStriker avatar Jun 03 '25 10:06 TheNetStriker