wikiextractor

Extract page categories

Open zaemyung opened this issue 4 years ago • 4 comments

  • When --extract_categories is given, page categories are extracted and saved along with title and text.
  • Note that sortkeys are dropped.
    • e.g. [[Category:Category name|Sortkey]] -> Only Category name is extracted.
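To illustrate the sortkey behaviour described above, here is a minimal sketch. It is not the code from this pull request: the pattern `CATEGORY_RE` and the helper `extract_categories` are hypothetical names, and the sketch assumes the English "Category" prefix.

```python
import re

# Hypothetical pattern, not the one used in the pull request.
# Group 1 captures the category name; an optional "|Sortkey" part
# is matched but not captured, so it is dropped from the result.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def extract_categories(wikitext):
    """Return category names found in wikitext, with sortkeys dropped."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


sample = "Some text.\n[[Category:Physics|Quantum]]\n[[Category:Science]]"
print(extract_categories(sample))  # ['Physics', 'Science']
```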

zaemyung avatar Oct 03 '19 07:10 zaemyung

  • Just realized that this has to be adapted to different languages as "Category" is English-specific.
  • Now added --category_surface to specify how "Category" is written in the language of the wiki being processed.
  • For example, for Russian:
    python3 wikiextractor/WikiExtractor.py \
        -b 50M --json --links --sections --lists --filter_disambig_pages --quiet \
        --output ${wiki_dumps_dir}/ruwiki-20190920 \
        --extract_categories --category_surface Категория \
        ${wiki_dumps_dir}/ruwiki-20190920-pages-articles.xml.bz2
    
  • When choosing the value for --category_surface, open a page in edit mode to see how the category prefix is actually written in the wiki markup.
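One way a language-specific surface form could be plugged into the matching is sketched below. This is an assumption about the general approach, not the pull request's actual code; `build_category_re` is a hypothetical helper whose parameter mirrors the --category_surface flag.

```python
import re


def build_category_re(category_surface="Category"):
    """Hypothetical helper: compile a category-link pattern for one surface form."""
    # re.escape guards against surface forms that contain regex metacharacters.
    prefix = re.escape(category_surface)
    return re.compile(r"\[\[" + prefix + r":([^\]|]+)(?:\|[^\]]*)?\]\]")


ru_re = build_category_re("Категория")
# Only the Russian-prefixed link matches; the English one is ignored.
print(ru_re.findall("[[Категория:Физика|Квант]] [[Category:Physics]]"))  # ['Физика']
```

Passing the surface form as a parameter keeps the pattern in one place, so a different wiki only needs a different argument.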

zaemyung avatar Oct 03 '19 13:10 zaemyung

Outstanding feature. Thanks!

Alessi0X avatar Mar 07 '20 13:03 Alessi0X

Yes, this is precisely what I needed for my use case. Huge thanks to you for implementing and publishing this.

I have a question about the --category_surface flag you added. I noticed that the category prefix appears to be hardcoded in other places, such as at line 2825:

2825                 if line.lstrip().startswith('[[Category:'):

and in the definition of catRE regexp. Does it mean that category filtering will not work with non-English wikis?

bt2901 avatar Mar 12 '20 20:03 bt2901

@bt2901 Ah, I missed that the parser already extracts categories, albeit only for English, to decide whether or not to include a page in the extraction. I could have extended that part to handle non-English categories and saved them as well. So, yeah, I don't think page filtering would work for non-English wikis. I should probably fix this.

Nevertheless, the current code (this pull request) should work as is to extract and save non-English categories, since that happens during the actual extraction of pages. I have used the code on English, German, Russian, and Korean wiki dumps. But make sure to open a page of each wiki in edit mode, check how the category prefix is written in its surface form, and set --category_surface accordingly.
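For the page-filtering check quoted earlier, a possible direction is to pass the surface form in rather than hardcode the English prefix. The sketch below only illustrates that idea; `is_category_line` is a hypothetical helper and does not exist in WikiExtractor.py.

```python
def is_category_line(line, category_surface="Category"):
    """Hypothetical helper: True if the line starts with a category link
    written with the given surface form."""
    return line.lstrip().startswith("[[" + category_surface + ":")


print(is_category_line("  [[Категория:Физика]]", "Категория"))  # True
print(is_category_line("  [[Категория:Физика]]"))  # False
```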

zaemyung avatar Mar 13 '20 02:03 zaemyung