remove Hidden_categories

Open VladimirAlexiev opened this issue 10 years ago • 1 comments

Filter out maintenance (hidden) categories and don't emit them in the dataset. These categories are useful only to Wikipedia maintainers and are not useful for content consumers.

  • https://en.wikipedia.org/wiki/Category:Hidden_categories has 15737 members (subcats). That's about 1.5% of the total 1.1M cats in enwiki, but it's the proverbial spoon of dirt in a jar of honey.

Unfortunately DBpedia does not extract classification coming from templates (transclusion), see #378. Most hidden cats are marked in that way, so:

  • http://dbpedia.org/page/Category:Hidden_categories has only 7 subcats (skos:broader)

I think extracting from templates will be very hard to implement. Other possible sources:

SQL

All classifications are available with SQL, e.g. from http://quarry.wmflabs.org:

-- pages that are direct members of Category:Hidden_categories
SELECT page_title, page_id
FROM page
JOIN categorylinks ON cl_from = page_id
WHERE cl_to = 'Hidden_categories';

Quarry has a timeout of 10 minutes, so it isn't appropriate for large-scale querying. If you go with SQL, you'd probably have to make a local copy of the DB.
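To make the join concrete, here is a minimal Python sketch that runs the same query against a local database. Everything about the setup is an assumption for illustration: the real Wikipedia dumps are MySQL, so the in-memory SQLite tables below are only a stand-in holding the few columns the query touches, and the row values and the helper names `hidden_category_rows` and `demo_connection` are made up.

```python
import sqlite3


def hidden_category_rows(conn):
    """Return (page_title, page_id) for pages categorised directly under Hidden_categories."""
    return conn.execute(
        """
        SELECT page_title, page_id
        FROM page
        JOIN categorylinks ON cl_from = page_id
        WHERE cl_to = 'Hidden_categories'
        ORDER BY page_id
        """
    ).fetchall()


def demo_connection():
    """Build an in-memory stand-in with only the columns the query needs (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        INSERT INTO page VALUES
            (1, 'Example_maintenance_category'),
            (2, 'Example_topic_category');
        INSERT INTO categorylinks VALUES
            (1, 'Hidden_categories'),
            (2, 'Some_topic');
        """
    )
    return conn


if __name__ == "__main__":
    for title, page_id in hidden_category_rows(demo_connection()):
        print(page_id, title)
```

Against a real local copy, only the connection would change; the query itself is the one from the issue.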

Wikipedia API

  • Documentation: https://www.mediawiki.org/wiki/API:Categorymembers
  • Sandbox: https://www.mediawiki.org/wiki/Special:ApiSandbox
  • Example query: http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&format=json&cmtitle=Category%3AHidden%20categories&cmprop=title%7Ctype&cmtype=page%7Csubcat&cmlimit=100

You could try different return formats:

  • xml: human-readable, 1 line per result
  • json: js data struct
  • php=yaml: php data struct
  • txt: indented text showing data struct
  • dbg: indented text showing data struct
  • wddx: xml similar to SPARQL results

Notes

  • doesn't recurse into nested subcats (but all admin cats are direct members of Hidden_categories)
  • ns is always included in each result, so type (page vs subcat) is not needed
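Since the category has far more members than one request can return, a client has to page through the results. Below is a rough Python sketch of that loop, not a tested extractor. Assumptions: it asks only for `cmtype=subcat` with `cmprop=title` (narrower than the example URL above), uses `cmlimit=500`, follows the API's `continue` object for paging, and the function names and the User-Agent string are placeholders. The `fetch` parameter is injectable so the paging logic can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def http_fetch(params):
    """Perform one GET against the MediaWiki API and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace with something that identifies your tool.
    req = urllib.request.Request(url, headers={"User-Agent": "hidden-cats-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def hidden_categories(fetch=http_fetch, limit=500):
    """Yield titles of the direct subcategories of Category:Hidden categories."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "format": "json",
        "cmtitle": "Category:Hidden categories",
        "cmtype": "subcat",
        "cmprop": "title",
        "cmlimit": limit,
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Merge the continuation tokens (e.g. cmcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Usage would be along the lines of `hidden = set(hidden_categories())`, which issues a few dozen requests for a category of this size.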

VladimirAlexiev, Apr 23 '15 08:04

There's a small difference between admin cats and hidden cats, see https://en.wikipedia.org/wiki/Wikipedia:PROJCATS: "administration category ... on article pages ... should be made a hidden category"

Nevertheless, I think Hidden_categories is the most precise way we have to find out which are admin cats.
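Once the list of hidden categories is available from either source, the filtering itself could look roughly like the Python sketch below. This is only an illustration of the idea, not how the extraction framework is structured: the function names, the `(page, category)` pair shape, and the sample titles are all hypothetical. The normalization step is there because the SQL route returns titles like `Foo_bar` without a namespace prefix, while the API returns `Category:Foo bar`.

```python
def normalize(title):
    """Reduce a category title to a comparable key: no 'Category:' prefix, underscores for spaces."""
    title = title.strip()
    if title.startswith("Category:"):
        title = title[len("Category:"):]
    return title.replace(" ", "_")


def drop_hidden(links, hidden_titles):
    """Yield only the (page, category) pairs whose category is not in the hidden set."""
    hidden = {normalize(t) for t in hidden_titles}
    for page, category in links:
        if normalize(category) not in hidden:
            yield page, category


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    links = [
        ("Berlin", "Category:Capitals in Europe"),
        ("Berlin", "Category:Articles with hCards"),
    ]
    hidden = ["Articles_with_hCards"]
    for page, category in drop_hidden(links, hidden):
        print(page, category)
```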

VladimirAlexiev, Apr 23 '15 09:04