remove Hidden_categories
Filter out maintenance (hidden) categories so they are not emitted in the dataset. These categories are useful only to Wikipedia maintainers, not to content consumers.
- https://en.wikipedia.org/wiki/Category:Hidden_categories has 15737 members (subcats). That's about 1.4% of the total 1.1M cats in enwiki, but it's the proverbial spoon of dirt in a jar of honey.
Unfortunately, DBpedia does not extract classifications that come from templates (transclusion); see #378. Most hidden cats are marked that way, so:
- http://dbpedia.org/page/Category:Hidden_categories has only 7 subcats (skos:broader)
I think extracting from templates will be very hard to implement. Other possible sources:
SQL
All classifications are available with SQL, e.g. from http://quarry.wmflabs.org:
select page_title, page_id
from page
join categorylinks on cl_from = page_id
where cl_to = 'Hidden_categories';
Quarry has a timeout of 10 minutes, so it isn't appropriate for large-scale querying. If you choose SQL, you'd probably have to make a local copy of the DB.
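To make the shape of the query concrete, here is a minimal sketch that runs it against a toy in-memory SQLite database. This is only a stand-in under stated assumptions: a real local copy would be MySQL/MariaDB, only the columns used by the query are modelled, and the rows (`Example_hidden_cat` etc.) and the helper name `hidden_categories` are made up for illustration.

```python
import sqlite3

# Toy stand-in for a local copy of the enwiki DB (the real one is MySQL/MariaDB).
# Only the columns the query touches are modelled; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table page (page_id integer primary key, page_namespace integer, page_title text);
    create table categorylinks (cl_from integer, cl_to text);
    insert into page values (1, 14, 'Example_hidden_cat'), (2, 14, 'Example_content_cat');
    insert into categorylinks values (1, 'Hidden_categories'), (2, 'Example_parent_cat');
""")

def hidden_categories(conn):
    """Return (page_title, page_id) for every direct member of Hidden_categories."""
    rows = conn.execute(
        "select page_title, page_id "
        "from page join categorylinks on cl_from = page_id "
        "where cl_to = ?",
        ("Hidden_categories",),
    )
    return sorted(rows.fetchall())

hidden = hidden_categories(conn)
print(hidden)
```

Against a real dump the same function should work with a MySQL driver's connection and cursor in place of `sqlite3`, though the placeholder syntax and the byte-string type of `page_title` would likely need adjusting.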
Wikipedia API
- https://www.mediawiki.org/wiki/API:Categorymembers
- https://www.mediawiki.org/wiki/Special:ApiSandbox
- http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&format=json&cmtitle=Category%3AHidden%20categories&cmprop=title%7Ctype&cmtype=page%7Csubcat&cmlimit=100
You could try different return formats:
- xml: human-readable, 1 line per result
- json: js data struct
- php=yaml: php data struct
- txt: indented text showing data struct
- dbg: indented text showing data struct
- wddx: xml similar to SPARQL results
Notes
- doesn't recurse into nested subcats, only direct members are returned (but all admin cats are direct members of Hidden_categories)
- ns is always included so type (page vs subcat) is not needed
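Since one request returns a limited number of members, collecting all of them means following the API's continuation values. A minimal sketch of that loop, assuming only subcats are wanted (`cmtype=subcat`) and that 500 is an acceptable `cmlimit`; the function names, the `fetch` parameter and the User-Agent string are made up for illustration:

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def fetch_json(params):
    """Default fetcher: one GET request against the API (needs network access)."""
    url = API + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "hidden-cats-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def hidden_category_titles(fetch=fetch_json, limit=500):
    """Yield the titles of all direct subcats of Category:Hidden categories."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "format": "json",
        "cmtitle": "Category:Hidden categories",
        "cmtype": "subcat",
        "cmlimit": limit,
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break  # no more batches
        # Merge the continuation values (e.g. cmcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Calling `list(hidden_category_titles())` would issue a few dozen requests for about 15737 members. The `fetch` argument is there so the loop can be exercised with canned responses instead of the live API.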
There's a small difference between admin cats and hidden cats. https://en.wikipedia.org/wiki/Wikipedia:PROJCATS says: "administration category..on article pages ..should be made a hidden category"
Nevertheless, I think Hidden_categories is the most precise way we have to find out which cats are admin cats.
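However the list of hidden cats is obtained, the filtering step itself is a set lookup. A minimal sketch, assuming the category links are available as (page, category) pairs; the function name `drop_hidden` and the sample names are hypothetical and do not reflect how the extraction framework actually represents its data:

```python
def drop_hidden(links, hidden):
    """Keep only (page, category) pairs whose category is not hidden."""
    hidden = set(hidden)  # set membership keeps the filter linear in len(links)
    return [(page, cat) for page, cat in links if cat not in hidden]

# Invented sample data for illustration.
links = [
    ("Example_article", "Example_content_cat"),
    ("Example_article", "Example_hidden_cat"),
]
kept = drop_hidden(links, ["Example_hidden_cat"])
print(kept)
```

The same lookup would also apply to the categories themselves (labels, skos:broader links), so that hidden cats do not show up anywhere in the emitted dataset.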