Retrieving the identifiers by occupation/instance of should be done against the Wikidata dump
IMPORTANT: do not use this method if paged SPARQL queries work fine.
~~We are waiting for direct access to the Wikidata dumps in the VPS machine:~~ ~~https://phabricator.wikimedia.org/T209818~~
Workflow example: MusicBrainz musicians.
INPUT: NT (triples) dump, i.e., wikidatawiki/entities/latest-truthy.nt.bz2;
- [ ] get all sub-classes of musician Q639669;
- [ ] for each sub-class + musician, filter all subjects (QIDs) with predicate occupation (P106) and object sub-class;
- [ ] for each filtered subject, if there is a triple with predicate MusicBrainz ID (P434), remove it from the set.
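
The three steps above could be sketched roughly as follows. This is only an illustration, not the project's code: the function names (`parse`, `subclasses`, `missing_identifier`, `read_dump`) are made up for this example. It assumes that sub-classes are found by following subclass of (P279) transitively, that the truthy dump uses the `http://www.wikidata.org/prop/direct/` predicate prefix, and that reading the dump twice is acceptable (once for the class tree, once for the filtering).

```python
import bz2
from collections import defaultdict, deque

# IRI prefixes assumed for the truthy NT dump.
ENTITY = "<http://www.wikidata.org/entity/"
DIRECT = "<http://www.wikidata.org/prop/direct/"


def parse(line):
    """Return (subject QID, PID, object) for a truthy statement, else None."""
    parts = line.split(" ", 2)
    if len(parts) != 3:
        return None
    subj, pred, rest = parts
    if not (subj.startswith(ENTITY) and pred.startswith(DIRECT)):
        return None
    # Drop the terminating " ." of the N-Triples line.
    obj = rest.rstrip().rstrip(".").rstrip()
    if obj.startswith(ENTITY):
        obj = obj[len(ENTITY):-1]  # keep only the QID
    return subj[len(ENTITY):-1], pred[len(DIRECT):-1], obj


def subclasses(lines, root):
    """Step 1: the root class plus everything reachable through P279."""
    children = defaultdict(set)
    for line in lines:
        triple = parse(line)
        if triple and triple[1] == "P279":
            children[triple[2]].add(triple[0])
    seen, queue = {root}, deque([root])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def missing_identifier(lines, classes, class_pid="P106", id_pid="P434"):
    """Steps 2 and 3: subjects in the given classes without the identifier."""
    candidates, linked = set(), set()
    for line in lines:
        triple = parse(line)
        if not triple:
            continue
        if triple[1] == class_pid and triple[2] in classes:
            candidates.add(triple[0])
        elif triple[1] == id_pid:
            linked.add(triple[0])
    return candidates - linked


def read_dump(path):
    """Stream the bz2-compressed dump line by line, without unpacking it."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        yield from dump
```

A run would then be `classes = subclasses(read_dump(dump_path), "Q639669")` followed by `missing_identifier(read_dump(dump_path), classes)`, where `dump_path` is a placeholder for wherever `latest-truthy.nt.bz2` lives. Two passes keep memory bounded to the class tree and the two QID sets, at the cost of decompressing the dump twice.
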
We are waiting for direct access to the Wikidata dumps in the VPS machine: https://phabricator.wikimedia.org/T209818
Task resolved:
`ls /public/dumps/public/wikidatawiki/entities`
Alternative SPARQL method discussed during WikiCite 2018: unwind the subclass of (P279) recursion.
See https://etherpad.wikimedia.org/p/WikiCite18Day3sparql
One-shot BASH done
We finally opted for paged SPARQL, leaving this open as an extra feature.