
Retrieving identifiers by occupation/instance of should be done against the Wikidata dump

Open MaxFrax opened this issue 7 years ago • 5 comments

IMPORTANT: do not use this method if paged SPARQL queries work fine.

~~We are waiting for direct access to the Wikidata dumps in the VPS machine:~~ ~~https://phabricator.wikimedia.org/T209818~~

MaxFrax avatar Nov 23 '18 15:11 MaxFrax

Workflow example: MusicBrainz musicians. Input: the N-Triples (NT) dump, i.e., wikidatawiki/entities/latest-truthy.nt.bz2. Steps:

  • [ ] get all sub-classes of musician (Q639669);
  • [ ] for each sub-class, plus musician itself, keep all subjects (QIDs) that have predicate occupation (P106) with that class as object;
  • [ ] for each kept subject, if there is a triple with predicate MusicBrainz ID (P434), remove that subject from the set.
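
The three steps above could be sketched in Python roughly as follows. This is only an illustration, not the code used in soweego: the helper names (`triples`, `qid`, `subclasses`, `missing_identifiers`) are hypothetical, and it assumes the truthy dump writes one triple per line with single-space separators, entities as `http://www.wikidata.org/entity/Q…` and direct properties as `http://www.wikidata.org/prop/direct/P…`. It reads the dump twice: once to collect the subclass of (P279) edges, once to filter subjects.

```python
from collections import defaultdict

# Assumed IRI prefixes of the truthy N-Triples dump.
ENTITY = "http://www.wikidata.org/entity/"
DIRECT = "http://www.wikidata.org/prop/direct/"


def triples(lines):
    """Yield (subject, predicate, object) raw terms from N-Triples lines."""
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        subj, pred, rest = parts
        # Drop the trailing " ." that terminates every triple.
        yield subj, pred, rest.rstrip().rstrip(".").rstrip()


def qid(term):
    """Return the QID of an entity term such as <.../entity/Q1>, else None."""
    prefix = "<" + ENTITY
    if term.startswith(prefix) and term.endswith(">"):
        return term[len(prefix):-1]
    return None


def subclasses(open_dump, root):
    """Step 1: the root class plus all its direct and indirect sub-classes."""
    subclass_of = f"<{DIRECT}P279>"
    children = defaultdict(set)
    for subj, pred, obj in triples(open_dump()):
        if pred == subclass_of:
            child, parent = qid(subj), qid(obj)
            if child and parent:
                children[parent].add(child)
    found, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def missing_identifiers(open_dump, root, id_property):
    """Steps 2 and 3: subjects with a matching occupation and no identifier."""
    occupation = f"<{DIRECT}P106>"
    identifier = f"<{DIRECT}{id_property}>"
    classes = subclasses(open_dump, root)
    candidates, linked = set(), set()
    for subj, pred, obj in triples(open_dump()):
        item = qid(subj)
        if item is None:
            continue
        if pred == occupation and qid(obj) in classes:
            candidates.add(item)
        elif pred == identifier:
            linked.add(item)
    return candidates - linked
```

`open_dump` is a zero-argument callable that returns a fresh iterable of lines on every call, so the dump can be read twice. For the compressed file it could be something like `lambda: bz2.open(path, "rt", encoding="utf-8")`, with `path` pointing at latest-truthy.nt.bz2, and the call would be `missing_identifiers(open_dump, "Q639669", "P434")`. Both sets are held in memory, which may or may not be acceptable on the VPS.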

marfox avatar Nov 23 '18 16:11 marfox

We are waiting for direct access to the Wikidata dumps in the VPS machine: https://phabricator.wikimedia.org/T209818

Task resolved: `ls /public/dumps/public/wikidatawiki/entities`

marfox avatar Dec 12 '18 15:12 marfox

Alternative SPARQL method discussed during WikiCite 2018: unwind the subclass of (P279) recursion. See https://etherpad.wikimedia.org/p/WikiCite18Day3sparql

marfox avatar Dec 14 '18 11:12 marfox

One-shot BASH done

marfox avatar Dec 17 '18 15:12 marfox

We finally opted for paged SPARQL, leaving this open as an extra feature.

marfox avatar May 06 '19 14:05 marfox