wikibase-dump-filter

filtering items with deprecated claims

Open rwst opened this issue 5 years ago • 2 comments

Applying the command

    bzcat latest-all.json.bz2 | wikibase-dump-filter --simplify --claim 'P698' | jq '[.id,.claims.P698,.claims.P921]' -c > PMID.ndjson

results in >30M lines like this:

["Q94880466",["19484558"],null]
["Q17485067",["21609473"],["Q18123741","Q12156","Q193430"]]

where the first case is an item with a P698 claim but no P921 claims, and the second has both P698 and P921 claims. However, out of these 30M lines there are at least six that are different:

    ralf@ark:~/wikidata> grep '\[\]' PMID.ndjson

["Q30573040",["23057853"],[]]
["Q30523792",["22888462"],[]]
["Q48835971",[],null]
["Q50125628",[],null]
["Q58616403",[],null]
["Q31128925",["27613570"],[]]

Note that 3 don't have P698 (which should not happen given the filter), and 3 have [] instead of null for no P921.

I'm not claiming there is a bug in wikibase-dump-filter, just that this needs investigating, and the ticket is a start. But maybe you have seen this and have an immediate explanation?

rwst, May 28 '20 07:05

Ah got it, these were deprecated claims. Should they appear at all?

rwst, May 28 '20 07:05

The problem comes from an untested situation, where you use both a --claim filter and --simplify:

  • the claim filter didn't take ranks into account, and let the deprecated statements through
  • while the simplify function, by default, keeps only the truthy statements.
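To make the interaction concrete, here is a minimal Python simulation of the two steps. It is not the tool's actual code: the function names, the entity shape, and the item ids are made up for illustration, and it assumes that simplification keeps a property key even when all of its statements are dropped, which is what the `[]` values in the report suggest.

```python
import json

def truthy(statements):
    """Best-rank statements: preferred ones if any exist, otherwise normal ones; never deprecated."""
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]

def claim_filter(entity, prop):
    """Rank-unaware filter (hypothetical): passes as soon as the property has any statement."""
    return bool(entity["claims"].get(prop))

def simplify(entity):
    """Truthy-only simplification (hypothetical): the property key survives with an empty list."""
    return {
        "id": entity["id"],
        "claims": {
            prop: [s["value"] for s in truthy(statements)]
            for prop, statements in entity["claims"].items()
        },
    }

def project(entity):
    """Mimics jq '[.id,.claims.P698,.claims.P921]': a missing key becomes None (JSON null)."""
    claims = entity["claims"]
    return [entity["id"], claims.get("P698"), claims.get("P921")]

# Made-up item whose only P698 statement is deprecated, with no P921 at all.
deprecated_pmid = {
    "id": "Q_A",
    "claims": {"P698": [{"value": "111", "rank": "deprecated"}]},
}
# Made-up item with a normal P698, but whose only P921 statement is deprecated.
deprecated_subject = {
    "id": "Q_B",
    "claims": {
        "P698": [{"value": "222", "rank": "normal"}],
        "P921": [{"value": "Q1", "rank": "deprecated"}],
    },
}

rows = [
    project(simplify(entity))
    for entity in (deprecated_pmid, deprecated_subject)
    if claim_filter(entity, "P698")
]
for row in rows:
    print(json.dumps(row, separators=(",", ":")))
```

Under these assumptions, the first item reproduces the `[id,[],null]` shape and the second the `[id,[...],[]]` shape seen in the grep output.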

The latter behavior can be disabled by passing a keepNonTruthy=true flag to the simplify function, but we could also consider having the filter check the simplify option to decide whether a match that relies only on non-truthy statements should be dropped.
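A rough sketch of that second option, in Python for brevity. The `simplify` and `keep_non_truthy` parameters and the entity shape are hypothetical, not the tool's real signature: the filter only counts truthy statements when the output is going to be simplified without non-truthy statements.

```python
def truthy(statements):
    """Best-rank statements: preferred ones if any exist, otherwise normal ones; never deprecated."""
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]

def claim_filter(entity, prop, simplify=False, keep_non_truthy=False):
    """Hypothetical rank-aware filter.

    When the output will be simplified down to truthy statements, only truthy
    statements count as a match; otherwise any statement does.
    """
    statements = entity["claims"].get(prop, [])
    if simplify and not keep_non_truthy:
        statements = truthy(statements)
    return len(statements) > 0

# Made-up item whose only P698 statement is deprecated.
entity = {
    "id": "Q_A",
    "claims": {"P698": [{"value": "111", "rank": "deprecated"}]},
}

matches_raw = claim_filter(entity, "P698")
matches_simplified = claim_filter(entity, "P698", simplify=True)
matches_keep_all = claim_filter(entity, "P698", simplify=True, keep_non_truthy=True)
print(matches_raw, matches_simplified, matches_keep_all)
```

With this approach, an item matched only through a deprecated statement is dropped before simplification, so no line with an empty P698 array would reach the output.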

maxlath, May 28 '20 08:05