wikibase-dump-filter
filtering items with deprecated claims
Applying the command
bzcat latest-all.json.bz2 | wikibase-dump-filter --simplify --claim 'P698' | jq '[.id,.claims.P698,.claims.P921]' -c > PMID.ndjson
results in >30M lines like this:
["Q94880466",["19484558"],null]
["Q17485067",["21609473"],["Q18123741","Q12156","Q193430"]]
where the first case is an item with a P698 claim but no P921 claims, and the second has both P698 and P921 claims. However, out of these 30M lines there are at least six that are different:
ralf@ark:~/wikidata> grep '[]' PMID.ndjson
["Q30573040",["23057853"],[]]
["Q30523792",["22888462"],[]]
["Q48835971",[],null]
["Q50125628",[],null]
["Q58616403",[],null]
["Q31128925",["27613570"],[]]
Note that 3 don't have P698 (which should not happen given the filter), and 3 have [] instead of null for no P921.
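For anyone wanting to reproduce the check without relying on a regex, here is a minimal Python sketch that classifies each line of the jq output. It assumes every row has the shape [id, P698 values, P921 values] produced by the command above; the function name find_anomalies and the reason strings are made up for illustration.

```python
import json

def find_anomalies(lines):
    """Yield (entity_id, reason) for rows shaped like [id, P698 values, P921 values]."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entity_id, pmids, subjects = json.loads(line)
        if not pmids:
            # P698 is null or an empty array: unexpected, since --claim 'P698' was used
            yield entity_id, "no P698 value"
        elif subjects == []:
            # P921 is an empty array rather than null
            yield entity_id, "empty P921 array"

# A few of the rows quoted in this issue
sample = [
    '["Q94880466",["19484558"],null]',
    '["Q17485067",["21609473"],["Q18123741","Q12156","Q193430"]]',
    '["Q30573040",["23057853"],[]]',
    '["Q48835971",[],null]',
]

for entity_id, reason in find_anomalies(sample):
    print(entity_id, reason)
```

To run it over the full output, pass an open file handle instead of the sample list, e.g. find_anomalies(open("PMID.ndjson")).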
I'm not claiming there is a bug in wikibase-dump-filter, just that this needs investigating, and the ticket is a start. But maybe you have seen this and have an immediate explanation?
Ah got it, these were deprecated claims. Should they appear at all?
The problem comes from an untested situation where you use both a --claim filter and --simplify:
- the claim filter doesn't look at ranks, so it lets items through even when their only matching statements are deprecated
- while the simplify function, by default, only keeps the truthy statements.
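The interaction can be illustrated with a small Python sketch. This is not the actual wikibase-dump-filter or wikibase-sdk code: passes_claim_filter, truthy and simplify_claims are hypothetical stand-ins, the entity Q0 and its value are made up, and "truthy" is taken in the usual Wikidata sense (preferred-rank statements if there are any, otherwise normal-rank ones, never deprecated ones).

```python
def passes_claim_filter(entity, prop):
    # Rank-unaware check: any statement for the property counts as a match
    return bool(entity.get("claims", {}).get(prop))

def truthy(statements):
    # Best-rank statements: preferred ones if present, otherwise normal ones
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]

def simplify_claims(claims, keep_non_truthy=False):
    # Reduce each statement to its bare value, dropping non-truthy ones by default
    simplified = {}
    for prop, statements in claims.items():
        kept = statements if keep_non_truthy else truthy(statements)
        simplified[prop] = [s["mainsnak"]["datavalue"]["value"] for s in kept]
    return simplified

# Made-up entity whose only P698 statement is deprecated
entity = {
    "id": "Q0",
    "claims": {
        "P698": [
            {"rank": "deprecated", "mainsnak": {"datavalue": {"value": "00000000"}}}
        ]
    },
}

matched = passes_claim_filter(entity, "P698")      # the filter lets the item through
default_output = simplify_claims(entity["claims"])  # simplify then drops the statement
print(matched, default_output)
```

With these assumptions the item matches the filter, yet its simplified P698 array comes out empty, which is the same shape as the rows with [] reported above.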
The latter behavior can be disabled by passing a keepNonTruthy=true flag to the simplify function, but we could also consider having the filter check the simplify option to decide whether a match that relies only on a non-truthy statement should be dropped.
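As a rough sketch of that second option, the filter could consult the simplify settings before accepting a match. This is only one possible reading of the proposal, written in Python for illustration: the function and parameter names are hypothetical and do not correspond to the tool's real internals.

```python
def truthy(statements):
    # Best-rank statements: preferred ones if present, otherwise normal ones
    preferred = [s for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s["rank"] == "normal"]

def passes_claim_filter(entity, prop, simplify=False, keep_non_truthy=False):
    statements = entity.get("claims", {}).get(prop, [])
    if simplify and not keep_non_truthy:
        # Simplification would drop non-truthy statements, so only truthy ones count
        statements = truthy(statements)
    return len(statements) > 0

# Made-up entity whose only P698 statement is deprecated
deprecated_only = {"claims": {"P698": [{"rank": "deprecated"}]}}

print(passes_claim_filter(deprecated_only, "P698"))
print(passes_claim_filter(deprecated_only, "P698", simplify=True))
```

Under this sketch, an item whose only P698 statement is deprecated still matches when no simplification is requested, but is dropped when simplification would leave it without a value.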