soweego

Find or propose a Wikidata property for confidence scores

Open marfox opened this issue 6 years ago • 3 comments

For probabilistic output, it would be optimal for the bot to add a qualifier with a float value, representing the confidence score of a given statement.

marfox, Mar 12 '19 09:03

To assess whether an existing qualifier property could be reused, I went through all of them with the following SPARQL query:

select *
where {
  ?p wdt:P31 wd:Q15720608;
     rdfs:label ?l;
     schema:description ?d.
  filter (lang(?l) = "en")
  filter (lang(?d) = "en")
  optional { ?p wikibase:propertyType ?t }
}
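
In case someone wants to rerun this outside the query service UI, here is a minimal Python sketch. It only builds the request URL and flattens a response in the standard SPARQL JSON results format; it does not send anything. The helper names `build_request_url` and `rows_from_sparql_json` are made up for illustration, and I am assuming the public endpoint at https://query.wikidata.org/sparql with the `format=json` parameter.

```python
import json
from urllib.parse import urlencode

# Assumption: the public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# The query from this comment, unchanged.
QUERY = """
select *
where {
  ?p wdt:P31 wd:Q15720608;
     rdfs:label ?l;
     schema:description ?d.
  filter (lang(?l) = "en")
  filter (lang(?d) = "en")
  optional { ?p wikibase:propertyType ?t }
}
"""


def build_request_url(query):
    """Return a GET URL asking the endpoint for JSON results (hypothetical helper)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def rows_from_sparql_json(payload):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    data = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]
```

The URL can then be fetched with any HTTP client, and the response body passed to `rows_from_sparql_json`.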

Relevant properties are:

| id | label | description | type |
|---|---|---|---|
| wd:P1107 | proportion | to be used as a qualifier, value must be between 0 and 1 | wikibase:Quantity |
| wd:P4271 | rating | qualifier to indicate a score given by the referenced source indicating the quality or completeness of the statement | wikibase:WikibaseItem |
| wd:P1480 | sourcing circumstances | qualification of the truth or accuracy of a source: circa (Q5727902), near (Q21818619), presumably (Q18122778), etc. | wikibase:WikibaseItem |
| wd:P2571 | uncertainty corresponds to | number of standard deviations (sigma) expressing the confidence level of a value | wikibase:WikibaseItem |

For these qualifiers, I extracted the number of triples, the properties they are most frequently used with, and the values they most frequently take, using the following query (replace the three occurrences of P1107 with each of the properties above):

select ?triples ?properties ?values {
  { select (count(*) as ?triples) { ?s pq:P1107 ?o } }
  { select (group_concat(?v; separator="; ") as ?properties) {
      {
        select ?p ?l (count(*) as ?n) {
          ?e ?p ?s . ?s pq:P1107 ?o .
          optional { ?pe wikibase:claim ?p ; rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?p ?l order by desc(?n) limit 10
      }
      bind (concat(strafter(str(?p), "http://www.wikidata.org/prop/"),
            " (", ?l, " - ", str(?n), ")") as ?v)
    }
  }
  { select (group_concat(?v; separator="; ") as ?values) {
      {
        select ?o ?l (count(*) as ?n) {
          ?s pq:P1107 ?o .
          optional { ?o rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?o ?l order by desc(?n) limit 10
      }
      bind (concat(coalesce(?l, str(?o)), " (", str(?n), ")") as ?v)
    }
  }
}

| qualifier | triples | properties | values |
|---|---|---|---|
| pq:P1107 | 9448 | P1344 (participant of - 1); P444 (review score - 3); P3357 (negative diagnostic predictor - 6); P3358 (positive prognostic predictor - 76); P3356 (positive diagnostic predictor - 95); P3359 (negative prognostic predictor - 244); P3355 (negative therapeutic predictor - 619); P3354 (positive therapeutic predictor - 1016) | 0.8 (56); 0.3 (57); 0.9 (65); 0.4 (67); 100 (68); 0.2 (83); 0.25 (88); 0.1 (97); 0.5 (253); 1 (3916) |
| pq:P4271 | 2060 | P1344 (participant of - 1); P444 (review score - 3); P3357 (negative diagnostic predictor - 6); P3358 (positive prognostic predictor - 76); P3356 (positive diagnostic predictor - 95); P3359 (negative prognostic predictor - 244); P3355 (negative therapeutic predictor - 619); P3354 (positive therapeutic predictor - 1016) | UEFA stadium categories (1); D (1); Charity Navigator four-star rating (2); CIViC 1-star trust rating (96); CIViC 5-star trust rating (104); CIViC 4-star trust rating (469); CIViC 2-star trust rating (497); CIViC 3-star trust rating (890) |
| pq:P1480 | 63354 | P19 (place of birth - 329); P1014 (AAT ID - 343); P2044 (elevation above sea level - 486); P170 (creator - 585); P276 (location - 662); P2031 (work period (start) - 705); P31 (instance of - 837); P570 (date of death - 4762); P569 (date of birth - 11954); P571 (inception - 37547) | attribution (62); unspecified calendar (186); fiscal year (259); possibly (452); possibly approximate value (503); hierarchical link is not direct (550); disputed (758); near (820); presumably (3124); circa (55854) |
| pq:P2571 | 11233 | P2374 (natural abundance - 1); P2201 (electric dipole moment - 1); P1855 (Wikidata property example - 1); P577 (publication date - 2); P2102 (boiling point - 8); P2101 (melting point - 9); P2114 (half-life - 905); P2160 (mass excess - 3435); P2067 (mass - 3435); P2154 (binding energy - 3436) | Long Term Evolution (1); 2 sigma (1); 8 (1); 1 (1); 5 (2); expanded uncertainty (15); standard deviation (11212) |
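
Since the per-property substitution is easy to get wrong by hand, here is a small Python sketch that generates the usage query for each candidate. It is only an illustration: the `$prop` placeholder, `USAGE_QUERY`, `CANDIDATES`, and `usage_query` are names introduced here, and the query text is the one above with the property ID turned into a placeholder in all three places.

```python
import re
from string import Template

# The usage query, with the qualifier ID replaced by $prop in three places.
USAGE_QUERY = Template("""
select ?triples ?properties ?values {
  { select (count(*) as ?triples) { ?s pq:$prop ?o } }
  { select (group_concat(?v; separator="; ") as ?properties) {
      {
        select ?p ?l (count(*) as ?n) {
          ?e ?p ?s . ?s pq:$prop ?o .
          optional { ?pe wikibase:claim ?p ; rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?p ?l order by desc(?n) limit 10
      }
      bind (concat(strafter(str(?p), "http://www.wikidata.org/prop/"),
            " (", ?l, " - ", str(?n), ")") as ?v)
    }
  }
  { select (group_concat(?v; separator="; ") as ?values) {
      {
        select ?o ?l (count(*) as ?n) {
          ?s pq:$prop ?o .
          optional { ?o rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?o ?l order by desc(?n) limit 10
      }
      bind (concat(coalesce(?l, str(?o)), " (", str(?n), ")") as ?v)
    }
  }
}
""")

# The four candidate qualifiers discussed in this comment.
CANDIDATES = ["P1107", "P4271", "P1480", "P2571"]


def usage_query(prop):
    """Return the usage query for one property ID, e.g. "P1480"."""
    if not re.fullmatch(r"P[1-9]\d*", prop):
        raise ValueError(f"not a property ID: {prop!r}")
    return USAGE_QUERY.substitute(prop=prop)


# One query string per candidate, ready to be sent to the query service.
queries = {prop: usage_query(prop) for prop in CANDIDATES}
```

Templating the ID means every subquery is guaranteed to refer to the same qualifier.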

Based on the tables above, it seems that:

  • wd:P1107 'proportion' is the only property accepting a quantity value (and the 0-1 range would be perfect for us), but it is essentially used to express percentage of possession / composition.
  • wd:P4271 'rating' is used almost exclusively with properties and 5-star rating values related to the CIViC database (a resource for Clinical Interpretation of Variants in Cancer).
  • wd:P1480 'sourcing circumstances' is also defined as 'accuracy', 'reliability', 'confidence', 'precision', 'certainty', 'validity', 'qualitative valuation', all terms that closely match our needs. It is used with a variety of properties, i.e., it appears to be domain-general, which is good for us. However, its values are categorical and their meaning is rather fuzzy.
  • wd:P2571 'uncertainty corresponds to' has a very precise definition (number of standard deviations), but unintuitively it takes an Item value, and in almost all cases that value is the constant 'standard deviation'. Besides, it is applied to numerical and date properties, for which a standard deviation makes sense.

Summing up, none of the properties above seems reusable as is. We can probably propose a variation of one of them, most likely wd:P4271 or wd:P1480.

fracorco, Mar 12 '19 12:03

Thanks a lot @fracor for the thorough analysis, much appreciated. My understanding is that none of the existing properties you listed fit our use case.

I suggest we go for a property proposal. Suggestions for the label of the new property are welcome.

marfox, Mar 13 '19 16:03

label: confidence score
description: a score interpretable as a probability estimate (from 0 to 1) given by the referenced source indicating the quality of the statement.
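
To make the proposal concrete, here is a rough Python sketch of the qualifier the bot might attach, assuming the new property gets the quantity datatype. The property ID `P0000` is a placeholder (no ID exists until a proposal is accepted), and `confidence_qualifier` is a hypothetical helper; the dictionary follows the Wikibase JSON shape for a quantity snak, with unit "1" meaning dimensionless.

```python
from decimal import Decimal

# Placeholder only: the real ID would be assigned if the proposal is accepted.
CONFIDENCE_PROPERTY = "P0000"


def confidence_qualifier(score, prop=CONFIDENCE_PROPERTY):
    """Build a quantity snak for a confidence score in the closed range [0, 1]."""
    value = Decimal(str(score))
    if not value.is_finite() or not (0 <= value <= 1):
        raise ValueError(f"confidence score must be between 0 and 1, got {score!r}")
    return {
        "snaktype": "value",
        "property": prop,
        "datatype": "quantity",
        "datavalue": {
            "type": "quantity",
            # Wikibase stores amounts as signed decimal strings.
            "value": {"amount": f"{value:+f}", "unit": "1"},
        },
    }


qualifier = confidence_qualifier(0.87)
```

Rejecting out-of-range values at construction time keeps the 0 to 1 constraint from the description enforceable on the bot side as well.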

Remper, Mar 13 '19 16:03