robot icon indicating copy to clipboard operation
robot copied to clipboard

Proposal: add a system for overriding QC checks

Open matentzn opened this issue 4 years ago • 4 comments

ROBOT report checks are are the most important vehicle for rolling out Quality control across the OBO sphere. The most important use for ROBOT report IMO is in CI - that is to make sure, continuously that ontologies do not accidentally introduce errors during curation. For that reason, we try to use --fail-on ERROR as much as possible.

However, for many checks there are some corner cases that are not captured. Here are a few examples:

  • A human-readable definition starts with a lower case letter because the entity in question starts with a lower case letter (often with diseases).
  • In 99% of the cases we don't want more than one logical definition, but sometimes we do #733
  • In 99% we don't want duplicate scope synonyms, but sometimes we do
  • In 99% of the cases we wont to be sure to have super_classes, except for roots

At the moment, if we have even one such case, we cannot use "fail on error". Which is very unfortunate.

I have been starting to use a system for overwriting such checks which works as follows:

Term level exclusion.

We use an annotation assertion to exclude a term from a particular QC check. This is realised using an FILTER NOT EXISTS {} clause as follows:

FILTER NOT EXISTS {
       ?term robot:excluded_from_qc_check robot:duplicate_label_synonym.sparql
}

Basically, the ?term, which was matched by the check, gets excluded if it has the annotation robot:excluded_from_qc_check robot:duplicate_label_synonym.sparql asserted to it.

Statement level exclusion

This works the same for statement (triple) level exclusion (I didn't double check the syntax, but the idea should be clear):

FILTER NOT EXISTS {
?term a owl:Class .
  [ a owl:Axiom ;
    owl:annotatedSource ?term ;
    owl:annotatedProperty rdfs:subClassOf ;
    owl:annotatedTarget ?term2 ;
    robot:excluded_from_qc_check robot:duplicate_label_synonym.sparql
  ]
}

@jamesaoverton @beckyjackson What do you think? If you agree with this, I would start implementing it in my own time - no need for you to do anything.

matentzn avatar Apr 12 '21 11:04 matentzn

I wish this wasn't necessary, but I accept that it is.

@matentzn Have you considered making it work the other way around? At first glance, I wouldn't want to store this information in the OWL, but rather in a TSV in git. So I would want it to work like this:

  1. run ROBOT report to generate TSV; I decide that I want to exclude row 12
  2. copy the header and row 12 to a new "exclusion.tsv" file (or append row 12 to an existing exclusions file), and add it to the repo; I think we don't need the "level" column, but everything else is useful
  3. run ROBOT report again, specifying --exclude exclusion.tsv; the output should be the same as (1) except that row 12 would either be marked with level "SUPPRESSED" (or "HIDDEN" something) and sorted to the bottom; there should also be a count of the number of suppressed rows.

The implementation would require ROBOT to read the exclusion TSV file then filter out specified results before they're formatted in whatever output format. I don't really know how this would be implemented in detail, but if @matentzn thinks this design make sense we would ask @beckyjackson.

The main problem I see is that this wouldn't work if there's blank node for subject or object, or we'd have to handle blank nodes specially.

jamesaoverton avatar Apr 13 '21 20:04 jamesaoverton

Wow this is also a good idea.. In fact, this could be both easier to use, and would save a lot of QC annotations in the ontology.. I like it! I would say though that exclusion.txt should works solely of "Rule name" and "subject" - so basically you have two columns (or more, but these two are functional) named Rule name and Subject (like in the ROBOT template) and then some logic that excludes Rule X for subject Y. Cool! This also will keep SPARQL queries at a minimum complexity (which is important for runtime).

matentzn avatar Apr 14 '21 08:04 matentzn

Ok. Yes, I guess 'Rule name' and 'Subject' are enough. An optional 'Comment' is also a good idea. It should be clear when you look at the list why these checks are being excluded. It should also be slightly painful, so people don't just exclude stuff out of laziness.

@beckyjackson How would you suggest implementing this? Or if you have a better suggestion, let us know.

jamesaoverton avatar Apr 14 '21 12:04 jamesaoverton

Ultimately, lazyness wont help them because what will count on OBO level is what OBO Dashboard deems as exclusion criteria - so if they exclude on ontology level, this wont mean the dashboard suddenly turns green! This is just so that QC control is fine grained enough to allow people to configure there error levels properly.

matentzn avatar Apr 14 '21 12:04 matentzn