Skip invalid lines when converting out of OBO
The Cellosaurus ontology contains many invalid lines, e.g. the following line has improperly escaped curly braces in the molecule's name:
comment: "Group: Patented cell line. Registration: International Depositary Authority, China Center for Type Culture Collection; CCTCC C2014222. Monoclonal antibody isotype: IgG1, kappa. Monoclonal antibody target: ChEBI; CHEBI:144925; 1-(4-methoxyphenyl)-2-{[4-(4-nitrophenyl)butan-2-yl]amino}ethanol (Phenylethylamine A)."
If you run robot convert -I https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo -o ~/Desktop/cellosaurus.json -vvv and look very carefully for the relevant error (for now, you have to search the output for org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser; #1038 would be helpful for this), you find that:
LINENO: 29219 - Missing '=' in trailing qualifier block. This might happen for not properly escaped '{', '}' chars in comments.
LINE: comment: "Monoclonal antibody isotype: IgG2a, kappa. Monoclonal antibody target: ChEBI; CHEBI:144925; Phenylethylamine A (1-(4-methoxyphenyl)-2-{[4-(4-nitrophenyl)butan-2-yl]amino}ethanol)."
org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:220)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.actualParse(OWLOntologyManagerImpl.java:1254)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1208)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntologyFromOntologyDocument(OWLOntologyManagerImpl.java:1165)
org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:531)
org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:417)
org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:298)
org.obolibrary.robot.CommandLineHelper.getInputOntology(CommandLineHelper.java:487)
org.obolibrary.robot.CommandLineHelper.updateInputOntology(CommandLineHelper.java:585)
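Until something like #1038 exists, the search through the verbose output can be scripted. The following is only a minimal sketch: it assumes the parser errors always appear in the "LINENO: <number> - <message>" form shown in the excerpt above, and the function name find_obo_parse_errors is made up for illustration (it is not part of ROBOT or the OWL API).

```python
import re

# Assumption: errors look like "LINENO: 29219 - <message>", as in the excerpt above.
LINENO_RE = re.compile(r"LINENO:\s*(\d+)\s*-\s*(.*)")


def find_obo_parse_errors(log_text):
    """Return (line_number, message) pairs for each 'LINENO:' entry in the log text."""
    errors = []
    for line in log_text.splitlines():
        match = LINENO_RE.search(line)
        if match:
            errors.append((int(match.group(1)), match.group(2).strip()))
    return errors
```

To use it, save the output of the robot convert command to a file, read that file into a string, and pass the string to find_obo_parse_errors to get the offending line numbers of the OBO file.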
This ontology doesn't do its curation in an open source way, so it's difficult to communicate and help solve this issue. Further, I downloaded the file and started making fixes one at a time, but I have to re-run robot convert after every fix. It would be nice if there were a setting that allowed invalid lines to be skipped during OBO parsing.
CC @AmosBairoch @lubianat
Update: this is the same underlying issue as https://github.com/ebi-chebi/ChEBI/issues/4273
Hmm, I think this is outside the scope of ROBOT. If you want this to happen, you have to go through https://github.com/owlcs/owlapi/issues/ or join the #obo-format channel on OBO Slack, where @balhoff is currently thinking about prefix maps for OBO format and other fixes; he may be amenable to this. But I don't think this is a ROBOT issue per se: if the raw data is broken, the tool can't be expected to deal with all eventualities, so I would simply run a grep -v on the OBO file prior to parsing. If you agree, can you close the issue?
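A plain grep -v on curly braces would also drop valid lines, since OBO uses {...} for legitimate trailing qualifier blocks. As a rough sketch of a narrower pre-filter (written in Python rather than grep so the escaping check is explicit), the code below drops only lines that start with a chosen tag and contain a brace not preceded by a backslash. The function names and the default tag list ("comment:" only) are assumptions for illustration, not anything ROBOT provides, and dropping lines loses data, so the dropped lines are returned for review.

```python
import re

# A '{' or '}' that is not immediately preceded by a backslash.
UNESCAPED_BRACE = re.compile(r"(?<!\\)[{}]")


def drop_lines_with_unescaped_braces(lines, tags=("comment:",)):
    """Split lines into (kept, dropped); a line is dropped only if it starts
    with one of `tags` and contains an unescaped curly brace."""
    kept, dropped = [], []
    for line in lines:
        if line.startswith(tags) and UNESCAPED_BRACE.search(line):
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped


def filter_obo_file(src_path, dst_path, tags=("comment:",)):
    """Write a filtered copy of src_path to dst_path and return the dropped lines."""
    with open(src_path, encoding="utf-8") as src:
        kept, dropped = drop_lines_with_unescaped_braces(src, tags)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(kept)
    return dropped
```

For example, filter_obo_file("cellosaurus.obo", "cellosaurus.filtered.obo") would produce a copy that can then be passed to robot convert. Other tags can be added through the tags argument if the same error turns up outside comments.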
This exact issue is a problem with the currently released ChEBI OBO file: https://github.com/ebi-chebi/ChEBI/issues/4273
Rethinking this now: I could implement a "repair --obo-format" option that deals with the most frequent violations, such as multiple labels and multiple comments. I would be open to this, but it would have to be now!
Sorry, I now realise I discussed this here: https://github.com/ontodev/robot/issues/995 and that this (handling broken rows) is not possible at all right now without a major OWLAPI update.
This needs to be added as a ticket for either the OWL API or oboformat: https://github.com/owlcollab/oboformat/issues
I will close this now, as what ROBOT can do about this can be covered by #995