Standard TSV exports for ontologies

Open cmungall opened this issue 5 years ago • 28 comments

Many people like to browse ontologies as TSVs in Excel etc., or to run programmatic operations over tabular serializations (RDBMSs, Pandas, R data frames, Perl hacks, Unix grep/sort, etc.).

It's fairly trivial to write custom SPARQL (although it gets more complex when multiple columns are multi-valued), but there is no sense in every ontology doing this independently. ROBOT can provide standard exports, just as it provides standard QC reports.

I am currently trying custom exports from RO, e.g.

https://github.com/oborel/obo-relations/blob/issue-295/src/ontology/subsets/ro-biotic-interaction.csv

(analogous fields can be made for classes)

  • Should we define a core set of tabular exports in ROBOT?
    • This could be a tsv export alongside existing obo, owl, json exports
  • Which exports? I suggest
    • objects (relations, classes, instances)
    • existential graph
    • terminological (synonyms, labels)
  • how to deal with multi-valued fields (e.g. |-separated?)
  • allow customizability (e.g. adding new fields - how) vs standard
  • implementation: SPARQL?

cc @jhpoelen

cmungall avatar Apr 05 '19 04:04 cmungall

Having a tabular export would be pretty neat! For a specific use case, see https://github.com/jhpoelen/eol-globi-data/issues/386 and https://github.com/oborel/obo-relations/issues/295 . Happy to provide more details if needed.

jhpoelen avatar Apr 05 '19 16:04 jhpoelen

My only concern is that everybody will want something different...

If we want customizability without SPARQL, then I would configure the table using an ordered list of column names, specified by predicate labels, plus a few special keywords (IRI, CURIE). Then I would join multiple values with pipes. Whatever we do, I think consistent sort order is very important.
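The column-list idea above can be sketched roughly as follows. This is only an illustration, not ROBOT code: the `export_table` and `to_curie` names, the `example.org` IRIs, and the in-memory statement tuples are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def to_curie(iri):
    """Assumed OBO-style shortening: .../EX_0000001 -> EX:0000001."""
    return iri.rsplit("/", 1)[-1].replace("_", ":", 1)


def export_table(statements, columns, sep="|"):
    """Render (subject IRI, predicate label, value) statements as TSV text."""
    values = defaultdict(lambda: defaultdict(set))
    for subject, predicate_label, value in statements:
        values[subject][predicate_label].add(value)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for subject in sorted(values):  # consistent row order
        row = []
        for column in columns:
            if column == "IRI":  # special keyword
                row.append(subject)
            elif column == "CURIE":  # special keyword
                row.append(to_curie(subject))
            else:
                # sorted() keeps multi-valued cells stable between runs
                row.append(sep.join(sorted(values[subject][column])))
        writer.writerow(row)
    return out.getvalue()


# Hypothetical data, not taken from any real ontology.
STATEMENTS = [
    ("http://example.org/EX_0000002", "label", "eats"),
    ("http://example.org/EX_0000001", "label", "interacts with"),
    ("http://example.org/EX_0000002", "synonym", "preys on"),
    ("http://example.org/EX_0000002", "synonym", "consumes"),
]

print(export_table(STATEMENTS, ["CURIE", "label", "synonym"]))
```

Sorting both the rows and the values inside each cell is what makes two exports of the same ontology diff cleanly.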

My ideal implementation would be the inverse of template, allowing exact round-trips. @rctauber and I have played with that idea, but the interesting bits are hard.

jamesaoverton avatar Apr 05 '19 17:04 jamesaoverton

+1 to above

On an unrelated note: I would love the ? to disappear from the column headers when we use robot query with SPARQL so that I can use the select clause in SPARQL to map to the exact column headers I need for example for a DOSDP scenario.
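Until the query output changes, a small post-processing step can do the renaming. The sketch below assumes the result is TSV text whose first row holds the SPARQL variable names with a leading `?`; the function name is made up for illustration.

```python
def strip_variable_marks(tsv_text):
    """Remove the leading '?' from each column name in the header row only."""
    header, newline, body = tsv_text.partition("\n")
    names = [name[1:] if name.startswith("?") else name
             for name in header.split("\t")]
    return "\t".join(names) + newline + body


# Hypothetical query result; data rows are left untouched.
print(strip_variable_marks("?defined_class\t?label\nEX:1\twhat?\n"))
```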

matentzn avatar Apr 05 '19 17:04 matentzn

Yes, definitely remove the ?s, see #176 for requests on existing sparql query output format

My only concern is that everybody will want something different...

Maybe, but I think having a standard that people can supplement will go a long way

I like the reverse-template idea. To do this the purist way (ie treat the template "language" as both a generative grammar and parser) is appealing but hard, I'd go for practical.

If we want customizability without SPARQL, then I would configure the table using an ordered list of column names, specified by predicate labels, plus a few special keywords (IRI, CURIE). Then I would join multiple values with pipes. Whatever we do, I think consistent sort order is very important.

This sounds ideal. Some additional constructs for existentials would be great. (I know you like to think in triples but a lot of the stuff that biologists often care about may be embedded in more complex axioms, e.g. part-of)

cmungall avatar Apr 05 '19 23:04 cmungall

Would this be overkill? https://www.w3.org/TR/r2rml/

cmungall avatar Apr 05 '19 23:04 cmungall

@cmungall I don't think R2RML is what we want for this. Let me know if I'm misunderstanding...

I've worked with R2RML for converting SQL tables to RDF in batches. Eventually https://github.com/nkons/r2rml-parser worked for me. There's also Ontop, which can use R2RML or their alternative flavour of config to convert a SPARQL query to a SQL query on the fly and return triples. I never got Ontop working the way I wanted. There's no support for OWL in this approach, as far as I know, and of course no Manchester support.

We already have ROBOT templates, DOS-DP, etc. for tables->triples. SPARQL is the fundamental triples->tables tool, but OWL is hard to handle with SPARQL. For proper OWL support I see two options:

  1. DLQuery (not a good match for ROBOT templates)
  2. Convert Manchester-with-variables, e.g. ?X part of some (?Y and ?Z), into a SPARQL triple pattern, run the query, then render the resulting ?X, ?Y, ?Z back to Manchester. I don't know that OWLAPI would help at all with the first part. I've got Manchester<->triples parsing and rendering in my Knotation library.

jamesaoverton avatar Apr 09 '19 12:04 jamesaoverton

Assuming that owl is rdf and rdf attempts to capture all we know in three term statements, why not start with exporting a three column tsv file? Sort of like ntriples/quads you can easily import into a spreadsheet.

jhpoelen avatar Apr 09 '19 22:04 jhpoelen

It's easy enough to export a table of triples, and that works fine to filter for annotations that you care about. But things immediately get more complicated and less useful when you want to do any more than that:

  • you probably want a fourth column to pull out the datatype, and maybe a fifth column for the language tag
  • you might want distinct columns for lexical values of literals and for object URIs
  • you might want distinct columns for subject URI vs blank node, and object URI vs blank node; in Knotation I use eight fields to represent a "triple"
  • the OWL to RDF mapping can quickly get complicated: https://www.w3.org/TR/2012/REC-owl2-mapping-to-rdf-20121211/
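To make the extra columns in the list above concrete, here is a rough sketch that splits one N-Triples-style object term into separate fields. The five-field layout and the `object_columns` name are assumptions for illustration; this is not the eight-field Knotation layout, and escape sequences inside literals are left unprocessed.

```python
import re

# "lexical" optionally followed by @lang or ^^<datatype>
LITERAL = re.compile(
    r'^"(?P<lexical>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z0-9-]+)|\^\^<(?P<datatype>[^>]+)>)?$'
)


def object_columns(term):
    """Return (iri, blank_node, lexical, datatype, language) for one term."""
    if term.startswith("<") and term.endswith(">"):
        return (term[1:-1], "", "", "", "")
    if term.startswith("_:"):
        return ("", term, "", "", "")
    match = LITERAL.match(term)
    if match is None:
        raise ValueError(f"not an N-Triples term: {term!r}")
    return ("", "", match["lexical"], match["datatype"] or "", match["lang"] or "")


for term in ['"some label"@en', "_:b1",
             "<http://purl.obolibrary.org/obo/UBERON_0011796>"]:
    print("\t".join(object_columns(term)))
```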

This basic Manchester expression, 'primary remex feather' and ('in taxon' some 'Tyto alba'), becomes this Turtle:

[ rdf:type owl:Class ;
  owl:intersectionOf (
    obo:UBERON_0011796
    [
      rdf:type owl:Restriction ;
      owl:onProperty obo:RO_0002162 ;
      owl:someValuesFrom obo:NCBITaxon_56313 ;
    ]
  ) ;
] .

which becomes these triples (approximately):

_:b1 rdf:type owl:Class .
_:b1 owl:intersectionOf _:b2 .
_:b2 rdf:first obo:UBERON_0011796 .
_:b2 rdf:rest _:b3 .
_:b3 rdf:first _:b4 .
_:b3 rdf:rest rdf:nil .
_:b4 rdf:type owl:Restriction .
_:b4 owl:onProperty obo:RO_0002162 .
_:b4 owl:someValuesFrom obo:NCBITaxon_56313 .

It's a pain to read those triples or query them with SPARQL. I've done it before, but it's a pain. That's why I'm dreaming of something more friendly.
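For a sense of what "more friendly" would involve, here is a rough sketch that walks the nine triples above and renders them back to the Manchester string. It only handles intersections and existential restrictions, assumes one object per subject and predicate, and takes labels from a hand-written dictionary; it is not how Knotation or OWLAPI does it.

```python
TRIPLES = [
    ("_:b1", "rdf:type", "owl:Class"),
    ("_:b1", "owl:intersectionOf", "_:b2"),
    ("_:b2", "rdf:first", "obo:UBERON_0011796"),
    ("_:b2", "rdf:rest", "_:b3"),
    ("_:b3", "rdf:first", "_:b4"),
    ("_:b3", "rdf:rest", "rdf:nil"),
    ("_:b4", "rdf:type", "owl:Restriction"),
    ("_:b4", "owl:onProperty", "obo:RO_0002162"),
    ("_:b4", "owl:someValuesFrom", "obo:NCBITaxon_56313"),
]
LABELS = {
    "obo:UBERON_0011796": "primary remex feather",
    "obo:RO_0002162": "in taxon",
    "obo:NCBITaxon_56313": "Tyto alba",
}


def index(triples):
    """Map subject -> {predicate: object}; assumes one object per predicate."""
    graph = {}
    for subject, predicate, obj in triples:
        graph.setdefault(subject, {})[predicate] = obj
    return graph


def read_list(node, graph):
    """Follow rdf:first / rdf:rest links until rdf:nil."""
    items = []
    while node != "rdf:nil":
        items.append(graph[node]["rdf:first"])
        node = graph[node]["rdf:rest"]
    return items


def render(node, graph, labels, nested=False):
    """Render a named entity or a supported anonymous expression."""
    if not node.startswith("_:"):
        label = labels.get(node)
        return f"'{label}'" if label else node
    props = graph[node]
    if "owl:intersectionOf" in props:
        members = read_list(props["owl:intersectionOf"], graph)
        text = " and ".join(render(m, graph, labels, nested=True) for m in members)
    elif "owl:someValuesFrom" in props:
        prop = render(props["owl:onProperty"], graph, labels)
        filler = render(props["owl:someValuesFrom"], graph, labels, nested=True)
        text = f"{prop} some {filler}"
    else:
        raise ValueError(f"unsupported expression at {node}")
    return f"({text})" if nested else text


print(render("_:b1", index(TRIPLES), LABELS))
```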

jamesaoverton avatar Apr 10 '19 11:04 jamesaoverton

I can see how blank nodes may confuse some and how data formatting issues might cause some interesting cases (e.g., "some label"@en).

Just to reiterate: my use case is to help data managers / dataset curators more easily find and import lists of terms to link to and re-use. Over the years of promoting re-use of "ontologies", I found a common source of disconnect to be ontologists who say "ah, that's easy, just write a sparql" while tabular-oriented folks are like "ahm, rdf looks kinda scary, can you help me make this happen?", after which the exchange ends due to lack of academic carrots, interest, time, or money. I'd very much like to continue to say "hey just re-use interaction terms from OBO RO", and am now tempted to include "and hire an ontologist to make sense of it all."

Thanks for listening and responding. I hope my comments help put things in (tabular) perspective.

jhpoelen avatar Apr 10 '19 19:04 jhpoelen

How would it make sense to handle anonymous classes?

If a user wants a property as a column, e.g. has_part, it would be easy to return the axioms on individuals. has_part is probably used in anonymous class expressions, though (x has_part some y). These would be returned under columns like rdfs:subClassOf and owl:equivalentClass, but should they show up under the has_part column as well?

beckyjackson avatar Apr 24 '19 15:04 beckyjackson

I think so. Most users don't think in OWL, more in terms of existential graphs (though they wouldn't call it that)

cmungall avatar Apr 24 '19 17:04 cmungall

Why use a phrase "existential graph" if you know that "most users" don't know what it means?

jhpoelen avatar Apr 24 '19 18:04 jhpoelen

This is an owl-geek space here so we need to be able to communicate the mappings between the language of the underlying model (OWL) and user requirements. The concept of an existential graph provides a powerful bridge between OWL constructs (which are not understood by most developers) and familiar constructs such as graphs. See https://douroucouli.wordpress.com/2016/10/04/a-developer-friendly-json-exchange-format-for-ontologies/

cmungall avatar Apr 24 '19 19:04 cmungall

Ok. thanks for the reference.

jhpoelen avatar Apr 24 '19 19:04 jhpoelen

That makes sense to me. How would we differentiate between subclass, equivalent, and disjoint expressions? Maybe by putting the whole statement in? But that seems overly redundant... So I'd love to hear your thoughts.

beckyjackson avatar Apr 24 '19 20:04 beckyjackson

There's also the question of conjuncts vs. disjuncts.

I can break the class expression into a conjunct set and put the correct component in the column (based on the property used so you don't get something like (has_part some x) and (located_in some y) and z in your has_part column), but it seems like a class expression from a disjunct set can't exist alone.

For equivalent classes, it might not even make sense to break it down at all... I might be overthinking this, but I want to make sure the logic is correctly conveyed in these sheets.

beckyjackson avatar Apr 26 '19 14:04 beckyjackson

@cmungall: We'd really appreciate your feedback on @rctauber's questions.

jamesaoverton avatar May 02 '19 18:05 jamesaoverton

First, I think the report should be executed on relaxed ontologies, so it should be sufficient to query subClassOf axioms (with named class parents or simple unnested existentials). No need for duplicative code to unwind.

Definitional equivalence axioms (e.g. between a named class and anon class expressions) are a more advanced feature. Not sure it makes sense to have them in a generic TSV report. In many cases it's possible to get template-specific reports by running dosdp query.

The main question is whether to have one column per object property (meaning the TSV column headings would vary by ontology) or a generic 'parents' column where the values are property-value tuples (requiring minimal parsing). I am tending towards the former.
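The two layouts can be compared with a small sketch. The axiom tuples, the `EX:` identifiers, and both function names are hypothetical; a property of `None` stands for a plain named-class parent.

```python
from collections import defaultdict

# Hypothetical relaxed SubClassOf axioms: (child, property, parent).
AXIOMS = [
    ("EX:finger", None, "EX:digit"),
    ("EX:finger", "part_of", "EX:hand"),
    ("EX:hand", "part_of", "EX:arm"),
    ("EX:hand", None, "EX:limb segment"),
]


def wide_rows(axioms):
    """One column per property, so the headers vary by ontology."""
    cells = defaultdict(lambda: defaultdict(list))
    for child, prop, parent in axioms:
        cells[child][prop or "subClassOf"].append(parent)
    columns = sorted({col for row in cells.values() for col in row})
    table = [["ID"] + columns]
    for child in sorted(cells):
        table.append([child] + ["|".join(sorted(cells[child].get(col, [])))
                                for col in columns])
    return table


def generic_rows(axioms):
    """A single 'parents' column holding 'property parent' pairs."""
    cells = defaultdict(list)
    for child, prop, parent in axioms:
        cells[child].append(f"{prop or 'subClassOf'} {parent}")
    return [["ID", "parents"]] + [[child, "|".join(sorted(cells[child]))]
                                  for child in sorted(cells)]


for row in wide_rows(AXIOMS) + generic_rows(AXIOMS):
    print("\t".join(row))
```

The wide layout needs no parsing by the consumer, while the generic layout keeps the header fixed across ontologies at the cost of splitting each cell on `|` and then on the first space.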

cmungall avatar May 02 '19 19:05 cmungall

In a prototype version of this command that I made, I included a --columns option to specify which properties to include. You can provide things like rdfs:subClassOf, or specify property names and those become the column headers. We could also just have it export all object properties (maybe with --columns ALL, or it could be default behavior...).

If we don't care about subclass vs. equivalence then it would be easier to include the class expressions under their respective object properties. What are your thoughts on conjuncts vs. disjuncts?

beckyjackson avatar May 17 '19 18:05 beckyjackson

On Fri, May 17, 2019 at 11:32 AM Becky Jackson wrote:

In a prototype version of this command that I made, I included a --columns option to specify which properties to include. You can provide things like rdfs:subClassOf, or specify property names and those become the column headers. We could also just have it export all object properties (maybe with --columns ALL, or it could be default behavior...).

sounds great

If we don't care about subclass vs. equivalence then it would be easier to include the class expressions under their respective object properties. What are your thoughts on conjuncts vs. disjuncts?

I think my preference would be to ignore equivalence axioms. If we assume running relax-reduce first (which I think is a good idea) then the basic graph information people care about will be there in SubClassOf axioms
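The reason the equivalence axioms can be ignored after relaxing is that each conjunct of a definition becomes its own SubClassOf axiom. A toy sketch of that idea, using a made-up tuple representation rather than ROBOT's actual relax implementation (and leaving out the reduce step entirely):

```python
def relax(equivalences):
    """Turn each (named_class, [conjunct, ...]) definition into SubClassOf tuples.

    A conjunct is either a class name or a (property, filler) pair, i.e. a
    simple unnested existential. Output tuples are (child, property, parent),
    with property None for a named-class parent.
    """
    axioms = []
    for named, conjuncts in equivalences:
        for conjunct in conjuncts:
            if isinstance(conjunct, tuple):
                prop, filler = conjunct
                axioms.append((named, prop, filler))
            else:
                axioms.append((named, None, conjunct))
    return axioms


# Hypothetical definition: 'owl feather' EquivalentTo feather and (in_taxon some owl)
DEFINITIONS = [("EX:owl feather", ["EX:feather", ("in_taxon", "EX:owl")])]
for axiom in relax(DEFINITIONS):
    print(axiom)
```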

cmungall avatar May 17 '19 18:05 cmungall

Thanks for the pointer @cmungall! The DUO use case is very minimal - a table of IDs and labels/definitions - no requirement (yet) for restrictions etc. It'd be great if ROBOT could have a minimal feature for this which I think will be common enough for most resources? Then more complex features can be added to the export?

mcourtot avatar Jun 05 '19 13:06 mcourtot

There is always unix 'cut'

cmungall avatar Jun 10 '19 23:06 cmungall

Everyone who cares about this issue: PR #481 is almost ready to go, but I'd like feedback on a few points. Please read my comment over there and reply.

jamesaoverton avatar Jun 12 '19 14:06 jamesaoverton

This appears to be a nice feature. We may add such a feature to Ontobee as a web feature as well.

yongqunh avatar Dec 22 '19 22:12 yongqunh

We now have the export command. Can we close this issue, or does it require more discussion?

beckyjackson avatar Feb 02 '21 15:02 beckyjackson

@beckyjackson great to hear that robot now has an export command. Can you provide (or point to) some examples of how to use it? For instance, how would I create a TSV export of the Relations Ontology? Thanks for all your work!

jhpoelen avatar Feb 02 '21 15:02 jhpoelen

@jhpoelen The docs can be found here: http://robot.obolibrary.org/export

If you have questions on how to create a specific export, please let me know!

beckyjackson avatar Feb 02 '21 15:02 beckyjackson

@beckyjackson thanks!

jhpoelen avatar Feb 02 '21 15:02 jhpoelen