cwlprov
cwlprov:relationship sketch
Together with #1 this attempts to find a way to pre-define domain-specific provenance that would be generated at workflow run time. The idea is to define a set of relationships that will be added to the outputs a step produces, relating them to other data values or concepts at creation time.
These can use domain-specific ontologies like the EDAM ontology or BioSchemas, or more generic ones like PROV or schema.org.
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow
inputs:
  first_input: File
  second_input: long
steps: []
outputs:
  first_output:
    type: File
    outputSource: first_input
    cwlprov:relationships:
      prov:wasDerivedFrom: [ '#inputs.second_input' ]
      prov:wasInfluencedBy: [ '#inputs.second_output' ]
$namespaces:
  prov: http://www.w3.org/ns/prov#
  cwlprov: https://w3id.org/cwl/prov#
$schemas:
  - http://www.w3.org/ns/prov.owl
As this relationship is to be generated between the values of first_output and second_output, I think we need some kind of template or expression?
JSON-LD with $expansions
cwlprov:relationship:
  { "@id": "$second_output",
    "prov:wasDerivedFrom": "$first_output" }
Or, if we assume the current port is the subject and you can't do arbitrary structures, you can just have property-object references (no literals in this case):
cwlprov:relationship: {
  "prov:wasDerivedFrom": "$first_output",
  "example:foo": "edam:topic_0091"
}
Namespaces like prov and edam here must be defined in CWL $namespaces. The template is expanded based on identifiers for the produced values (e.g. urn:uuid:8c97eb7a-94d8-40bf-a932-7e888445f2ec).
If we have:
{ "first_output": {
    "@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"
  },
  "second_output": {
    "@id": "urn:uuid:6e076c8b-d3fe-47f0-844b-b0e1561d3181"
  }
}
Then with expansion of namespaces and $variables we get:
{ "first_output": {
    "@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"
  },
  "second_output": {
    "@id": "urn:uuid:6e076c8b-d3fe-47f0-844b-b0e1561d3181",
    "http://www.w3.org/ns/prov#wasDerivedFrom": {
      "@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"
    },
    "http://example.com/foo": {
      "@id": "http://edamontology.org/topic_0091"
    }
  }
}
[ updated by @mr-c to add missing commas, make the UUIDs unique ]
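A rough sketch of how the property-object expansion above could work, in Python. The function names `expand_curie` and `expand_relationship` are made up for illustration and are not part of cwlprov; the `example` and `edam` namespace IRIs are inferred from the expanded output above rather than taken from any real `$namespaces` block. It assumes every template value is either a `$variable` naming a port or a prefixed name, with no literals.

```python
def expand_curie(value, namespaces):
    """Expand 'prefix:local' to a full IRI if the prefix is declared."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return value  # e.g. urn:uuid:... is left unchanged


def expand_relationship(port, template, namespaces, values):
    """Fill a property-object template, using `port` as the subject."""
    # Copy the bindings so the caller's dictionaries are not modified.
    result = {name: dict(node) for name, node in values.items()}
    for prop, obj in template.items():
        if obj.startswith("$"):
            # $variable: use the identifier of the value produced at that port
            target = values[obj[1:]]["@id"]
        else:
            target = expand_curie(obj, namespaces)
        result[port][expand_curie(prop, namespaces)] = {"@id": target}
    return result


# Assumed namespace bindings (would come from CWL $namespaces)
namespaces = {
    "prov": "http://www.w3.org/ns/prov#",
    "edam": "http://edamontology.org/",
    "example": "http://example.com/",
}
template = {
    "prov:wasDerivedFrom": "$first_output",
    "example:foo": "edam:topic_0091",
}
values = {
    "first_output": {"@id": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d"},
    "second_output": {"@id": "urn:uuid:6e076c8b-d3fe-47f0-844b-b0e1561d3181"},
}

expanded = expand_relationship("second_output", template, namespaces, values)
```

Here the template is treated as attached to second_output, so that port's value becomes the subject of both statements.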
@stain Thank you for the json-ld example.
I've updated my sketch to show that we might want to set relationships between an output and another output and also an input
OK, in 036af7c78a3e1c5125009ae05dbdb853afca6790 I try to sketch out how this can be recorded as templates in the CWL, and then add these to the PROV. There is an issue in what to call these (here cwlprov:relationships) and how to reference the variables to fill in at execution time (here using a direct reference #inputs.first_input).
But this leads to fairly misleading information in cwlprov --print-rdf, in that it would claim the output parameter definition has a "relationship" to an anonymous object, which then "is derived from" (or whatever property is used) an input parameter definition. This is acceptable if we think of the input/output parameter as a "superobject" of every object that passes through it, as in every file object being a prov:specializationOf the parameters it is input or output at.
(this is like saying Stian is a specialisation of CustomerOfTesco because I went shopping at Tesco once)
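To make the "superobject" reading concrete, here is a small Python sketch. It is only an illustration under assumptions: `value_level_triples` is a hypothetical helper, the port identifiers follow the `#inputs.…` style used in the sketch, and the UUIDs are reused from the example above. It emits a prov:specializationOf statement for each value that passed through a port, then restates each port-level relationship between the concrete values.

```python
PROV = "http://www.w3.org/ns/prov#"


def value_level_triples(port_relationships, port_values):
    """Derive value-level statements from port-level relationship templates."""
    triples = []
    # Each run-time value is a specialisation of the port it passed through.
    for port, value in sorted(port_values.items()):
        triples.append((value, PROV + "specializationOf", port))
    # Restate each port-level relationship between the concrete values.
    for subject_port, prop, object_port in port_relationships:
        triples.append((port_values[subject_port], prop, port_values[object_port]))
    return triples


# Hypothetical bindings recorded at execution time
port_values = {
    "#outputs.first_output": "urn:uuid:a1626deb-a5a8-4b84-803e-8dd51f80bf2d",
    "#inputs.second_input": "urn:uuid:6e076c8b-d3fe-47f0-844b-b0e1561d3181",
}
port_relationships = [
    ("#outputs.first_output", PROV + "wasDerivedFrom", "#inputs.second_input"),
]

triples = value_level_triples(port_relationships, port_values)
for triple in triples:
    print(triple)
```

With the value-level statement available, the port-level one only has to be read as a template, which avoids claiming that one parameter definition is derived from another.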
See also PROV-Template, which uses a special var namespace for pre-existing variables; we could bind these directly to the input/output objects using existing CWL Expressions (e.g. $(inputs.message) -> var:inputs.message).
Here are some of the mappings we should be able to do: https://gist.github.com/stain/f0b0d966a103b1533d684aa6d7197364
The data concepts are often more complex expressions than pure typing from the EDAM ontology or BioSchemas, so we may need to support expressions that go more than one triple level deep, as explored here and in #1.
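The mapping from a CWL parameter reference to a template variable could be sketched like this in Python. `to_template_var` is a hypothetical helper, and the sketch assumes only simple dotted references such as `$(inputs.message)` need handling, not full JavaScript expressions or indexing.

```python
import re

# Simple dotted parameter reference, e.g. $(inputs.message)
PARAM_REF = re.compile(r"\$\(([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)\)")


def to_template_var(expression):
    """Map a simple CWL parameter reference to a var: prefixed name."""
    match = PARAM_REF.fullmatch(expression.strip())
    if match is None:
        raise ValueError(f"not a simple parameter reference: {expression!r}")
    return "var:" + match.group(1)


mapped = to_template_var("$(inputs.message)")
print(mapped)
```

Anything more complex than a dotted path is rejected rather than guessed at, since it is not clear how such expressions should map to variable names.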