
Referencing sections of a document

Open simleo opened this issue 3 years ago • 4 comments

While converting a cwltool --provenance RO to a Workflow Run RO-Crate, I'm faced with the problem of referring to individual workflow steps. The workflow is stored in "packed" form, meaning that the tools that implement each step are stored in the same packed.cwl document as the workflow. For the packed form, CWL uses the URI fragment syntax to assign IDs to the steps and the workflow itself; in this case, they are:

  • Workflow: #main
  • First step ("rev"): #main/rev
  • Second step ("sorted"): #main/sorted

The workflow appears in the crate as a data entity with an @id of packed.cwl, so I decided to add the tools as SoftwareApplication entities with @id packed.cwl#rev and packed.cwl#sorted (whether this is correct is another matter: should they be packed.cwl#main/rev and packed.cwl#main/sorted instead?). Using fragments here seems quite reasonable, since the secondary resource is certainly "some portion or subset of the primary resource". However, should the tools be considered contextual entities or data entities? At first I tried to add them as contextual entities:

crate.add(SoftwareApplication(crate, instrument_id, properties={
    "name": instrument_id,
}))

Leading to:

{
    "@id": "packed.cwl",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "hasPart": [
        {"@id": "#packed.cwl#rev"},
        {"@id": "#packed.cwl#sorted"}
    ],
    ...
},
...

This does not really seem to work, because of the leading # in the tool IDs. (ro-crate-py automatically adds a leading hash mark to contextual entity IDs if they're not full URIs; I'm not sure this is a MUST in the RO-Crate spec, but it's at least implied.) So I tried adding them as data entities:

crate.add(DataEntity(crate, instrument_id, properties={
    "@type": "SoftwareApplication",
    "name": instrument_id,
}))

Leading to:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "packed.cwl"},
        {"@id": "packed.cwl#rev"},
        {"@id": "packed.cwl#sorted"},
        ...
    ],
    ...
},
{
    "@id": "packed.cwl",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "hasPart": [
        {"@id": "packed.cwl#rev"},
        {"@id": "packed.cwl#sorted"}
    ],
    ...
},
...

I think this is more correct since section IDs have a document_id "#" fragment structure. However, having packed.cwl#rev and packed.cwl#sorted listed in the crate's hasPart seems a bit weird. The current spec says "where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property". However, these are not files, but file sections, and would still be linked indirectly (via packed.cwl) if removed from the crate's hasPart. Therefore, I think the spec should say that such "sections" MAY be listed.
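The "linked indirectly" point can be checked mechanically. The following is a minimal sketch, not part of ro-crate-py: `unreachable_from_root` is a hypothetical helper, and the small `graph` list is an assumed, trimmed-down stand-in for the crate's `@graph`, with the section IDs removed from the root's hasPart.

```python
def unreachable_from_root(graph, candidates, root_id="./"):
    """Return the candidate @ids that cannot be reached from the root
    entity by following hasPart links (directly or indirectly)."""
    by_id = {entity["@id"]: entity for entity in graph}
    seen, stack = set(), [root_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parts = by_id.get(current, {}).get("hasPart", [])
        if isinstance(parts, dict):  # JSON-LD allows a single value
            parts = [parts]
        stack.extend(part["@id"] for part in parts)
    return [c for c in candidates if c not in seen]


# Assumed example graph: the sections are listed only under packed.cwl.
graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "packed.cwl"}]},
    {
        "@id": "packed.cwl",
        "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        "hasPart": [{"@id": "packed.cwl#rev"}, {"@id": "packed.cwl#sorted"}],
    },
]

missing = unreachable_from_root(graph, ["packed.cwl#rev", "packed.cwl#sorted"])
```

With this graph, `missing` is empty: both sections are still reachable from the root through packed.cwl.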

I've made use of the workflow step example throughout the above discussion, but it actually generalizes to referencing sections of a document of any kind, when the document is part of the crate.

simleo, Apr 27 '22 10:04

so I decided to add the tools as SoftwareApplication entities with @id packed.cwl#rev and packed.cwl#sorted (whether this is correct is another matter: should they be packed.cwl#main/rev and packed.cwl#main/sorted instead?)

If used, it should be packed.cwl#main/rev and packed.cwl#main/sorted; there is neither a #rev nor a #sorted in that document.

mr-c, Apr 28 '22 14:04

Discussed at today's RO-Crate meeting:

  • Add them as Contextual entities
  • Python Library: don't add a leading hash if there's already one in the id
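The second point could look roughly like this. It is only a sketch, not ro-crate-py's actual code: `contextual_id` is a hypothetical helper, and the rule that any identifier already containing a "#" is left untouched is an assumption about the intended behaviour.

```python
from urllib.parse import urlsplit


def contextual_id(identifier: str) -> str:
    """Hypothetical helper: normalize the @id of a contextual entity."""
    # Absolute URIs (anything with a scheme) are kept as they are.
    if urlsplit(identifier).scheme:
        return identifier
    # Assumption: an id that already contains a hash mark, such as
    # "packed.cwl#main/rev" or "#rev", must not get another one.
    if "#" in identifier:
        return identifier
    # Plain local names still get the leading hash mark.
    return "#" + identifier
```

Under these assumptions, "packed.cwl#main/rev" is returned unchanged, while a bare name such as "rev" becomes "#rev".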

simleo, Apr 28 '22 20:04

Right, packed.cwl#main/rev would be the way to refer to #main/rev within packed.cwl - CWL is unusual in that it has slash-based fragments, but this is also possible with XPath selectors for XML docs.
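As a small illustration, the standard library already handles slash-based fragments. This sketch uses `urllib.parse.urldefrag` to split such an identifier; the variable names are illustrative only, and splitting the fragment on "/" is an assumption that holds for this two-level CWL example.

```python
from urllib.parse import urldefrag

# Split the identifier into the document and its fragment.
document, fragment = urldefrag("packed.cwl#main/rev")

# The fragment itself contains a slash: workflow id, then step name.
workflow_id, _, step_name = fragment.partition("/")

# Rebuilding the entity @id from the document and the CWL id "#main/rev".
entity_id = document + "#" + workflow_id + "/" + step_name
```

Here `document` is "packed.cwl" and `fragment` is "main/rev", so the slash stays inside the fragment and does not affect the path.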

We could still add a section about referencing parts of other documents (which may even be contextual entities in another RO-Crate, some other Linked Data document, or just a section in an HTML/PDF), to clarify that you can use any URI/URI Reference with # in identifiers of contextual entities.

stain, May 12 '22 07:05

There's now a part about document sections as part of profiles, though it's not quite the right section for what this issue talks about. It uses WebPageElement, see https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/data-entities.html#adding-detailed-descriptions-of-encodings

stain, Oct 04 '24 01:10