Use Case: Describe a tabular data file directly in RO-Crate metadata

Open eocarragain opened this issue 6 years ago • 39 comments

As a researcher working with tabular data, I want to be able to define the columns (description, data-type, valid values/ranges, etc.), so that I can provide a structured data dictionary.

Approaches elsewhere:

eocarragain avatar Aug 01 '19 10:08 eocarragain

Note: we may need a more general use-case for how to express sub-file/variable level metadata. Some concrete non-tabular examples would be good though

eocarragain avatar Aug 01 '19 10:08 eocarragain

Discussed this on Editor's call 2019-08-08, and agreed it would be good to use the schema.org flavour if possible, e.g. Dataspice, Psych-DS

eocarragain avatar Aug 08 '19 21:08 eocarragain

The table below compares the Frictionless Data tabular data specs with the schema.org variableMeasured property. It also shows the additional fields that the psych-ds team have added on top of their use of variableMeasured.

| table_schema | schema:variableMeasured | psych-ds |
|---|---|---|
| dialect | | |
| name | name | schema:name |
| description | description | schema:name |
| title | alternateName | |
| type | | type |
| type>rdfType | propertyId | |
| format | | |
| constraints>required | | |
| constraints>unique | | |
| constraints>minLength | minValue | schema:minValue |
| constraints>maxLength | maxValue | schema:maxValue |
| constraints>minimum | ~minValue | |
| constraints>maximum | ~maxValue | |
| constraints>pattern | | |
| constraints>enum | | levels |
| missingValues | | na/naValues |
| primaryKey | | |
| foreignKeys | | |
| ~type>rdfType | unitCode | schema:unitCode |
| ~type>rdfType | unitText | schema:unitText |
| | | derivation |
| | | imputation |

Notes:

  • table schema enumerates types (e.g. string, Boolean, number) and formats (e.g. email address, ISO8601) for fields, see https://frictionlessdata.io/specs/table-schema#types-and-formats. There is no equivalent in schema.org
  • table_schema doesn't have a direct equivalent of unitCode or unitText but type>rdfType could probably be used
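
The comparison above can be read as a one-way mapping. As a rough sketch only, the Python below translates a Frictionless field descriptor into a schema.org PropertyValue using just the rows that have a schema.org counterpart. The function names are hypothetical, and reading rdfType from the top level of the field descriptor is an assumption about how the descriptor is laid out.

```python
def field_to_property_value(field: dict) -> dict:
    """Map a Frictionless field descriptor to a schema.org PropertyValue.

    Only keys with a schema.org counterpart in the comparison table are kept.
    """
    constraints = field.get("constraints", {})
    candidates = {
        "name": field.get("name"),
        "description": field.get("description"),
        "alternateName": field.get("title"),
        # schema.org spells this property "propertyID"
        "propertyID": field.get("rdfType"),
        "minValue": constraints.get("minimum"),
        "maxValue": constraints.get("maximum"),
    }
    property_value = {"@type": "PropertyValue"}
    property_value.update({k: v for k, v in candidates.items() if v is not None})
    return property_value


def dropped_keys(field: dict) -> list:
    """List the descriptor keys that the mapping above cannot express."""
    mapped = {"name", "description", "title", "rdfType"}
    dropped = [key for key in field if key not in mapped and key != "constraints"]
    dropped += [
        f"constraints>{key}"
        for key in field.get("constraints", {})
        if key not in ("minimum", "maximum")
    ]
    return dropped
```

Calling dropped_keys on a field with a type and an enum constraint returns both of them, which is one way to see how much of table_schema has no home in variableMeasured.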

eocarragain avatar Oct 03 '19 18:10 eocarragain

This sounds a little like: https://www.w3.org/TR/tabular-data-primer/#string-restriction Why not reuse it? EDIT: Oh, I see you listed it above, but it covers all the constraints nicely...

dgarijo avatar Oct 03 '19 18:10 dgarijo

@dgarijo agreed "csvw" is probably the most complete rdf-friendly way to do this. It also has the benefit that Google seem to be adopting it in the dataset search. However, we received quite strong feedback at Open Repositories that CSVW was 'too complicated' for most researchers & coders to pick up and use easily.

There may be ways around this in terms of how we present it in the RO-Crate spec, i.e. just provide examples of the most common cases, more or less equivalent to table-schema?

EDIT: if we did this, the psych-ds community might be a good test group as they are clearly struggling with the fact that schema.org doesn't quite do what they need

eocarragain avatar Oct 03 '19 18:10 eocarragain

I don't think you need to adopt all of it, just the parts that cover your use cases (as you point out). In PROV we had like 3 main concepts and 8 relationships among them and people still said it was complicated...

dgarijo avatar Oct 03 '19 18:10 dgarijo

Example of what the schema.org approach would look like in an RO-Crate context:

{ "@context": "https://w3id.org/ro/crate/0.3-DRAFT/context",
  "@graph": [
  {
    "@id": "./",
    "@type": [
      "Dataset"
    ],
    "hasPart": [
      {
        "@id": "./table.csv"
      }
    ]
  },
  {
    "@id": "./table.csv",
    "@type": ["File", "Dataset"],
    "contentSize": "383766",
    "description": "A table capturing all my data",
    "variableMeasured": [
      {
        "@type": "PropertyValue",
        "unitText": "metres",
        "name": "wall_width",
        "description": "The width of the wall in metres"
      },
      {
        "@type": "PropertyValue",
        "unitCode": "CMT",
        "name": "wall_height",
        "description": "The height of the wall in centimetres"
      },
      {
        "@type": "PropertyValue",
        "name": "datetime",
        "description": "The date and time of the measurement"
      }
    ]
  }
  ]
}

Issue: in schema.org variableMeasured is only defined as a property of schema:Dataset, i.e. it cannot be used on an RO-Crate file as this maps to schema:MediaObject

EDIT: made the file a Dataset in the example above following @dgarijo's comments below

eocarragain avatar Oct 03 '19 18:10 eocarragain

Are they disjoint (I don't see anything about that in schema.org)? If not, I don't see the problem in using them.

dgarijo avatar Oct 03 '19 18:10 dgarijo

Would that mean making all ro-crate "files" be both schema:MediaObject and schema:Dataset?

eocarragain avatar Oct 03 '19 19:10 eocarragain

not all of them, just the ones you want to describe with those properties. A research object may contain many files. Some of them may be datasets. Some may be Slides, workflows, SoftwareApplications...

dgarijo avatar Oct 03 '19 19:10 dgarijo

Ok - made that change in the example above. Fact remains that schema.org doesn't cover a lot of common use cases for describing tabular data, so should we look at providing a simplified subset of CSVW more or less corresponding to table_schema?

eocarragain avatar Oct 03 '19 19:10 eocarragain

I have a naive question: if the tabular format is a standard one, described in some ontology (but not at this granularity level), what should we do?

jmfernandez avatar Oct 03 '19 20:10 jmfernandez

@dgarijo also mentions https://www.w3.org/TR/vocab-data-cube/

stain avatar Oct 03 '19 20:10 stain

I have a naive question: if the tabular format is a standard one, described in some ontology (but not at this granularity level), what should we do?

@stian suggested conformsTo or schema:additionalType (or maybe schema:schemaVersion)

eocarragain avatar Oct 03 '19 21:10 eocarragain

isatab is another example

eocarragain avatar Oct 03 '19 21:10 eocarragain

Hello, we are really interested in using RO-Crate for a project and this use case would also be really important for us. Is there any news on this in general, or on integrating one of the existing solutions listed above? Thanks!

LauraWalters avatar Feb 22 '21 14:02 LauraWalters

Thanks, @LauraWalters, for re-awakening this discussion - I've added this to the agenda for the RO-Crate Community Call this Thursday.

It would be good to hear more about your project's requirement on this, either in this issue or in the call.

Feel free to join if you have time, see #1 or https://s.apache.org/ro-crate-minutes for call details!

stain avatar Feb 22 '21 15:02 stain

Also worth looking at the GA4GH Search API specification, which includes a JSON-based table definition.

stain avatar Mar 02 '21 12:03 stain

@stain @LauraWalters @jmfernandez - just want to re-awaken this discussion.

Has anyone done this for RO-Crate?

I have a simple example I want to code up from here: https://github.com/JTrippas/Spoken-Conversational-Search

How should I turn their text description of columns in a CSV into something in RO-Crate? Or should I just create a text file with the text in it and link it as an encoding format?

ptsefton avatar Aug 23 '21 05:08 ptsefton

@stain the link to GA4GH Search API specification above is 404.

ptsefton avatar Aug 23 '21 05:08 ptsefton

@stain the link to GA4GH Search API specification above is 404.

@ptsefton I have had a look, and the repo and the target file were renamed. Here is a more stable link to the example: https://github.com/ga4gh-discovery/data-connect/blob/3a9be1fab628d0278eedcb5e70bb7d55f7d0a081/SPEC.md#table-discovery-and-browsing-examples

jmfernandez avatar Aug 23 '21 10:08 jmfernandez

From the spec pointed out by @stain, and from my point of view, a CSV/TSV can be described semantically in two parts: on one hand, the parameters needed to open it in R, Python or similar (encoding, column separator, comment character, etc.); on the other hand, an enumeration of the name, syntactic or semantic type, and logical position of each column.

EDIT: I have just read @ptsefton's answer at https://github.com/ResearchObject/ro-crate/issues/64#issuecomment-903470850 , and the W3C tabular metadata spec seems to cover all these points.
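
To make the two halves concrete, here is a minimal Python sketch using only the standard library. The layout of the description dictionary (a "dialect" part and a "columns" part) and the key names in it are assumptions made for illustration, not taken from any of the specs mentioned in this thread.

```python
import csv

# Hypothetical description: how to open the file, then what the columns are.
description = {
    "dialect": {"encoding": "utf-8", "delimiter": "\t", "commentPrefix": "#"},
    "columns": [
        {"name": "sample", "datatype": str},
        {"name": "depth_m", "datatype": float},
    ],
}


def read_table(raw: bytes, description: dict) -> list:
    """Open a delimited file using the dialect, then type each column."""
    dialect = description["dialect"]
    columns = description["columns"]
    text = raw.decode(dialect["encoding"])
    # Drop blank lines and comment lines before parsing.
    lines = [
        line
        for line in text.splitlines()
        if line and not line.startswith(dialect["commentPrefix"])
    ]
    reader = csv.reader(lines, delimiter=dialect["delimiter"])
    header = next(reader)
    expected = [column["name"] for column in columns]
    if header != expected:
        raise ValueError(f"header {header} does not match {expected}")
    # Columns are matched by logical position and converted to their type.
    return [
        {column["name"]: column["datatype"](value) for column, value in zip(columns, row)}
        for row in reader
    ]
```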

jmfernandez avatar Aug 23 '21 10:08 jmfernandez

@jmfernandez

How about we use W3C tabular metadata - but with its prefix, so we don't get confused with different definitions of name, for example.

Here's an example reworked from the example 2:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "@graph": [
    {
      "@id": "tree-ops.csv",
      "name": "Tree Operations",
      "keyword": ["tree", "street", "maintenance"],
      "publisher": {
      ...
      },
      "license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
      "dateModified": "2010-12-31",
      "csvw:tableSchema": {
        "csvw:columns": [{
          "csvw:name": "GID",
          "csvw:titles": ["GID", "Generic Identifier"],
          "description": "An identifier for the operation on a tree.",
          "csvw:datatype": "string",
          "csvw:required": true
        }, {
          "csvw:name": "on_street",
          "csvw:titles": "On Street",
          "description": "The street that the tree is on.",
          "csvw:datatype": "string"
        }, {
          "csvw:name": "species",
          "csvw:titles": "Species",
          "description": "The species of the tree.",
          "csvw:datatype": "string"
        }, {
          "csvw:name": "trim_cycle",
          "csvw:titles": "Trim Cycle",
          "description": "The operation performed on the tree.",
          "csvw:datatype": "string"
        }, {
          "csvw:name": "inventory_date",
          "csvw:titles": "Inventory Date",
          "description": "The date of the operation that was performed.",
          "csvw:datatype": {"base": "date", "format": "M/d/yyyy"}
        }],
        "csvw:primaryKey": "GID",
        "csvw:aboutUrl": "#gid-{GID}"
      }
    }
  ]
}
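
For a sense of what a consumer could do with such a schema, here is a hedged Python sketch that checks a CSV against the prefixed csvw:tableSchema keys used in the example. It is not a CSVW validator: it only looks at the header titles, csvw:required and csvw:primaryKey, and the function names are hypothetical.

```python
import csv
import io


def titles_of(column: dict) -> list:
    """Return the accepted header titles for a column (falls back to its name)."""
    titles = column.get("csvw:titles", column["csvw:name"])
    return [titles] if isinstance(titles, str) else list(titles)


def check_rows(csv_text: str, table_schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    columns = table_schema["csvw:columns"]
    names = [column["csvw:name"] for column in columns]
    required = [c["csvw:name"] for c in columns if c.get("csvw:required")]
    key = table_schema.get("csvw:primaryKey")
    problems = []

    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    if len(header) != len(columns):
        problems.append(f"expected {len(columns)} columns, found {len(header)}")
    for column, cell in zip(columns, header):
        if cell not in titles_of(column):
            problems.append(f"header {cell!r} is not a title of {column['csvw:name']}")

    seen = set()
    for number, row in enumerate(reader, start=1):
        values = dict(zip(names, row))
        for name in required:
            if not values.get(name):
                problems.append(f"row {number}: missing required value for {name}")
        if key:
            if values.get(key) in seen:
                problems.append(f"row {number}: duplicate {key} {values.get(key)!r}")
            seen.add(values.get(key))
    return problems
```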



ptsefton avatar Aug 26 '21 06:08 ptsefton

Yes, I agree, if the standard already exists, we should reuse it. And btw, it could be a nice example about using annotations based on third-party ontologies along with RO-Crate. We could even consider the inclusion of a list of useful standards / ontologies, depending on the use case.

jmfernandez avatar Aug 26 '21 11:08 jmfernandez

@ptsefton to have a go at reworking example with explicit @type and flattened JSON-LD. This can become a new page in the spec.
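
As a starting point for that rework, the following Python sketch shows one possible way to flatten a nested csvw:tableSchema into separate entities with an explicit @type. The @id naming scheme ("#schema-" plus the file id, "#" plus the column name) is an assumption chosen for illustration, as is the function name.

```python
def flatten_table(table: dict) -> list:
    """Split a nested table description into flat JSON-LD entities.

    Returns the file entity, one csvw:Schema entity and one csvw:Column
    entity per column, linked by {"@id": ...} references.
    """
    table = dict(table)  # shallow copies so the input is left untouched
    schema = dict(table.pop("csvw:tableSchema"))
    columns = schema.pop("csvw:columns")

    column_entities = []
    for column in columns:
        entity = {"@id": f"#{column['csvw:name']}", "@type": "csvw:Column"}
        entity.update(column)
        column_entities.append(entity)

    schema_id = f"#schema-{table['@id']}"
    schema_entity = {"@id": schema_id, "@type": "csvw:Schema"}
    schema_entity.update(schema)
    schema_entity["csvw:columns"] = [{"@id": e["@id"]} for e in column_entities]

    table["csvw:tableSchema"] = {"@id": schema_id}
    return [table, schema_entity, *column_entities]
```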

stain avatar Aug 26 '21 20:08 stain

Have tried this out.

A CSV file can have a schema

image


Here we see a column definition referencing one with a similar spelling with sameAs

image

"@graph": [
    {
      "@id": "#Action",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "The action the participant takes in that utterance, these actions are described in the code book and allow for reproduction of the results.",
      "name": "Action",
      "sameAs": {
        "@id": "#Code"
      }
    },
    {
      "@id": "#Actor_pair",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "13 different pairs completed three tasks. This column distinguishes the different pairs for each task (1-13)",
      "name": "Actor_pair"
    },
    {
      "@id": "#Code",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "The action the participant takes in that utterance, these actions are described in Trippas et al. (2020)",
      "name": "Code",
      "sameAs": {
        "@id": "#Action"
      }
    },
    {
      "@id": "#File.name",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Indicating the group number (2-14) and the date of the experiment.",
      "name": "File.name"
    },
    {
      "@id": "#Notes",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Comments such as the particular search is stopped by the user or researcher or extra notes which relate to the action of the participant regarding the search session. *not included in the \"SCSdataset.csv\"",
      "name": "Notes"
    },
    {
      "@id": "#Query",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "The reference to the information need participants are solving.",
      "name": "Query"
    },
    {
      "@id": "#Query.complexity",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "One of three levels, referencing the task complexity type (remember, understand, and analyse).",
      "name": "Query.complexity",
      "sameAs": {
        "@id": "#Query_complexity"
      }
    },
    {
      "@id": "#Query.counter",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "A counter which keeps track of how many turns there have been between the participants in that conversation. For the initial data release only the first two turns are given. However, the first three turns are presented if the second turn is classified under the Meta-communcation Theme (See CHIIR 2017 paper for further information).",
      "name": "Query.counter",
      "sameAs": {
        "@id": "#Query_counter"
      }
    },
    {
      "@id": "#Query_complexity",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "One of three levels, referencing the task complexity type (remember, understand, and analyse).",
      "name": "Query_complexity"
    },
    {
      "@id": "#Query_counter",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "A counter which keeps track of how many turns there have been between the participants in that conversation.",
      "name": "Query_counter",
      "sameAs": {
        "@id": "#Query.counter"
      }
    },
    {
      "@id": "#Role",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Which of the participants is talking in that particular utterance. The roles are annotated as A_User (participant who has the information need which needs to be solved) and B_Receiver (person who has access the the computer and search engine).",
      "name": "Role"
    },
    {
      "@id": "#Start.time",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Start time of the utterance.",
      "name": "Start.time",
      "sameAs": {
        "@id": "#Start_time"
      }
    },
    {
      "@id": "#Start_time",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "Start time of the utterance.",
      "name": "Start_time",
      "sameAs": {
        "@id": "#Start.time"
      }
    },
    {
      "@id": "#Stop.time",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Stop time of the utterance.",
      "name": "Stop.time",
      "sameAs": {
        "@id": "#Stop_time"
      }
    },
    {
      "@id": "#Stop_time",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "Stop time of the utterance.",
      "name": "Stop_time",
      "sameAs": {
        "@id": "#Stop.time"
      }
    },
    {
      "@id": "#Sub_themes",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "Subthemes based on codes as described in Trippas et al. (2020)",
      "name": "Sub_themes"
    },
    {
      "@id": "#Transcript",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Transcripts of the utterance of the particular user in that particular times lot.",
      "name": "Transcript",
      "sameAs": {
        "@id": "#Transcription"
      }
    },
    {
      "@id": "#Transcription",
      "@type": "csvw:Column",
      "csvw:datatype": "",
      "description": "Transcripts of the utterance of the particular user in that particular timeslot.",
      "name": "Transcription",
      "sameAs": {
        "@id": "#Transcript"
      }
    },
    {
      "@id": "#ffaa324f-bdec-4bf5-a260-62cc39580129",
      "@type": "Person",
      "affiliation": "https://ror.org/04ttjf776",
      "name": "Paul Thomas"
    },
    {
      "@id": "#schema-ConversationalSearchDataSet.csv",
      "@type": "csvw:Schema",
      "columns": [
        {
          "@id": "#Start.time"
        },
        {
          "@id": "#Stop.time"
        },
        {
          "@id": "#Query"
        },
        {
          "@id": "#Query.complexity"
        },
        {
          "@id": "#Role"
        },
        {
          "@id": "#Action"
        },
        {
          "@id": "#Transcript"
        },
        {
          "@id": "#Notes"
        },
        {
          "@id": "#Query.counter"
        },
        {
          "@id": "#File.name"
        }
      ],
      "name": "Schema for ConversationalSearchDataSet.csv"
    },
    {
      "@id": "#schema-SCSdata_v1.csv",
      "@type": "csvw:Schema",
      "columns": [
        {
          "@id": "#Start_time"
        },
        {
          "@id": "#Stop_time"
        },
        {
          "@id": "#Query"
        },
        {
          "@id": "#Query_complexity"
        },
        {
          "@id": "#Role"
        },
        {
          "@id": "#Sub_themes"
        },
        {
          "@id": "#Code"
        },
        {
          "@id": "#Query_counter"
        },
        {
          "@id": "#Transcript"
        },
        {
          "@id": "#Actor_pair"
        }
      ],
      "name": "Schema for SCSdata_v1.csv"
    },
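
One thing the flattened form makes easy is comparing two schemas. The Python below is a minimal sketch, assuming a graph shaped like the fragment above (schemas with a "columns" list of @id references, columns with "name" and an optional "sameAs"); the function names are hypothetical.

```python
def column_names(graph: list, schema_id: str) -> list:
    """Resolve a schema's column references to column names, in order."""
    by_id = {entity["@id"]: entity for entity in graph}
    return [by_id[ref["@id"]]["name"] for ref in by_id[schema_id]["columns"]]


def shared_columns(graph: list, schema_a: str, schema_b: str) -> list:
    """Pair columns of schema_a with the same or sameAs-linked columns of schema_b."""
    by_id = {entity["@id"]: entity for entity in graph}
    ids_b = {ref["@id"] for ref in by_id[schema_b]["columns"]}
    pairs = []
    for ref in by_id[schema_a]["columns"]:
        column = by_id[ref["@id"]]
        if column["@id"] in ids_b:
            # Both schemas reference the very same column entity.
            pairs.append((column["name"], column["name"]))
            continue
        same = column.get("sameAs", {}).get("@id")
        if same in ids_b:
            # Differently spelled column declared equivalent via sameAs.
            pairs.append((column["name"], by_id[same]["name"]))
    return pairs
```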


ptsefton avatar Sep 09 '21 06:09 ptsefton

Revisiting this as part of our work on the Text Commons RO-Crate profile.

Here's what we have now (including some new terms that are defined in a custom context)

A CSV file references a schema using the csvw:tableSchema property:


 {
      "@id": "files/427/original_bad0fd7f9c918df1db8b6a5b39faec48.csv",
      "@type": [
        "File",
        "Annotation"
      ],
      "name": "Transcript of interview with Patricia Colless full text transcription (CSV)",
      "encodingFormat": "text/csv",
      "annotationType": [
        {
          "@id": "olac:Transcription"
        },
        {
          "@id": "olac:TimeAligned"
        }
      ],
      "modality": {
        "@id": "olac:Orthography"
      },
      "annotationOf": {
        "@id": "files/503/original_779656ecdb38dfb06cee9440773692a7.mp3"
      },
      "language": {
        "@id": "https://www.ethnologue.com/language/eng"
      },
      "csvw:tableSchema": {
        "@id": "#dialog_schema"
      },
      "size": 54363
    },

{
      "@id": "#dialog_schema",
      "@type": "csvw:Schema",
      "name": "Table schema for dialogue transcript",
      "columns": [
        {
          "@id": "#speaker"
        },
        {
          "@id": "#transcript"
        },
        {
          "@id": "#start_time"
        },
        {
          "@id": "#notes"
        }
      ]
    },
    {
      "@id": "#speaker",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Which of the participants is talking in that particular utterance. ",
      "name": "speaker"
    },
    {
      "@id": "#transcript",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Transcription of speaker turn",
      "name": "text",
      "sameAs": {
        "@id": "olac:Transcription"
      }
    },
    {
      "@id": "#start_time",
      "@type": "csvw:Column",
      "description": "Start time of the utterance.",
      "name": "time",
      "sameAs": {
        "@id": "https://schema.org/startTime"
      }
    },
    {
      "@id": "#notes",
      "@type": "csvw:Column",
      "csvw:datatype": "string",
      "description": "Additional information",
      "name": "notes"
    },

This has some advantages over the schema.org approach suggested above by @eocarragain many moons ago and used by Science on Schema.org.

  • This approach explicitly links to the column names rather than just being a convention that the name of a PropertyValue matches a CSV header.
  • The schema.org approach uses variableMeasured, which is not always going to be a good semantic match with the contents of a column. We're not measuring variables in our example, we're transcribing a conversation.
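
The first point can be illustrated with a short Python sketch that follows the links from a file entity to its schema and columns, then compares the declared names with the real CSV header. It assumes a graph shaped like the entities above, and that each column's name is meant to equal the header cell; both function names are hypothetical.

```python
import csv
import io


def declared_header(graph: list, file_id: str) -> list:
    """Follow file -> csvw:tableSchema -> columns and return the column names."""
    by_id = {entity["@id"]: entity for entity in graph}
    schema = by_id[by_id[file_id]["csvw:tableSchema"]["@id"]]
    return [by_id[ref["@id"]]["name"] for ref in schema["columns"]]


def header_matches(graph: list, file_id: str, csv_text: str) -> bool:
    """True when the first CSV row equals the declared column names, in order."""
    actual = next(csv.reader(io.StringIO(csv_text)), [])
    return actual == declared_header(graph, file_id)
```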

On the other hand, the csvw spec is very complicated and very strict, and by bringing it into the schema.org world we're not really using it properly - the sameAs reference to schema.org terms feels like a bit of a hack.

Maybe we should aim to bring the best of csvw into schema.org? (And while we're at it we could include worksheet as a level of organization so we can deal with spreadsheets.)

ptsefton avatar May 19 '22 05:05 ptsefton

Including file content definitions is an important use case for our project. We've been working with concepts from the frictionless data framework to define file types that include many permutations of manually assembled and machine generated data files. A common scenario is for several different labs to produce assay data files that contain corresponding columns that could be aggregated for analysis, but there is no way to know that from the file headers. Using some of the concepts from frictionless, we define file types containing field descriptors, which can map to an rdf type so a data consumer will know which columns across various file types may be integrated. Though frictionless is geared toward tabular files, the field descriptors could be used to describe non-tabular data file contents as well.

Now that we are moving to RO Crates to package our metadata and files, we'd like to include these file type definitions in the crate metadata. Ideally, we'd like to be able to include a context entity for each file type and link these to the data files. The file type context entity would include the frictionless field descriptors. Following is an example of what this might look like (using "FrictionlessFileType" as a placeholder.) We are pretty new to RO Crates, so any advice is appreciated.

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "./",
      "@type": "Dataset",
      "datePublished": "2022-05-27T18:45:24+00:00",
      "hasPart": [
        {
          "@id": "study/Food_Intake_9.3.2020.csv"
        }
      ]
    },
    {
      "@id": "study/Food_Intake_9.3.2020.csv",
      "@type": "File",
      "contentSize": "27710",
      "name": "study/Food_Intake_9.3.2020.csv",
      "frictionlessFileType": {
        "@id": "food_intake_phenotype"
      }
    },
    {
      "@id": "food_intake_phenotype",
      "@type": "FrictionlessFileType",
      "encoding": "iso8859-1",
      "format": "csv",
      "hashing": "md5",
      "schema": {
        "fields": [
          {
            "id": "animal_diet",
            "name": "Diet",
            "type": "string",
            "description": "Animal diet",
            "rdfType": "http://www.ebi.ac.uk/efo/EFO_0002755",
            "constraints": {
              "required": "true",
              "enum": [ "Envigo HFHS", "10% fat + fiber", "6% fat" ]
            }
          },
          {
            "id": "animal_weight",
            "name": "Weight",
            "type": "number",
            "description": "Animal weight on day 0",
            "rdfType": "http://www.ebi.ac.uk/efo/EFO_0004338",
            "constraints": {
              "required": "true"
            }
          }
        ]
      }
    }
  ]
}
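
The aggregation scenario described above could then be sketched in Python as follows. This is only an illustration under the assumption that each file type entity carries a "schema" with "fields" as in the example; the function name is hypothetical.

```python
from collections import defaultdict


def columns_by_rdf_type(file_types: list) -> dict:
    """Group columns from several file types by their rdfType.

    Only rdfTypes that occur in more than one file type are returned,
    since those are the columns that may be integrated across files.
    """
    groups = defaultdict(list)
    for file_type in file_types:
        for field in file_type["schema"]["fields"]:
            rdf_type = field.get("rdfType")
            if rdf_type:
                groups[rdf_type].append((file_type["@id"], field["name"]))
    return {
        rdf_type: columns
        for rdf_type, columns in groups.items()
        if len({file_type_id for file_type_id, _ in columns}) > 1
    }
```

Given two file types from different labs that both tag a column with the same EFO term, the result maps that term to the two differently named columns, which is the information a data consumer cannot get from the file headers alone.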

a-mile avatar May 27 '22 18:05 a-mile

This is an interesting approach I think Abigail - structurally it has quite a similar topology to the csvw approach but the documentation for Frictionless data is much more approachable.

A couple of comments - for RO-Crate the graph needs to be flattened - so all the fields will have to be separate entities with a @type attribute, FrictionlessField or maybe fd:Field if we used a namespace. Also the IDs should be URIs, so either #animal_weight or an http URI if you want to re-use them.

The constraints part is also problematic as for RO-Crate that would also need to be a separate entity - but in an RO-Crate dialect that could be direct properties of the field.

It could look something like this, maybe:

{
  "@id": "#animal_weight",
  "@type": "fd:Field",
  "name": "Weight",
  "fd:type": "number",
  "description": "Animal weight on day 0",
  "fd:rdfType": "http://www.ebi.ac.uk/efo/EFO_0004338",
  "fd:required": "true"
}
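
A conversion along these lines could be sketched in Python as below. The fd: prefix and the decision to lift constraints up as direct properties follow the suggestion above; the function name and the "#" plus id convention are assumptions for illustration.

```python
def flatten_field(field: dict) -> dict:
    """Turn one nested Frictionless field into a flat fd:Field entity.

    name and description stay as plain properties, constraints become
    direct fd: properties, and everything else gets the fd: prefix.
    """
    entity = {"@id": "#" + field["id"], "@type": "fd:Field"}
    for key, value in field.items():
        if key == "id":
            continue
        if key == "constraints":
            for constraint, constraint_value in value.items():
                entity["fd:" + constraint] = constraint_value
        elif key in ("name", "description"):
            entity[key] = value
        else:
            entity["fd:" + key] = value
    return entity
```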

OR another approach would be to put the frictionless schema in a file or at a URL and reference it - that way we don't have to force it into JSON-LD and it should work with FD tools. I think this is probably the way to go.

ptsefton avatar May 30 '22 04:05 ptsefton

At the Language Data Commons of Australia we are taking the second approach I mentioned above, and implementing frictionless table schemas included as a data entity in an RO-Crate - initial documentation is here in the draft profile for language resources.


{
  "@id": "conversation1.csv",
  "@type": ["File"],
  "encodingFormat": "text/csv",
  "name": "Transcript of conversation 1",
  "conformsTo": {"@id": "arcp://name,ausnc.art/csv_schema"}
}

In the next entity, "@id" is a REPOSITORY-UNIQUE NAME and "sameAs" is a reference to the schema file (TODO: is this the best link?):

{
  "@id": "arcp://name,ausnc.art/csv_schema",
  "@type": "CreativeWork",
  "name": "Frictionless Table Schema for CSV transcription files in the ART corpus",
  "sameAs": "art_schema.json",
  "conformsTo": {"@id": "https://specs.frictionlessdata.io/table-schema/"}
}

{
  "@id": "artSchema",
  "@type": ["File"],
  "encodingFormat": "text/csv",
  "name": "Frictionless Table Schema file for CSV transcription files in the ART corpus",
  "conformsTo": {"@id": "https://specs.frictionlessdata.io/table-schema/"}
}
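
To show how a consumer might use these links, here is a rough Python sketch that follows conformsTo to the schema entity, loads the schema file named by sameAs, and compares its field names with the CSV header. It assumes sameAs holds a path relative to the crate root and that the schema file is a Frictionless Table Schema with a "fields" list; it uses only the standard library rather than any Frictionless tooling, and the function names are hypothetical.

```python
import csv
import json
from pathlib import Path


def schema_path_for(graph: list, file_id: str, crate_root: str) -> Path:
    """Follow file -> conformsTo -> sameAs to locate the schema file."""
    by_id = {entity["@id"]: entity for entity in graph}
    schema_entity = by_id[by_id[file_id]["conformsTo"]["@id"]]
    return Path(crate_root) / schema_entity["sameAs"]


def header_conforms(graph: list, file_id: str, crate_root: str) -> bool:
    """True when the CSV header equals the schema's field names, in order."""
    schema_file = schema_path_for(graph, file_id, crate_root)
    schema = json.loads(schema_file.read_text(encoding="utf-8"))
    expected = [field["name"] for field in schema["fields"]]
    with open(Path(crate_root) / file_id, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return header == expected
```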

ptsefton avatar Jul 28 '22 19:07 ptsefton