
Make use of Frictionless Table Schema to store metadata

peterdesmet opened this issue 3 years ago · 10 comments

Suggestion: rather than using a custom format to store metadata about fields and their data types, it might be worth looking into the Frictionless Table Schema. It is a specification to store information about tabular data as a json (or potentially yaml) file. The elements that are similar could be borrowed and it can be extended with the properties that are specifically needed for git2rdata.

Here's a snippet from an example (taken from https://github.com/inbo/datapackage/blob/b049504a1396bfddf7af7e595f5b856da02375d0/inst/extdata/datapackage.json#L133-L154):

          {
            "name": "count",
            "type": "integer",
            "constraints": {
              "required": false,
              "minimum": 1
            }
          },
          {
            "name": "age",
            "type": "string",
            "constraints": {
              "required": false,
              "enum": [
                "adult",
                "subadult",
                "juvenile",
                "offspring",
                "undefined"
              ]
            }
          }

— peterdesmet, Jun 21 '21 14:06

Discussed on June 29. A non-invasive option for git2rdata would be a function that generates a datapackage.json file, which effectively makes a collection of file.tsvs a Data Package.

  • The file.ymls and file.tsvs are untouched. The file.ymls are ignored by any Data Package readers, but can be used by git2rdata.
  • User provides list of tsv files that should be resources in the Data Package
  • User can provide some high level metadata, e.g. name, license, description
  • datapackage.json is not kept in sync with later changes; the function has to be run again.
  • Function uses metadata in file.ymls to create Data Resource and Table Schema information in datapackage.json
  • Function could be provided by datapackage or git2rdata, depending on which depends on which (avoiding circular dependencies).

Example:

dep.tsv
dep.yml
obs.tsv
obs.yml

Call function:

make_datapackage(name = "my-dataset", license = "CC0-1.0", resources = c("dep.tsv", "obs.tsv"))

Result:

dep.tsv
dep.yml
obs.tsv
obs.yml
datapackage.json

With datapackage.json:

{
  "name": "my-dataset",
  "profile": "tabular-data-package",
  "licenses": { "name": "CC0-1.0" },
  "resources": [
    {
      "name": "dep",
      "path": "dep.tsv",
      "dialect": ...,
      "schema": ...
    },
    {
      "name": "obs",
      "path": "obs.tsv",
      "dialect": ...,
      "schema": ...
    }
  ]
}

— peterdesmet, Jun 29 '21 13:06

Seems like a good idea to me.

It appears that the git2rdata .yml files would be kept out of the datapackage? While their main use is specific to R (variable type, git2rdata version etc.), they also define factor levels, which are only represented as numerical indices in the .tsv. Are they contained in the json file as well?

— florisvdh, Jun 29 '21 13:06

git2rdata .yml files would be kept out of the datapackage?

No, they can remain there.

Factor levels can be expressed as an enum in a Data Package, but that only works if the values in the data file are the factor levels, not the factor indices.
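For example, with the "age" field from the snippet earlier in this thread, a Table Schema enum constraint would look like this (a sketch; what git2rdata would actually generate is hypothetical):

```json
{
  "name": "age",
  "type": "string",
  "constraints": {
    "enum": ["adult", "subadult", "juvenile", "offspring", "undefined"]
  }
}
```

A verbose column with values like adult or juvenile satisfies this constraint; an optimized column with indices like 1 or 3 would fail validation.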

— peterdesmet, Jun 29 '21 16:06

tsv seems to be an unknown file format to the general public. Therefore I'm thinking of using it only with write_vc(optimize = TRUE). As the optimized format requires either the git2rdata package or an expert to read it, the tsv format is not an issue there. For the version with write_vc(optimize = FALSE) we have IMHO two options: 1) keep it tab-delimited but use the .txt file extension, or 2) switch to csv with , as separator and . as decimal point.

— ThierryO, Oct 11 '21 10:10

IMHO tab-delimited data poses fewer reading and interpretation problems than csv, since tabs are rarely (if ever) used within data values (strings), while commas and semicolons often occur in them (and both are used as separators in csv). So I think the choice to use tsv was a smart one!

Maybe the discussion then is about the file extension. Is a .tsv extension recognized as a tabular data file in Windows? One could make the choice for .txt / .tsv based on that.

— florisvdh, Oct 11 '21 10:10

The main problem is that some users don't recognise the tsv format, and hence don't know what to do with it. Which file format are people more likely to recognise? And do they know how to open / import that format into e.g. Excel?

— ThierryO, Oct 11 '21 10:10

The most recognizable is .csv. It is my preferred option for data, especially because the extension is almost synonymous with "data". I would reserve its use for data that are indeed comma-delimited. @florisvdh I think using commas as a delimiter is fine: most programs handle "-escaped data values well.

If you want to stick with tab-delimited, you could opt for .tsv or .txt. Both will be read out of the box by Excel (unlike tab-delimited .csv), but with the typical date and number handling issues that Excel has. GBIF downloads are tab-delimited .txt files. I think the main downside of using .txt is that it does not imply "data" the way .csv does.

— peterdesmet, Oct 11 '21 11:10

Admittedly, I also like the rarely used .tsv format for saving data with git2rdata, but for another reason than @florisvdh: users don't know it, so they are less eager to open and edit the files manually. And if they for some reason add data to a data repo without using write_vc(), they will use .csv, making these manually added files easily distinguishable. So the less-known format prevents uninformed users from messing up files that were generated by git2rdata, while it remains easy to use for informed users.

With uninformed users, I mean users that don't know git2rdata but do work with R a lot. People that hardly use R don't know .csv either, so for them it makes no difference: you have to explain anyway how they can open it. But users that know R very well and are not aware that the git2rdata format is used to save the data may mess up a whole data repo, since they are used to working with .csv without git2rdata. Of course it is possible to remove these commits afterwards, but it saves a lot of time (and frustration) for both maintainer and user if the user notices beforehand that these are not just .csv files. (I have already been in this situation, where a coworker fortunately asked for some information beforehand, only because she was not familiar with .tsv.) And as always: don't assume people read the documentation before contributing, so I think it is good to use a less-known format for git2rdata as a wake-up call for contributors.

— ElsLommelen, Oct 11 '21 11:10

@ElsLommelen the idea is to have two flavours of data formats. The optimized version remains .tsv and is intended for hardcore users who prefer efficiency over human readability. The non-optimized version is intended for cases where the file should be easy to read by a larger audience. Therefore I'll switch to .csv in that case.

Note that changes made outside of git2rdata are detected, regardless of the file format (tsv or csv). When you place the files under version control, you can always revert the changes made by a user. Note that updating the data without adding or removing variables or changing their order is possible by design.

— ThierryO, Nov 03 '21 18:11

I think the frictionless R package is now mature enough (submitted for peer review, CRAN submission after that) to take the next step in implementing the function suggested in this issue in git2rdata.

See https://github.com/ropensci/git2rdata/issues/66#issuecomment-870600599 for the initial design discussion. With frictionless, a datapackage.json with the correct data types could be added as follows:

library(frictionless)
library(git2rdata)
library(magrittr)

# Create a data frame
df_original <- data.frame(
  id = c(1L, 2L),
  timestamp = c(
    as.POSIXct("2020-03-01 12:00:00", tz = "EET"),
    as.POSIXct("2020-03-01 18:45:00", tz = "EET")
  ),
  life_stage = factor(c("adult", "adult"), levels = c("adult", "juvenile"))
)

# Write to vc
git2rdata::write_vc(df_original, "df")

# Read with vc
df_returned <- git2rdata::read_vc("df")

# Create Frictionless Package
package <-
  create_package() %>%
  add_resource(
    resource_name = "df",
    data = "df.tsv",
    schema = create_schema(df_returned), # Use df_returned to pass on all data type properties
    delim = "\t"
  )

# Write Frictionless Data Package to disk
write_package(package) # This will not overwrite existing files

This can be wrapped in a function where one provides the data resources to be bundled:

make_datapackage(name = "my-dataset", license = "CC0-1.0", resources = c("dep.tsv", "obs.tsv"))
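For the df example above, the Table Schema that ends up in datapackage.json would look roughly like this (a sketch; the exact properties create_schema() emits may differ):

```json
{
  "fields": [
    { "name": "id", "type": "integer" },
    { "name": "timestamp", "type": "datetime" },
    {
      "name": "life_stage",
      "type": "string",
      "constraints": { "enum": ["adult", "juvenile"] }
    }
  ]
}
```

The enum is derived from the factor levels of life_stage, which is why passing df_returned (with its restored factors) to create_schema() matters.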

— peterdesmet, Jan 17 '22 16:01