
Feature Request: Simplify dataset metadata JSON files for dataset creation or import

Open DS-INRAE opened this issue 1 year ago • 5 comments

Overview of the Feature Request: Remove superfluous elements from the dataset creation JSON file.

What kind of user is the feature intended for? API User

What inspired the request? JSON files are long, complex and intimidating for new users.

What existing behavior do you want changed? Remove the need for the following attributes in the dataset JSON files:

  • typeClass for metadata fields
  • multiple for metadata fields
  • displayName for metadatablocks

JSON files comparison: Current Darwin's Finches JSON for the fields title, author, datasetContact, dsDescription, subject:

{
  "datasetVersion": {
    "license": {
      "name": "CC0 1.0",
      "uri": "http://creativecommons.org/publicdomain/zero/1.0"
    },
    "metadataBlocks": {
      "citation": {
        "fields": [
          {
            "value": "Darwin's Finches",
            "typeClass": "primitive",
            "multiple": false,
            "typeName": "title"
          },
          {
            "value": [
              {
                "authorName": {
                  "value": "Finch, Fiona",
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "authorName"
                },
                "authorAffiliation": {
                  "value": "Birds Inc.",
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "authorAffiliation"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "author"
          },
          {
            "value": [
              {
                "datasetContactEmail": {
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "datasetContactEmail",
                  "value": "[email protected]"
                },
                "datasetContactName": {
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "datasetContactName",
                  "value": "Finch, Fiona"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "datasetContact"
          },
          {
            "value": [
              {
                "dsDescriptionValue": {
                  "value": "Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds.",
                  "multiple": false,
                  "typeClass": "primitive",
                  "typeName": "dsDescriptionValue"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "dsDescription"
          },
          {
            "value": [
              "Medicine, Health and Life Sciences"
            ],
            "typeClass": "controlledVocabulary",
            "multiple": true,
            "typeName": "subject"
          }
        ],
        "displayName": "Citation Metadata"
      }
    }
  }
}

Simplified JSON file:

{
  "datasetVersion": {
    "license": {
      "name": "CC0 1.0",
      "uri": "http://creativecommons.org/publicdomain/zero/1.0"
    },
    "metadataBlocks": {
      "citation": {
        "fields": [
          {
            "value": "Darwin's Finches",
            "typeName": "title"
          },
          {
            "value": [
              {
                "authorName": {
                  "value": "Finch, Fiona",
                  "typeName": "authorName"
                },
                "authorAffiliation": {
                  "value": "Birds Inc.",
                  "typeName": "authorAffiliation"
                }
              }
            ],
            "typeName": "author"
          },
          {
            "value": [
              {
                "datasetContactEmail": {
                  "typeName": "datasetContactEmail",
                  "value": "[email protected]"
                },
                "datasetContactName": {
                  "typeName": "datasetContactName",
                  "value": "Finch, Fiona"
                }
              }
            ],
            "typeName": "datasetContact"
          },
          {
            "value": [
              {
                "dsDescriptionValue": {
                  "value": "Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds.",
                  "typeName": "dsDescriptionValue"
                }
              }
            ],
            "typeName": "dsDescription"
          },
          {
            "value": [
              "Medicine, Health and Life Sciences"
            ],
            "typeName": "subject"
          }
        ]
      }
    }
  }
}
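For illustration only, the difference between the two files can be sketched as a small client-side transformation. This is not part of Dataverse: the function names `strip_field` and `simplify` are hypothetical, and the sample document is a shortened version of the JSON above. It only shows which attributes the request proposes to drop; per this issue, the API still expects them today.

```python
import json

# Attributes this feature request proposes to make unnecessary.
FIELD_ATTRS = {"typeClass", "multiple"}   # on metadata fields
BLOCK_ATTRS = {"displayName"}             # on metadata blocks


def strip_field(field):
    """Return a copy of one metadata field without typeClass/multiple (hypothetical helper)."""
    slim = {k: v for k, v in field.items() if k not in FIELD_ATTRS}
    value = slim.get("value")
    if isinstance(value, list):
        # Compound fields hold a list of dicts whose values are child fields;
        # primitive and controlled-vocabulary lists hold plain strings.
        slim["value"] = [
            {name: strip_field(child) for name, child in item.items()}
            if isinstance(item, dict) else item
            for item in value
        ]
    elif isinstance(value, dict):
        # A non-repeating compound field holds a single dict of child fields.
        slim["value"] = {name: strip_field(child) for name, child in value.items()}
    return slim


def simplify(dataset):
    """Return a simplified copy of a native dataset JSON document (hypothetical helper)."""
    version = dict(dataset["datasetVersion"])
    blocks = {}
    for block_name, block in version.get("metadataBlocks", {}).items():
        slim_block = {k: v for k, v in block.items() if k not in BLOCK_ATTRS}
        slim_block["fields"] = [strip_field(f) for f in block.get("fields", [])]
        blocks[block_name] = slim_block
    version["metadataBlocks"] = blocks
    return {**dataset, "datasetVersion": version}


# Shortened sample taken from the "current" JSON above.
full = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "fields": [
                    {"value": "Darwin's Finches", "typeClass": "primitive",
                     "multiple": False, "typeName": "title"},
                    {"value": [{"authorName": {"value": "Finch, Fiona",
                                               "typeClass": "primitive",
                                               "multiple": False,
                                               "typeName": "authorName"}}],
                     "typeClass": "compound", "multiple": True,
                     "typeName": "author"},
                ],
                "displayName": "Citation Metadata",
            }
        }
    }
}

slim = simplify(full)
print(json.dumps(slim, indent=2))
```

The input document is left untouched; only the returned copy is simplified.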

Are you thinking about creating a pull request for this feature?
Even if this would help increase API adoption, we have other priorities at the moment.

DS-INRAE avatar Oct 23 '24 09:10 DS-INRAE

Note: a more radical simplification would be very interesting, but hopefully this would be an easier quick win.

DS-INRAE avatar Oct 23 '24 09:10 DS-INRAE

Note that the metadata input for the semantic API would look like (using a (~standard) @context for readability):

{
  "title": "Darwin's Finches",
  "author": {
    "citation:authorName": "Finch, Fiona",
    "citation:authorAffiliation": "Birds Inc."
  },
  "citation:datasetContact": {
    "citation:datasetContactName": "Finch, Fiona",
    "citation:datasetContactEmail": "[email protected]"
  },
  "citation:dsDescription": {
    "citation:dsDescriptionValue": "Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds."
  },
  "subject": "Medicine, Health and Life Sciences",
  "@context": {
    "author": "http://purl.org/dc/terms/creator",
    "citation": "https://dataverse.org/schema/citation/",
    "subject": "http://purl.org/dc/terms/subject",
    "termName": "https://schema.org/name",
    "title": "http://purl.org/dc/terms/title"
  }
}

or, even shorter,

{
  "http://purl.org/dc/terms/title": "Darwin's Finches",
  "http://purl.org/dc/terms/creator": {
    "https://dataverse.org/schema/citation/authorName": "Finch, Fiona",
    "https://dataverse.org/schema/citation/authorAffiliation": "Birds Inc."
  },
  "https://dataverse.org/schema/citation/datasetContact": {
    "https://dataverse.org/schema/citation/datasetContactName": "Finch, Fiona",
    "https://dataverse.org/schema/citation/datasetContactEmail": "[email protected]"
  },
  "https://dataverse.org/schema/citation/dsDescription": {
    "https://dataverse.org/schema/citation/dsDescriptionValue": "Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds."
  },
  "http://purl.org/dc/terms/subject": "Medicine, Health and Life Sciences"
}

qqmyers avatar Oct 23 '24 10:10 qqmyers
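As a rough illustration of how the two JSON-LD forms in the comment above relate, the sketch below replaces each compact key with its full URI using the `@context`. It is a deliberately naive term and prefix substitution, not the JSON-LD expansion algorithm (a real JSON-LD processor does considerably more), and the helper names `expand_key` and `expand` are hypothetical.

```python
def expand_key(key, context):
    """Map a compact key to a full URI using a flat @context (naive sketch)."""
    if key in context:                      # plain term, e.g. "title"
        return context[key]
    prefix, sep, suffix = key.partition(":")
    if sep and prefix in context:           # prefixed name, e.g. "citation:authorName"
        return context[prefix] + suffix
    return key                              # already a full URI, or unknown


def expand(node, context):
    """Recursively rewrite keys and drop the @context entry itself."""
    if isinstance(node, dict):
        return {
            expand_key(k, context): expand(v, context)
            for k, v in node.items()
            if k != "@context"
        }
    if isinstance(node, list):
        return [expand(item, context) for item in node]
    return node


# Shortened sample taken from the compact form above.
compact = {
    "title": "Darwin's Finches",
    "author": {"citation:authorName": "Finch, Fiona"},
    "subject": "Medicine, Health and Life Sciences",
    "@context": {
        "author": "http://purl.org/dc/terms/creator",
        "citation": "https://dataverse.org/schema/citation/",
        "subject": "http://purl.org/dc/terms/subject",
        "title": "http://purl.org/dc/terms/title",
    },
}

expanded = expand(compact, compact["@context"])
for key in expanded:
    print(key)
```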

This is what I've suggested to @JR-1991, who has slides ready about the gnarly, complicated native format: try the semantic API. 😄

See also discussion here:

  • #3068

pdurbin avatar Oct 23 '24 14:10 pdurbin

@pdurbin, it is on my bucket list 😁 Can this also be passed to the dataset creation/edit endpoint?

JR-1991 avatar Oct 23 '24 15:10 JR-1991

@JR-1991 well, you have to pass 'Content-Type: application/ld+json'. Please see the guides: https://guides.dataverse.org/en/6.4/developers/dataset-semantic-metadata-api.html

pdurbin avatar Oct 23 '24 15:10 pdurbin
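A minimal sketch of what passing that header could look like from Python, using only the standard library. The `Content-Type: application/ld+json` value comes from the comment above; the endpoint path, the `X-Dataverse-key` header, the server URL, the collection alias, and the token are assumptions or placeholders here and should be checked against the linked guide. The request is only built, not sent.

```python
import json
import urllib.request


def build_create_request(server_url, collection_alias, api_token, metadata):
    """Build (but do not send) a dataset-creation request carrying JSON-LD metadata."""
    # Assumed endpoint path; verify it against the semantic metadata API guide.
    url = f"{server_url}/api/dataverses/{collection_alias}/datasets"
    body = json.dumps(metadata).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # This header is what selects the semantic (JSON-LD) input format.
            "Content-Type": "application/ld+json",
            # Assumed API token header.
            "X-Dataverse-key": api_token,
        },
    )


# Placeholder values for illustration only.
metadata = {"http://purl.org/dc/terms/title": "Darwin's Finches"}
req = build_create_request("https://demo.example.org", "root", "xxxx-token", metadata)
print(req.get_method(), req.full_url)

# To actually send it (needs a real server and token):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```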

Please, please, please create a JSON schema for any rework of the metadata and use something like https://rjsf-team.github.io/react-jsonschema-form/ to enforce it on the UI side. Not having a JSON schema for the current JSON metadata used to create a dataset is extremely frustrating.

kuhlaid avatar Dec 17 '24 19:12 kuhlaid

@kuhlaid that makes total sense. Please see this issue:

  • https://github.com/IQSS/dataverse-pm/issues/26

It was split into these sub-issues:

  • https://github.com/IQSS/dataverse/issues/9463
  • https://github.com/IQSS/dataverse/issues/9464
  • https://github.com/IQSS/dataverse/issues/9465
  • https://github.com/IQSS/dataverse/issues/10169

Which resulted in these pull requests:

  • https://github.com/IQSS/dataverse/pull/10109
  • https://github.com/IQSS/dataverse/pull/10543

The latest docs are here:

  • https://guides.dataverse.org/en/6.5/api/native-api.html#retrieve-a-dataset-json-schema-for-a-collection
  • https://guides.dataverse.org/en/6.5/api/native-api.html#validate-dataset-json-file-for-a-collection

Do those docs help? Thanks!

pdurbin avatar Dec 17 '24 19:12 pdurbin

I guess what I was looking for was the 'dataset-schema.json' file (which is impossible to find using the Sphinx docs search). I'm fairly certain this schema does not sufficiently define the metadata that is allowed to be used. The UI uses very explicit elements such as Author Identifier Type and does not seem to allow for values outside of the defined elements in the UI dropdown list. If that is the case then any 'out of bounds' data should be well defined within the schema. If Author Identifier Type for example is limited to ORCID, ISNI, etc. then those should probably be enumerated within the schema. The current 'dataset-schema.json' file is missing details on explicit elements found in the UI.

kuhlaid avatar Dec 17 '24 20:12 kuhlaid
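To make the point about enumeration concrete, here is a small sketch of what an enumerated value list in a schema fragment could look like and how a client could check against it. The fragment is hypothetical and is not taken from 'dataset-schema.json'; it lists only the two values named in the comment above (ORCID, ISNI), while the real dropdown contains more.

```python
# Hypothetical JSON Schema style fragment for the "Author Identifier Type" value.
# Only the two values named in the comment are listed; the real list is longer.
IDENTIFIER_TYPE_SCHEMA = {
    "type": "string",
    "enum": ["ORCID", "ISNI"],
}


def check_enum(value, schema):
    """Return True if the value is allowed by the fragment's enum (or no enum is set)."""
    allowed = schema.get("enum")
    return allowed is None or value in allowed


for candidate in ("ORCID", "MyOwnScheme"):
    status = "allowed" if check_enum(candidate, IDENTIFIER_TYPE_SCHEMA) else "rejected"
    print(candidate, status)
```

With values enumerated like this, a generic JSON Schema validator or a schema-driven form could reject out-of-list values before the dataset JSON is ever submitted.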

@kuhlaid yeah, my fear is that what we're offering is not complete enough. As you can see, all those issues and PRs above have been merged. Would you be able to open a fresh issue explaining what would be helpful to you?

pdurbin avatar Dec 17 '24 20:12 pdurbin