7844 codemeta schema

Open poikilotherm opened this issue 4 years ago • 11 comments
What this PR does / why we need it: This is adding the CodeMeta Schema as a default out of the box schema for (new) installations. This pull request is a first step. Please see the discussion points below for your review. We need to be careful about the scope of this first step to keep compatibility in mind. (There is no schema migration present in the Dataverse application, so when changing data types etc, we need to write SQL database migrations manually!)

**TODOs**

  • [ ] Test TSV, make screenshots
  • [ ] Sort out remaining questions (see below)

Which issue(s) this PR closes:

Closes #7844

Special notes for your reviewer:

  1. Should we use the W3C proposed vocabulary for applicationCategory?
    • Should we go ahead and add ResearchApplication to this list and reach out to schema.org and CodeMeta people to push for adding it to the list? (Maybe Google, too?)
    • Should we go ahead and reach out to CodeMeta about a field on scientific method used in the software? (Not covered by subject field, which is very coarse anyway)
  2. Should we make the *Requirements fields use integer values of byte? kilobyte? megabyte? (or similar for CPU) instead of arbitrary text values?
  3. Do we want to add docs about the crosswalk of "Dataverse Metadata" to "CodeMeta" to the guides?
  4. What other docs do we want to include?
  5. Do we want to add https://github.com/SoftwareUnderstanding/software_types (which would extend this beyond pure CodeMeta)
  6. Do we want to add a field to allow documenting computational methods in use?
    • There is no standard, vocabulary, schema or ontology for this yet, we'd be on our own.
    • This might as well be done via Citation Blocks Keywords
    • We could leave this for a later extension of the block
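To illustrate question 2 above, here is a minimal sketch of what normalizing free-text requirement values to integer bytes could look like. The function name `to_bytes`, the unit table, and the choice of binary multiples are all assumptions made for illustration; the thread does not settle on a unit or a format.

```python
import re

# Hypothetical unit table using binary multiples; the discussion leaves the unit open.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def to_bytes(text):
    """Convert free text such as '4 GB' or '512MB' to an integer byte count.

    Returns None when the text cannot be interpreted, which is the main
    drawback of accepting arbitrary text values in the first place.
    """
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?b)\s*", text.lower())
    if match is None:
        return None
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])
```

Values that do not parse (for example "plenty of RAM") come back as `None`, which shows why an integer field with a fixed unit would be easier to validate than free text.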

Suggestions on how to test this:

  • Load the TSV via the usual API call.
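For readers unfamiliar with "the usual API call": metadata block TSV files are loaded by POSTing them to the admin endpoint `/api/admin/datasetfield/load` with the content type `text/tab-separated-values`. The sketch below only builds the request; the base URL `http://localhost:8080` and the file name `codemeta.tsv` are assumptions for a local test installation.

```python
import urllib.request


def build_load_request(tsv_bytes, base_url="http://localhost:8080"):
    """Build (but do not send) the POST request that loads a metadata block TSV.

    base_url is an assumption: a local installation with the admin API reachable.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/admin/datasetfield/load",
        data=tsv_bytes,
        method="POST",
        headers={"Content-Type": "text/tab-separated-values"},
    )


# Usage (needs a running installation; the file name is assumed):
# with open("codemeta.tsv", "rb") as handle:
#     request = build_load_request(handle.read())
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```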

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

  • Nope.

Is there a release notes update needed for this change?:

  • Yet to be done, as review/extension/discussion needed.

Additional documentation:

  • Block as TSV: https://docs.google.com/spreadsheets/d/1MsJifbLeRYCdFUXPAOh-KIphgTaxPsc6ICxtmEj_sb4/edit#gid=1781064623
  • Mapping CodeMeta to Dataverse software metadata fields: https://docs.google.com/spreadsheets/d/1zcOm1PX2_HTMacgc-sn8aSKTIQBtoaSAAKEPXmKVt3A/edit#gid=0

Tagging @doigl @atrisovic @4tikhonov @jggautier @djbrooke @pdurbin (I don't know the GH names of the other WG members)

poikilotherm avatar May 17 '21 13:05 poikilotherm

Coverage Status

Coverage remained the same at 19.326% when pulling fcc36d01dfac1f1e256ecf4bb2a88d220d939f0e on poikilotherm:7844-codemeta-schema into 2dbf9b70566c5a7a23a06058b206cf2e13f69ed6 on IQSS:develop.

coveralls avatar May 17 '21 13:05 coveralls

This is great! I'd like to point out two issues that I think are most pressing and I hope could be resolved before this is merged:

  1. I think there's a stray tab in line 11, applicationCategory, that's splitting its displayName into two columns:

    Screen Shot 2021-05-17 at 10 55 40 AM
  2. There are a few different ways that the Citation metadatablock's fields are still designed to describe data as opposed to software. It looks like we'll be tackling these issues in future work (such as how metadata is exported), but I hope some of the issues that users will see when depositing software can be resolved:

    • The tooltips for most of the fields, even fields that make sense for describing software, such as Title, include the word "Dataset".

      One solution might be to generalize the text in the tooltips of the fields in the Citation metadatablock, for example by replacing the word "Dataset" with "deposit".

    • Some of the fields wouldn't make sense for describing software at all, such as "Series", "Date of collection" and "Type of data". If someone is depositing software, I would think they wouldn't need to see these fields.

      To prevent depositors of software from seeing fields that they wouldn't need, one solution might be to recommend that repositories use a Dataverse collection only for software deposits, and when setting up that collection they should hide the fields in the Citation metadatablock that describe data (e.g. "Date of collection" and "Type of data") and enable the Software metadatablock for that Dataverse collection. So repositories, especially "self curated" ones, should not have users deposit a mix of datasets and software into the same Dataverse collection because there wouldn't be a way for the Dataverse software to know if what the user is depositing is data or software, so the Dataverse software has no way of showing the relevant metadata fields.
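A stray tab like the one in point 1 can be caught mechanically before loading the file. The following is a rough sketch, not part of the PR: it assumes that lines starting with `#` are section headers naming the columns and that trailing empty cells carry no meaning. The function name is hypothetical.

```python
def find_rows_wider_than_header(tsv_text):
    """Return (line_number, cell_count, header_count) for suspicious rows.

    A row with more cells than its section header usually means a tab
    slipped into a value and split it across two columns.
    """
    problems = []
    header_count = None
    for line_number, line in enumerate(tsv_text.splitlines(), start=1):
        cells = line.split("\t")
        # Ignore trailing empty cells, which spreadsheet exports often add.
        while cells and cells[-1] == "":
            cells.pop()
        if not cells:
            continue
        if cells[0].startswith("#"):
            header_count = len(cells)
        elif header_count is not None and len(cells) > header_count:
            problems.append((line_number, len(cells), header_count))
    return problems
```

Running it over the TSV text and printing the returned tuples points directly at the offending line numbers.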

jggautier avatar May 17 '21 15:05 jggautier

I once had the idea to actually make the citation block pluggable, non-mandatory. I know this requires A LOT, but maybe it's a way to go, if we don't want other architectural changes like abstracting the concept of sets.

However, this seems beyond scope. Thanks for the pointer for the description issue, I'll fix that right away.

poikilotherm avatar May 17 '21 17:05 poikilotherm

@doigl has some data about the candidates for displayOnCreate from DaRuS:

[image]

I agree on all of those, except for applicationCategory, which has been used not with the vocabulary from W3C but free text. I still think we should not do that.

poikilotherm avatar May 18 '21 11:05 poikilotherm

There is a list of programming languages in WikiData, containing ~1500 entries. (Via https://en.wikiversity.org/wiki/Research_in_programming_Wikidata/Programming_languages)

There is an extensive list of operating systems (not names alone) in WikiData with ~1100 entries. (Via https://en.wikiversity.org/wiki/Research_in_programming_Wikidata/Operating_systems)

We might wanna play with the OS query to select only instances that are not a subclass of another OS and not "based on" to gain the top level ones only.
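A possible shape for such a query, as a sketch only: the Wikidata identifiers used here (Q9135 for operating system, P31 for instance of, P279 for subclass of, P144 for based on) are assumptions that should be verified, and the filter is stricter than described because it drops any item with a subclass-of statement at all, not just subclasses of another OS.

```python
from urllib.parse import urlencode

# Assumed Wikidata identifiers: Q9135 = operating system, P31 = instance of,
# P279 = subclass of, P144 = based on. Verify before relying on them.
TOP_LEVEL_OS_QUERY = """
SELECT ?os ?osLabel WHERE {
  ?os wdt:P31 wd:Q9135 .
  FILTER NOT EXISTS { ?os wdt:P279 ?parent . }
  FILTER NOT EXISTS { ?os wdt:P144 ?base . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def query_url(query, endpoint="https://query.wikidata.org/sparql"):
    """Build a GET URL for the public Wikidata SPARQL endpoint (JSON results)."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})
```

The resulting URL can be opened in a browser or fetched with any HTTP client to see how many entries remain after filtering.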

poikilotherm avatar May 18 '21 11:05 poikilotherm

I checked on the autocomplete/filtering support for controlled vocabulary fields in compound fields. Here's what I found:

  1. For primitive fields using a CV we use a filter input in the dropdown in case of a "check multiple" metadata field, but not for single values. This has been done as part of #6000 / PR #6339 (I knew there was an old issue for this... :smile: )
  2. This change has not been introduced for "check multiple" in compound fields. No idea why. Tagging @mheppler here.
  3. The remaining issue of single value fields has never been addressed, but @TaniaSchlatter mentioned a few thoughts.

I guess adding the filter functionality to the "check multiple" fields in compound fields is an easy way forward. As this seems like a good discussion for Dataverse software decoupled from this issue about CodeMeta, I'm going to create that little issue now. :arrow_right_hook:#7888

After revisiting the schema, I see that the field operatingSystem is "allow multiple", but the (potential) CV field for the OS name would - of course - not be "check multiple". So we still need a solution for number 3 above, if we want this. :arrow_right_hook:#7889

poikilotherm avatar May 20 '21 10:05 poikilotherm

@poikilotherm as Codemeta is close to version 3.0 (https://blog.datacite.org/codemeta-we-need-your-feedback/), applicationCategory and scientific method are good topics to discuss now. Would the Dataverse community want them to become part of Codemeta?

And what is the timing for this pull request with regards to Codemeta 2.0 vs. Codemeta 3.0 (which is still a few months away)?

mfenner avatar May 27 '21 04:05 mfenner

@poikilotherm as Codemeta is close to version 3.0 (https://blog.datacite.org/codemeta-we-need-your-feedback/), applicationCategory and scientific method are good topics to discuss now. Would the Dataverse community want them to become part of Codemeta?

@mfenner I think there is a high demand for these fields not only within the boundaries of the Dataverse community. I know that @sdruskat is also looking into this matter for his PhD thesis.

Are you aware of any existing, reusable controlled vocabularies, preferably as RDF/SKOS/JSON-LD/sth. with a PID, we could reuse for a field like scientificMethod? Dataverse soonish will have support to use those kinds of sources within the UI (#7712).

And what is the timing for this pull request with regards to Codemeta 2.0 vs. Codemeta 3.0 (which is still a few months away)?

I'm not so sure about this. Maybe it would be a good start to create a schema for 2.0 now and upgrade to 3.0 later on. It's a rather low hanging fruit. It might become necessary to introduce a migration method in Dataverse, but this seems like a good addition beyond the CodeMeta use case.

poikilotherm avatar May 27 '21 11:05 poikilotherm

@poikilotherm I couldn't get this tsv to load without making a few changes. I put them in a pull request for you to review and perhaps merge: https://github.com/poikilotherm/dataverse/pull/553

pdurbin avatar Jul 21 '22 19:07 pdurbin

Thanks @pdurbin!

Just today I picked up working on this again (not yet pushed).

There's lots of stuff to be moved around, which will also incorporate your changes😉

poikilotherm avatar Jul 21 '22 19:07 poikilotherm

Hi @doigl @atrisovic @4tikhonov @jggautier @pdurbin @adaybujeda @mmshad I just updated this PR with an improved block incorporating the latest news and decisions we had for this.

I also cleaned up the first comment to keep only the remaining questions (the old ones can still be found via the edit history). Please take a look. I guess we should discuss this on Community Slack and/or have a meeting of the WG again :smile:

poikilotherm avatar Jul 22 '22 08:07 poikilotherm

Can we work on a plan for active user testing of the metadata block after it's merged? The appendix section of the guides says that feedback via any channel is welcome, but I'm worried that this isn't enough to move this from the new experimental label.

I think the metadata block style guide should be used to improve the labels and tooltip texts for this metadata block, too (and using it can help improve that style guide). Maybe this can be done in another step after it's been merged, like as part of another GitHub issue?

jggautier avatar Nov 14 '22 16:11 jggautier

Thank you for your work on this! (sorry I know Dataverse only as a user ...)

  • Software license: would this be selected/specified outside of these new forms? In the "normal" forms?
  • What if a record/archive contains both data and code and the data has different license requirements (typically creative-commons) than software (typically not creative-commons). Would people be able to specify multiple licenses?

bast avatar Nov 14 '22 22:11 bast

* **Software license**: would this be selected/specified outside of these new forms? In the "normal" forms?

* What if a record/archive contains both data and code and the data has different license requirements (typically creative-commons) than software (typically not creative-commons). Would people be able to specify multiple licenses?

@bast raises an important point here, something we face in the Citation File Format as well. The latter currently supports multiple licenses, albeit without scope. @schlauch suggested the use of SPDX expressions. So for a mixed data + software deposit, this could be expressed as something like MIT AND CC-BY-4.0. License scope is hard to solve, but can be done in the deposit artifacts as well using, e.g., REUSE.
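To make the SPDX idea concrete, here is a small sketch that pulls the license identifiers out of a simple expression such as `MIT AND CC-BY-4.0`. It is not a full SPDX parser: it only understands `AND`, `OR` and parentheses, and the function name is made up for illustration.

```python
import re

# Only the two basic operators are handled; WITH (license exceptions) is not.
OPERATORS = {"AND", "OR"}


def license_ids(expression):
    """List the identifiers in a simple SPDX expression, in order, without duplicates."""
    # Parentheses and whitespace are skipped because they do not match the pattern.
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    seen = []
    for token in tokens:
        if token not in OPERATORS and token not in seen:
            seen.append(token)
    return seen
```

For the mixed deposit mentioned above, `license_ids("MIT AND CC-BY-4.0")` yields both identifiers, but nothing in the expression says which license applies to which files. That is the scope problem.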

sdruskat avatar Nov 15 '22 08:11 sdruskat

Can we work on a plan for active user testing of the metadata block after it's merged? The appendix section of the guides says that feedback via any channel is welcome, but I'm worried that this isn't enough to move this from the new experimental label.

@jggautier I hope to see this metadata block enabled on https://demo.dataverse.org! We (project HERMES) are planning on creating demo workflows with that instance and make good use of this metadata block!

I think the metadata block style guide should be used to improve the labels and tooltip texts for this metadata block, too (and using it can help improve that style guide). Maybe this can be done in another step after it's been merged, like as part of another GitHub issue?

Absolutely. I wasn't aware of this guide, but let's use it to polish things. I love the idea of making incremental steps, but if you feel like creating a review with suggestions, we could even incorporate changes upfront!

  • Software license: would this be selected/specified outside of these new forms? In the "normal" forms?

@bast this PR is about a metadata block. Licenses in Dataverse are a completely separate functionality and will not (and shall not) be affected by this PR.

  • What if a record/archive contains both data and code and the data has different license requirements (typically creative-commons) than software (typically not creative-commons). Would people be able to specify multiple licenses?

As @sdruskat laid out, this is a tricky thing on multiple different layers. I fear a solution to this is far beyond the possibilities and scope of this PR adding missing fields for CodeMeta support. You are very welcome to join the Dataverse Metadata Working Group meetings - we have been discussing #8512 last time, which goes somewhat into the direction of your question.

poikilotherm avatar Nov 15 '22 12:11 poikilotherm

Thanks @poikilotherm.

It makes sense that the metadata guidelines weren't considered. Work on this CodeMeta metadata block came way before these metadata guidelines were released.

I don't have time right now to apply the guidelines to the metadatablock and wouldn't want to hold up progress on it. So I agree it can be done incrementally, especially since the metadatablock is being labelled as experimental. These guidelines also need to be applied to other metadata fields that already ship with Dataverse, so I'll open a GitHub issue about that.

@jggautier I hope to see this metadata block enabled on https://demo.dataverse.org/! We (project HERMES) are planning on creating https://github.com/hermes-hmc/workflow/issues/59 with that instance and make good use of this metadata block!

So you hope to use Demo Dataverse with this metadatablock enabled on it in order to do testing with and collect feedback from depositors, curators, etc as part of project HERMES? Would you be able to share the results of this testing with the community? Then maybe the community could discuss what needs to be done for the metadata block to not be labelled as experimental.

jggautier avatar Nov 28 '22 16:11 jggautier

Hi @jggautier!

It makes sense that the metadata guidelines weren't considered. Work on this CodeMeta metadata block came way before these metadata guidelines were released.

Agreed! :wink:

I don't have time right now to apply the guidelines to the metadatablock and wouldn't want to hold up progress on it. So I agree it can be done incrementally, especially since the metadatablock is being labelled as experimental. These guidelines also need to be applied to other metadata fields that already ship with Dataverse, so I'll open a GitHub issue about that.

Sounds good to me!

@jggautier I hope to see this metadata block enabled on https://demo.dataverse.org/! We (project HERMES) are planning on creating hermes-hmc/workflow#59 with that instance and make good use of this metadata block!

So you hope to use Demo Dataverse with this metadatablock enabled on it in order to do testing with and collect feedback from depositors, curators, etc as part of project HERMES? Would you be able to share the results of this testing with the community? Then maybe the community could discuss what needs to be done for the metadata block to not be labelled as experimental.

Yes, we'd love to see that! It's always good to demonstrate workflows and it would be lovely if we can push to a "research software ready" Dataverse installation (and others can try it, too)!

@pdurbin @mreekie I just added release notes to this PR and merged latest develop. As we have consensus among the community (some folks reached out, no showstoppers), the Metadata Working Group and IQSS, can we please go ahead and move this towards the SPRINT columns? It's small, very low risk, and was developed on HERMES funding. Thank you!

poikilotherm avatar Nov 30 '22 08:11 poikilotherm

This is a TSV file. The only other thing changed is schema.xml. For experimental blocks, do we pre-fill schema.xml?

Caution here

  • do we want to pre-fill the schema.xml?
  • In the past we got community pushback when items were there that they did not want.
  • If a customer runs the update script, it might clean this up. But not everyone likes to run that script.

Next Step:

  • Discuss with Oliver the idea of putting it in schema.xml

mreekie avatar Dec 12 '22 16:12 mreekie

We did for the experimental workflows metadata.

This is meant to become a major feature, Dataverse being a research software ready repo. The block is meant to mature and people should adopt it so we can make it fabulous.

poikilotherm avatar Dec 12 '22 17:12 poikilotherm

Huh. @poikilotherm is right. We did update schema.xml when we added the experimental computational workflow metadata block:

  • #8812

pdurbin avatar Dec 12 '22 20:12 pdurbin

As discussed just now, the PR looks good overall.

We'd like to back out the schema.xml change.

(We'll make a PR to back out the schema.xml change for computational workflow as well, for consistency.)

Seems like @jggautier has given his blessing, especially since it's experimental.

Giving it a size of 10. This does not include deploying to demo.

pdurbin avatar Dec 13 '22 16:12 pdurbin

@pdurbin @mreekie I just pushed the necessary changes to revert the addition to the schema. Also updated to latest develop. Dunno why the RTD CI fails, but seems unrelated.

poikilotherm avatar Dec 13 '22 17:12 poikilotherm

(We'll make a PR to back out the schema.xml change for computational workflow as well, for consistency.)

Chop chop here we go https://github.com/IQSS/dataverse/pull/9225

poikilotherm avatar Dec 13 '22 19:12 poikilotherm

added to sprint Dec 15, 2022

mreekie avatar Dec 14 '22 21:12 mreekie

I was pinged a while back but thought I should reply now that I finally found the time to answer after the winter break.

We'd like to back out the schema.xml change.

(We'll make a PR to back out the schema.xml change for computational workflow as well, for consistency.)

Seems like @jggautier has given his blessing, especially since it's experimental.

I'm not sure what the schema.xml change was and how that's related to this being experimental. Is that what I gave my blessing to? Is the effect of the schema.xml change that this won't be a default metadatablock in future Dataverse installations? Does that mean that experimental, as it's been used for this and the workflow metadatablock, means that it'll be included in a release but the feature won't be turned on by default in Dataverse installations?

I agree about more feedback earlier in the process (and @poikilotherm has been using many opportunities over the years to encourage feedback), and I'd like to add that I think it's important to plan, as early in the process as possible, for evaluating solutions after they've been merged, too, even more so if we're so uncertain about a solution that we label it experimental.

jggautier avatar Jan 12 '23 17:01 jggautier

@jggautier you probably missed the discussion but to sum up, only changes to non-experimental blocks should result in a change to schema.xml.

That is, schema.xml contains fields for all the blocks that we ship. All these blocks are enabled by default and will "just work" because schema.xml has the fields already.

I hope this helps. This whole experimental blocks concept is quite new, of course!

pdurbin avatar Jan 13 '23 20:01 pdurbin

Ah thanks. That's how I understood it. Experimental metadatablocks shouldn't be enabled in installations by default when those installations use the version of the software that includes that experimental metadatablock. Those installations will need to take extra steps to enable it.

It's just not clear to me how a metadatablock becomes not experimental.

jggautier avatar Jan 13 '23 20:01 jggautier

It hasn't happened yet! 😄 I hope we find out with CodeMeta!

pdurbin avatar Jan 13 '23 20:01 pdurbin