Affiliations entered in affiliation fields are parenthesized in "Datacite" and Schema.org exports

Open jggautier opened this issue 2 years ago • 10 comments

When does this issue occur? When Dataverse creates "Datacite" and Schema.org metadata exports for datasets that have values in a few Affiliation fields in the Citation metadatablock

Which page(s) does it occur on? Metadata exports and OAI-PMH feed

What happens? The affiliation metadata that depositors add to their datasets, e.g. Author Affiliation, Point of Contact Affiliation, Producer Affiliation, appears in the "Datacite" and Schema.org exports wrapped in parentheses.

The "Datacite" export has these affiliation fields:

  • Author Affiliation
  • Point of Contact Affiliation
  • Producer Affiliation

The Schema.org export has this affiliation field:

  • Author Affiliation

To whom does it occur (all users, curators, superusers)? All users. It probably affects search, such as when using facets to narrow search results

What did you expect to happen? The affiliation metadata would appear in the exports without the added parentheses

Which version of Dataverse are you using? 5.12.1

Any related open or closed issues to this bug report? The issues related to using an algorithm to guess if the names entered in the author metadata field are people or organizations: https://github.com/IQSS/dataverse/issues/7349 and https://github.com/IQSS/dataverse/issues/5029. Will the PR to address those issues, https://github.com/IQSS/dataverse/pull/9089, remove the parentheses? I think it might since the Schema.org exports that QDR's Dataverse fork creates already use the algorithm, and in their Schema.org exports, author affiliations aren't wrapped in parentheses, e.g. their Schema.org export at https://data.qdr.syr.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.5064/F6G3T1PF

Screenshots:

How the affiliations of the Author, Point of Contact, and Producer fields appear in the "Datacite" export of the dataset at https://doi.org/10.7910/DVN/MUJHGR (published in Harvard Dataverse):

  • Screen Shot 2023-01-26 at 11 10 22 AM
  • Screen Shot 2023-01-26 at 11 06 42 AM

How the affiliations of the Author field appear in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/MUJHGR (published in Harvard Dataverse):

  • Screen Shot 2023-01-26 at 11 08 21 AM
  • Screen Shot 2023-01-26 at 11 08 29 AM (Both the "author" and "creator" properties are used to repeat author metadata in the Schema.org export because of an experiment unrelated to this issue about parentheses. See https://github.com/IQSS/dataverse/issues/5029#issuecomment-456489325)

Definition of done: The affiliation metadata is not wrapped in parentheses when it appears in metadata exports
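One way to check an export against this definition of done is to scan it for affiliation values that start with "(" and end with ")". The sketch below is only an illustration, not Dataverse code. It assumes the Schema.org export is JSON with an `author` list whose items carry an `affiliation` that is either a plain string or an object with a `name`; the function names and the sample authors are made up for the example.

```python
import json


def is_parenthesized(value):
    """Return True if the trimmed value starts with "(" and ends with ")"."""
    text = value.strip()
    return len(text) >= 2 and text.startswith("(") and text.endswith(")")


def affiliation_name(affiliation):
    """Accept a plain string or an object with a "name" key (both shapes are assumptions)."""
    if isinstance(affiliation, dict):
        name = affiliation.get("name", "")
        return name if isinstance(name, str) else ""
    if isinstance(affiliation, str):
        return affiliation
    return ""


def parenthesized_author_affiliations(schema_org_json):
    """List author affiliations in a Schema.org export that are wrapped in parentheses."""
    doc = json.loads(schema_org_json)
    found = []
    for author in doc.get("author", []):
        name = affiliation_name(author.get("affiliation"))
        if is_parenthesized(name):
            found.append(name)
    return found


# Made-up sample data: one affiliation in the "old" string form with parentheses,
# one in an object form without them.
sample = json.dumps({
    "author": [
        {"name": "Example, Author", "affiliation": "(Purdue University)"},
        {"name": "Example, Coauthor",
         "affiliation": {"@type": "Organization", "name": "Purdue University"}},
    ]
})
print(parenthesized_author_affiliations(sample))
```

An empty list would mean no author affiliation in that export is wrapped in parentheses.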

jggautier avatar Jan 26 '23 16:01 jggautier

This bug with the parentheses exists in the Schema.org exports of older dataset versions but not in the Schema.org exports of more recently published dataset versions:

  • The latest version of the dataset at https://doi.org/10.7910/DVN/MUJHGR was published in 2016 and in its Schema.org export, the author affiliations have the parentheses.
  • The latest version of the dataset at https://doi.org/10.7910/DVN/FV9AVG was published in 2020 and in its Schema.org export, the author affiliations don't have parentheses.

As far as I can tell, in the Schema.org exports of all datasets published more recently, the author affiliations don't have parentheses.

I think this problem might be related to the discussion in https://github.com/IQSS/dataverse/issues/5144, where we talked about how to make sure that when we make changes to how Dataverse adds metadata to the DataCite metadata export, we ensure that the datasets published before those changes were made have their exports updated.

The same should be true for the Schema.org export and other exports. In the Schema.org export of a dataset published today, we can see changes that were made when v5.13 was applied to Harvard Dataverse. Those changes don't show up in the exports of the two datasets I mentioned earlier, and probably not in the exports of many other datasets in Harvard Dataverse whose latest versions were published before v5.13 was applied to Harvard Dataverse.

jggautier avatar Apr 04 '24 14:04 jggautier

This bug still exists in v6.2. Is it possible to fix it? As a result of this bug, the metadata of all DOIs registered with Datacite are also incorrect.

lmaylein avatar Jul 04 '24 14:07 lmaylein

Hi @lmaylein. Thanks for asking! I think that the more recent work described in the GitHub issue at https://github.com/IQSS/dataverse/issues/5889 will fix this bug. Specifically, the OpenAIRE export doesn't include these parentheses, so in a comment in that GitHub issue I proposed that the merged export also wouldn't include the parentheses around the affiliations of the Author metadata field. And I imagine that parentheses will not be included around the affiliations of the other fields that describe people or organizations, too, such as Point of Contact, Contributor, Producer, and Distributor.

jggautier avatar Jul 08 '24 17:07 jggautier

As far as I can tell, in the Schema.org exports of all datasets published more recently, the author affiliations don't have parentheses.

Is the fix to re-export datasets? https://guides.dataverse.org/en/6.3/admin/metadataexport.html#batch-exports-through-the-api

Do we know which PR fixed it, by removing the parentheses (if it is indeed fixed)?

pdurbin avatar Jul 08 '24 19:07 pdurbin

Schema.org was fixed in #9089. The problem for DataCite is that the displayValue for affiliation is sent to DataCite - see https://github.com/IQSS/dataverse/blob/a466c97d02e84160c75529b915bda5c664e38ec9/src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/XmlMetadataTemplate.java#L163. I'm addressing it in #10615, #10632 (which need updates), but it could be addressed separately, or ~worked around by removing the parens in the formatting at https://github.com/IQSS/dataverse/blob/a466c97d02e84160c75529b915bda5c664e38ec9/scripts/api/data/metadatablocks/citation.tsv#L13 and resending the metadata to DataCite using the API (and assuming display without parens is OK).
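To illustrate the mechanism described here: if a field's display format wraps the value, then sending the display value instead of the raw value carries the wrapping into the export. The sketch below assumes a display format of the form `(#VALUE)` with a `#VALUE` placeholder, and its function names are hypothetical. It is not the Dataverse implementation, which is written in Java.

```python
def display_value(display_format, raw_value):
    """Substitute the raw value into a display format; an empty format adds no decoration."""
    if not display_format:
        return raw_value
    return display_format.replace("#VALUE", raw_value)


def export_value(display_format, raw_value, use_display_value):
    """Pick which form of the value is written to an export."""
    if use_display_value:
        return display_value(display_format, raw_value)
    return raw_value


affiliation_format = "(#VALUE)"  # assumed display format for affiliation fields
raw = "University of California, Irvine"

# Sending the display value keeps the parentheses; sending the raw value does not.
print(export_value(affiliation_format, raw, use_display_value=True))
print(export_value(affiliation_format, raw, use_display_value=False))
```

Under these assumptions, either changing the display format or exporting the raw value removes the parentheses, which matches the two options mentioned in this comment.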

qqmyers avatar Jul 08 '24 19:07 qqmyers

Oh, the displayValue. Thanks.

Hmm, I assume the parens are there in the displayValue for a reason. That is, we probably shouldn't remove them.

@qqmyers I'm fine with waiting for one of your PRs above. If you address this bug in one of them, please use the normal "closes #9330" syntax so this issue goes through QA.

pdurbin avatar Jul 08 '24 20:07 pdurbin

@qqmyers now that #10632 is merged, should this issue be closed :slightly_smiling_face: ?

DS-INRAE avatar Oct 23 '24 12:10 DS-INRAE

Probably - in general, @jggautier is looking at which of the DataCite related issues can be closed and which need to be rescoped after #10632.

qqmyers avatar Oct 23 '24 12:10 qqmyers

Great, we'll see what Julian says when he gets to this one then :grinning:

DS-INRAE avatar Oct 23 '24 12:10 DS-INRAE

I had just stumbled upon the issue in our board and wondered if it had been forgotten

DS-INRAE avatar Oct 23 '24 12:10 DS-INRAE

Hey all. Affiliations entered in affiliation fields are still parenthesized in the DataCite and Schema.org exports of datasets published by some Dataverse repositories.

The DataCite and Schema.org exports of the dataset that I included as an example in this GitHub issue's first comment have affiliations that are wrapped in parentheses.

I checked the oldest dataset in each of the other 17 known Dataverse installations that are using v6.4 as of this writing. As far as I can tell the affiliations in their datasets' Schema.org exports don't have parentheses. But two of those installations also have datasets whose DataCite exports have Author and Point of Contact Affiliations that are wrapped in parentheses:

In Repositorio de Datos Abiertos de Investigación (Redata), the most recently created or updated datasets, and maybe all datasets in Redata, have DataCite exports that have Author and Point of Contact Affiliations that are wrapped in parentheses.

The oldest datasets in the other 15 installations that are on v6.4 either have affiliations that are not wrapped in parentheses or don't have Author Affiliation metadata. I haven't checked for Point of Contact and Producer Affiliations.

jggautier avatar Dec 16 '24 19:12 jggautier

2025/03/10

  • @scolapasta will follow up as per our discussion during Prioritization meeting (e.g., review the code/release notes and request devops to reexport HDV datasets)

cmbz avatar Mar 10 '25 14:03 cmbz

The release notes for 6.4 specifically say:

  1. Run reExportAll to update dataset metadata exports

This step is necessary because of changes described above for the Datacite and oai_dc export formats.

Below is the simple way to reexport all dataset metadata. For more advanced usage, please see the guides.

curl http://localhost:8080/api/admin/metadata/reExportAll

So any installation that is running 6.4 and has properly gone through the release process should not have any datasets with this issue.

@jggautier for Harvard Dataverse, were you aware of any? I assume we ran through this step, so would think that there is no devops needed here.

For other installations (you mentioned two that you found), they could be contacted and referred to the instruction above - as you can imagine, we don't have access to their admin APIs.

scolapasta avatar Mar 10 '25 17:03 scolapasta

Hey @scolapasta. The dataset at https://doi.org/10.7910/DVN/MUJHGR has a "Datacite" export and a Schema.org export with Author Affiliation values in parentheses.

The same is true for the dataset at https://doi.org/10.7910/DVN/YVTEFF, and I'd guess most datasets in Harvard Dataverse whose latest published versions were published before a certain date.

Is that what you meant?

jggautier avatar Mar 11 '25 18:03 jggautier

@jggautier So I went ahead and re-exported the first one and the new export is now correct (feel free to verify).
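For anyone who wants to try the same thing on a single dataset, a minimal sketch follows. It assumes the admin API's single-dataset re-export path is `/api/admin/metadata/:persistentId/reExportDataset` with a `persistentId` query parameter, and that the admin API is reachable from where the script runs (it is often restricted to localhost). The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def reexport_url(base_url, persistent_id):
    """Build the admin API URL that re-exports one dataset (path is an assumption, see above)."""
    # Keep ":" and "/" unescaped so the DOI stays readable in the URL.
    query = urlencode({"persistentId": persistent_id}, safe=":/")
    return f"{base_url.rstrip('/')}/api/admin/metadata/:persistentId/reExportDataset?{query}"


def reexport_dataset(base_url, persistent_id, timeout=60):
    """Send the request and return the HTTP status code."""
    with urlopen(reexport_url(base_url, persistent_id), timeout=timeout) as response:
        return response.status


# Only the URL is built here; calling reexport_dataset() needs access to the admin API.
print(reexport_url("http://localhost:8080", "doi:10.7910/DVN/MUJHGR"))
```

After a re-export, the dataset's exports can be fetched again to confirm that the parentheses are gone.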

So it does seem like we missed re-export when we released. Regardless I did confirm (as mentioned above) that it was in the release notes, so nothing to add there.

My suggestion is, rather than re-export all now, we wait for the 6.6 release, which will also require a re-export. That way we don't run this twice in such a short period.

scolapasta avatar Mar 11 '25 19:03 scolapasta

Cool yeah. I can see that the parentheses are gone now.

About other Dataverse installations, like I wrote in #5144, I think it's important that we understand why this is happening, and contacting folks from these other installations could be a good chance to learn why and might help us figure out how to make sure it happens less often and that repository users benefit from these changes.

I'll try to find time this week to contact other affected installations unless you let me know today or tomorrow that you'd like to instead.

jggautier avatar Mar 11 '25 19:03 jggautier

I used the APIs to find the oldest published datasets with Author Affiliation metadata in the 34 known Dataverse installations that are running Dataverse v6.4 or later, and to look at the Author Affiliation metadata in the Schema.org and "Datacite" exports of those datasets.
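The check on the "Datacite" side can be sketched roughly as follows. This is an illustration rather than the script that was actually used. It assumes the export is DataCite XML in which each `<creator>` element may contain `<affiliation>` elements, it ignores XML namespaces, and the sample document and creator names are made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the "{namespace}" prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def creator_affiliations(datacite_xml):
    """Return the text of every <affiliation> found inside a <creator> element."""
    root = ET.fromstring(datacite_xml)
    values = []
    for element in root.iter():
        if local_name(element.tag) != "creator":
            continue
        for child in element.iter():
            if local_name(child.tag) == "affiliation" and child.text:
                values.append(child.text.strip())
    return values


def summarize(affiliations):
    """Return (number wrapped in parentheses, total number of affiliations)."""
    wrapped = [a for a in affiliations if a.startswith("(") and a.endswith(")")]
    return len(wrapped), len(affiliations)


# Made-up sample: one affiliation with parentheses, one without.
sample = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <creators>
    <creator>
      <creatorName>Example, Author</creatorName>
      <affiliation>(Associatie KU Leuven)</affiliation>
    </creator>
    <creator>
      <creatorName>Example, Coauthor</creatorName>
      <affiliation>Associatie KU Leuven</affiliation>
    </creator>
  </creators>
</resource>"""

affiliations = creator_affiliations(sample)
print(affiliations, summarize(affiliations))
```

Running the same kind of check on each installation's oldest dataset gives one wrapped-versus-total count per installation.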

10 of these 34 installations have parentheses wrapping the Author Affiliation metadata in at least some of their Schema.org exports or "Datacite" exports or both.

These are the two installations with these parentheses in at least some of their Schema.org and "Datacite" exports:

dataverse.harvard.edu
	doi:10.7910/DVN/SPJAFZ - Last updated: 2015-04-11T08:26:42Z
	Author affiliations in Schema.org export:
		(University of California, Irvine)
		(University of California, Irvine)
		(Purdue University)
	Author affiliations in "Datacite" export:
		(University of California, Irvine)
		(University of California, Irvine)
		(Purdue University)

rdr.kuleuven.be
	doi:10.48804/V5CGMS - Last updated: 2022-05-02T15:18:18Z
	Author affiliations in Schema.org export:
		(Associatie KU Leuven)
	Author affiliations in "Datacite" export:
		(Associatie KU Leuven)

These are the eight installations with the parentheses in at least some of their "Datacite" exports but not in their Schema.org exports:

dataverse.rsu.lv
	doi:10.25143/FK2/HNMLHH - Last updated: 2020-12-08T10:06:27Z
	Author affiliations in "Datacite" export:
		(Department of Infectious diseases)
		(Department of Infectious diseases)
		(Department of Infectious diseases)
		(Department of Infectious diseases)

dataportal.ing.pan.pl
	doi:10.60871/INGPAN/HLKYO4 - Last updated: 2024-03-19T09:12:55Z
	Author affiliations in "Datacite" export:
		(Institute of Geological Sciences, Polish Academy of Sciences)
		(Institute of Geological Sciences, Polish Academy of Sciences)

redata.anii.org.uy
	doi:10.60895/redata/6PG4F4 - Last updated: 2024-11-08T15:17:06Z
	Author affiliations in "Datacite" export:
		(Universidad Católica del Uruguay)
		(Universidad Católica del Uruguay)
		(Universidad Católica del Uruguay)
		(Universidad Católica del Uruguay)
		(Universidad Católica del Uruguay)
		(Rootstrap)
		(Universidad Católica del Uruguay)
		(Nirakara Mindfulness Institute)
		(Administración Nacional de Educación Pública. Consejo de Formación en Educación)
		(Consejo de Formación en Educación)
		(Instituto de Investigaciones Biológicas Clemente Estable)

awf.rodbuk.pl
	doi:10.58145/AWF/KGFEN8 - Last updated: 2025-02-12T10:58:08Z
	Author affiliations in "Datacite" export:
		(Jagiellonian University)
		(University of Physical Culture in Krakow)
		(University of Physical Culture in Krakow)
		(University of Physical Culture in Krakow)

uken.rodbuk.pl
	doi:10.24917/UKEN/FQQXYY - Last updated: 2024-04-19T12:42:23Z
	Author affiliations in "Datacite" export:
		(Uniwersytet Komisji Edukacji Narodowej w Krakowie)

agh.rodbuk.pl
	doi:10.58032/AGH/XAAEWN - Last updated: 2023-03-03T17:05:02Z
	Author affiliations in "Datacite" export:
		(AGH University of Science and Technology ; Faculty of Computer Science, Electronics, and Telecommunications)

pk.rodbuk.pl
	doi:10.58099/pk/19SLYQ - Last updated: 2023-06-22T11:44:47Z
	Author affiliations in "Datacite" export:
		(Cracow University of Technology, Library)
		(Cracow University of Technology, Library)

uek.rodbuk.pl
	doi:10.58116/UEK/52616Y - Last updated: 2024-04-15T09:26:03Z
	Author affiliations in "Datacite" export:
		(Krakow University of Economics)
		(Medical University of Silesia)

The five "rodbuk" installations might be managed by the same folks. So I think that comes to five groups managing these installations (rdr.kuleuven.be, dataverse.rsu.lv, dataportal.ing.pan.pl, redata.anii.org.uy, the rodbuk installations) and I'll contact those groups.

There are also 90 known Dataverse installations that are not running Dataverse v6.4 or a later version and that have published datasets as of today (March 12, 2025). When the managers of these 90 installations upgrade to Dataverse v6.4 or later, it's possible that they'll run into whatever challenges left the 10 installations listed above with "outdated" "Datacite" and "Schema.org" exports. Those outdated exports have parentheses wrapping their Author Affiliation metadata, and they may also be missing other changes made to these metadata exports in Dataverse versions that the installations haven't applied yet.

jggautier avatar Mar 12 '25 22:03 jggautier

Today I contacted:

  • Dieuwertje, who works on KU Leuven's repository (rdr.kuleuven.be)
  • Someone who let us know about the Rīga Stradiņš University Institutional Repository Dataverse (dataverse.rsu.lv)
  • @JacekChudzik, who let us know about:
    • the IGS PAS Data Portal (dataportal.ing.pan.pl)
    • the RODBUK repositories, such as awf.rodbuk.pl
  • Federico Yemurenko, who works on Repositorio de Datos Abiertos de Investigación (Redata)
  • Leonid, who helped apply the v6.4 update to Harvard Dataverse

I asked them to connect me with others who also helped update those repositories to v6.4. I also asked whether the Upgrade Instructions section of the v6.4 release notes was clear, whether they recall following those instructions, and, if so, whether any technical challenges prevented the exports from being updated.

jggautier avatar Mar 18 '25 17:03 jggautier

Summarizing what we've learned so far:

  • Dieuwertje Bloemen and Eryk Kulikowski shared that they upgraded KU Leuven RDR (rdr.kuleuven.be) from 6.2 straight to 6.5, so they may not have read the 6.4 instructions in as much depth as they thought they should have. They thought the steps to update the metadata exports were kind of hidden in points 9 and 10 of the upgrade procedure, and they wrote that adding a section, "run this after the upgrade", might help it stand out more. They also do not follow those instructions because they use the dockerized version and the upgrade instructions are written for installations not using docker.

  • About Harvard Dataverse, Leonid shared that while running re-export is a simple step for most Dataverse instances, for Harvard Dataverse it takes more time and resources. After v6.4 was released, he tried reexporting in smaller batches, which kept failing, and then stopped to focus on other priorities.

  • Federico Yemurenko shared that they skipped the instructions' reExportAll step to do it later and it never happened. And they thought the upgrade instructions were clear.

  • @JacekChudzik shared that the upgrade instructions were clear up to step 9 about reexporting. They made changes to their Citation metadata block and the citation.tsv file, which they wrote has sometimes made it difficult to apply Dataverse updates to their repository. So they decided to skip the steps in the instructions related to reexporting. Later on, @scolapasta helped them confirm that running the export step should still be safe, so they're working on reexporting.

  • I haven't heard from folks about the Rīga Stradiņš University Institutional Repository Dataverse.
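
For large installations where a single full re-export is too heavy, re-exporting in smaller batches while recording failures is one possible approach. The sketch below only illustrates that idea and is not what was run on Harvard Dataverse. The `reexport_one` callable, the batch size, the pause, and the sample PIDs are all hypothetical.

```python
import time


def reexport_in_batches(persistent_ids, reexport_one, batch_size=100, pause_seconds=0.0):
    """Call reexport_one(pid) for each PID, batch by batch; return the (pid, error) pairs that failed."""
    failed = []
    for start in range(0, len(persistent_ids), batch_size):
        batch = persistent_ids[start:start + batch_size]
        for pid in batch:
            try:
                reexport_one(pid)
            except Exception as error:  # keep going so one bad dataset does not stop the run
                failed.append((pid, str(error)))
        if pause_seconds and start + batch_size < len(persistent_ids):
            time.sleep(pause_seconds)  # give the server a break between batches
    return failed


def fake_reexport(pid):
    """Stand-in for a real API call; fails for one PID to show the error handling."""
    if pid == "doi:10.5072/FK2/BROKEN":
        raise RuntimeError("export failed")


# Made-up PIDs for illustration only.
pids = ["doi:10.5072/FK2/AAA000", "doi:10.5072/FK2/BROKEN", "doi:10.5072/FK2/CCC000"]
print(reexport_in_batches(pids, fake_reexport, batch_size=2))
```

The returned list can be retried later or inspected to see which datasets keep failing.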

I'm going to close this GitHub issue, although I'll edit this comment if we learn more from the folks I've contacted and if we find that other repositories that upgrade their Dataverse software are still affected by this affiliation parentheses bug.

And I'm sure we can create new GitHub issues if needed to address any of the points already raised by our colleagues.

jggautier avatar Mar 26 '25 17:03 jggautier