
6.1+/EZID: publish dataset fails when metadata contains ampersands

Open anarchivist opened this issue 1 year ago • 1 comment

What steps does it take to reproduce the issue?

  1. have an instance of Dataverse 6.1 or higher using EZID for DOI minting
  2. create a new dataset that has metadata containing an (unescaped) ampersand in it (i.e. & instead of &amp;).
  3. attempt to publish the dataset
  • When does this issue occur?

    • the error occurs when you try to publish the dataset
  • Which page(s) does it occur on?

    • edit dataset & view dataset
  • What happens?

    • publication fails
    • log messages like what's in server.log below
    • in Firefox and Chrome, the dataset page won't load because of an unescaped character
    • any existing citation metadata may become corrupted (I'm less certain about this, but ran into it) (edit: this may be the bug fixed by #10797)
  • To whom does it occur (all users, curators, superusers)?

    • curators/superusers
  • What did you expect to happen?

    • either publication to succeed or to have a clear error message

Which version of Dataverse are you using?

6.1

Any related open or closed issues to this bug report?

#3328, #3845, #7611

Are you thinking about creating a pull request for this issue?

Not at this point; the existing workaround of replacing ampersands with "and" will work for us.

server.log
[2024-09-09T14:16:45.442-0700] [Payara 6.2023.9] [WARNING] [] [edu.harvard.iq.dataverse.DOIEZIdServiceBean] [tid: _ThreadID=109 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1725916605442] [levelValue: 900]
 [[ modifyMetadata failed]]

[2024-09-09T14:16:45.442-0700] [Payara 6.2023.9] [WARNING] [] [edu.harvard.iq.dataverse.DOIEZIdServiceBean] [tid: _ThreadID=109 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1725916605442] [levelValue: 900]
 [[
  String edu.ucsb.nceas.ezid.EZIDException: bad request - error="ValidationError({'datacite': ['Metadata validation error: XML parse error: EntityRef: expecting \';\', line 6, column 40 (<string>, line 6). metadata="<?xml version="1.0" encoding="UTF-8"?>\n<resource xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://s
chema.datacite.org/meta/kernel-4/metadata.xsd"\n          xmlns="http://datacite.org/schema/kernel-4"\n   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n    <identifier identifierType="DOI">10.60503/D3/FFXLIL</identifier>\n    <creators><creator><creatorName>S&P Global</creatorName></creator></creators>\n    <titles>\n        <title>RateWatch Scholar</title>\n    </titles>\n    <publisher>UC Berkeley Library Dataverse</publisher>\n    <publicationYear>2024</publicationYear>\n    <resourceType resourceTypeGeneral="Dataset"/>\n \n    <descriptions>\n        <description descriptionType="Abstract">RateWatch Scholar offers the academic community information for U.S. financial institutions for research and analysis. Data covers over 96,000 branch locations, depending on time period and data type, all provided voluntarily. Data is gathered from institutions of all types and sizes, including banks, credit unions, savings and loan associations, etc. The RateWatch Historical data sets focus on retail products offered to the general public. Deposit rates data: 2001 - 2020 Loan rates data: 2022 Fee data:</description>\n    </descriptions>\n    <contributors><contributor contributorType="ContactPerson"><contributorName>Library Data Services Program</contributorName><affiliation>(UC Berkeley)</affiliation></contributor></contributors>\n</resource>"']})" Metadata: {'datacite': '<?xml version="1.0" encoding="UTF-8"?>\n'              '<resource '              'xsi:schemaLocation="http://datacite.org/schema/kernel-4 '              'http://schema.datacite.org/meta/kernel-4/metadata.xsd"\n'              '          xmlns="h
ttp://datacite.org/schema/kernel-4"\n'              '          '              'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'              '    <identifier '              'identifierType="DOI">10.60503/D3/FFXLIL</identifier>\n'              '    <creators><creator><creatorName>S&P '              'Global</creatorName></creator></creators>\n'              '    <titles>\n'              '        <title>RateWatch Scholar</title>\n'              '    </titles>\n'              '    <publisher>UC Berkeley Library Dataverse</publisher>\n'              '    <publicationYear>2024</publicationYear>\n'              '    <resourceType resourceTypeGeneral="Dataset"/>\n'              '    \n'              '    <descriptions>\n'              '        <description descriptionType="Abstract">RateWatch '              'Scholar offers the academic community information for U.S. '              'financial institutions for research and analysis. Data covers '              'over 96,000 branch locations, depending on time period and data '              'type, all provided voluntarily. Data is gathered from '              'institutions of all types and sizes, including banks, credit '              'unions, savings and loan associations, etc. The RateWatch '              'Historical data sets focus on retail products offered to the '              'general public. Deposit rates data: 2001 - 2020 Loan rates data: '              '2022 Fee data:</description>\n'              '    </descriptions>\n'              '    <contributors><contributor '              'contributorType="ContactPerson"><contributorName>Library Data '              'Services Program</contributorName><affiliation>(UC '              'Berkeley)</affiliation></contributor></contributors>\n'              '</resource>',  'datacite.resourcetype': 'Dataset'}]]
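The parse error in the log can be reproduced outside Dataverse and EZID. The following is a minimal sketch, assuming only the JDK's built-in XML parser; the class and method names (WellFormedCheck, isWellFormed) are hypothetical and not part of Dataverse. It checks the creatorName fragment from the log with and without the ampersand escaped.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not a Dataverse or EZID class.
class WellFormedCheck {

    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A bare '&' is a fatal well-formedness error and ends up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Fragment as it appears in the log: the ampersand is not escaped.
        System.out.println(isWellFormed("<creatorName>S&P Global</creatorName>"));
        // Same fragment with the ampersand escaped.
        System.out.println(isWellFormed("<creatorName>S&amp;P Global</creatorName>"));
    }
}
```

The first call should report false and the second true, which matches the "EntityRef: expecting ';'" complaint from the validator.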

anarchivist avatar Sep 09 '24 22:09 anarchivist

A cursory look at the code on the current develop branch makes me think there are no unit tests that check the escaping of XML, although there is a test edu.harvard.iq.dataverse.pidproviders.doi.datacite.XmlMetadataTemplateTest that checks simpler values against an XML Schema for DataCite.

The XmlMetadataTemplate uses the standard XMLStreamWriter (javax.xml.stream), which automatically escapes character data when writing XML. If I modify the values in the above test to include such characters and run it, they are escaped correctly.
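To illustrate the escaping behaviour, here is a minimal standalone sketch using the JDK's javax.xml.stream API. It is not the XmlMetadataTemplate code; the class and method names (XmlEscapeDemo, writeCreatorName) are made up for this example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class; not taken from the Dataverse code base.
class XmlEscapeDemo {

    // Writes one <creatorName> element; writeCharacters escapes &, < and >.
    static String writeCreatorName(String name) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("creatorName");
            xml.writeCharacters(name);
            xml.writeEndElement();
            xml.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints <creatorName>S&amp;P Global</creatorName>
        System.out.println(writeCreatorName("S&P Global"));
    }
}
```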

It makes me think that even though the error you see mentions the DataCite schema, the EZID code path doesn't use the XmlMetadataTemplate. The develop branch no longer includes edu.harvard.iq.dataverse.DOIEZIdServiceBean; the EZID provider there, edu.harvard.iq.dataverse.pidproviders.doi.ezid.EZIdDOIProvider, doesn't use the template but relies on the externally developed EZIDService instead. Apparently, that service doesn't escape strings for XML.
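If the metadata XML on that path is assembled by plain string substitution (an assumption; I have not traced how the EZID request is built), each value would need to be escaped before it is inserted. A rough sketch of such a helper follows; XmlTextEscaper and escapeXmlText are hypothetical names, not existing Dataverse or EZID APIs.

```java
// Hypothetical helper for illustration; not part of Dataverse or the EZID client.
class XmlTextEscaper {

    // Replaces the five XML-reserved characters with their predefined entities.
    static String escapeXmlText(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String creator = escapeXmlText("S&P Global");
        System.out.println("<creatorName>" + creator + "</creatorName>");
    }
}
```

Note that a helper like this must be applied exactly once to raw values: running it on text that is already escaped would turn &amp; into &amp;amp;.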

So it appears that the root problem is not in Dataverse (at least for this issue ;)).

bencomp avatar Oct 10 '24 08:10 bencomp