janeway
Always validate JATS XML on upload or generation
Is your feature request related to a problem? Please describe.
When a publisher sends XML from Janeway to a third party such as Portico, the third party sometimes reports XML parsing issues. This can happen because typesetters responsible for creating XML files do not validate them before uploading. It can also happen with Janeway-generated XML stubs if the stub generation code has bugs.
Describe the solution you'd like
In either case, we should validate JATS XML wherever it is uploaded or generated, and report any errors back to the appropriate user. This will catch errors at the point of creation and keep invalid XML out of the Janeway data store in the first place.
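As a rough illustration of the feedback mechanism, here is a minimal sketch of a check that could run at upload or generation time. It only tests well-formedness using the Python standard library, not validity against a JATS schema, and `check_well_formed` is a hypothetical name rather than an existing Janeway function.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes):
    """Return (ok, message) for an XML payload.

    Hypothetical helper: checks well-formedness only, not JATS validity.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        # ParseError carries the (line, column) where parsing stopped.
        line, column = exc.position
        return False, (
            f"XML is not well-formed at line {line}, column {column}: {exc}"
        )
    return True, "XML is well-formed"
```

The returned message could be shown to the typesetter on upload, or logged when stub generation fails. Schema validation against JATS would be a separate step on top of this.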
Additional context
- There is a choice between using DTD and XML Schema (see this old explainer on the difference). But we already use XSD in unit tests in the identifier app as part of our testing for Crossref deposits, so it may make most sense to reuse those XSD files and validation functions.
- There is the question of which JATS version to test our XML against. I believe the XSD files in the identifier app are for JATS 1.1, so we may actually be better off getting and testing against the 1.3 schema files: https://jats.nlm.nih.gov/publishing/1.3/
It is worth noting that JATS validation has a bearing on Crossref deposits because JATS XML populates the citation_list element of Crossref deposits.
For example, in one case a JATS file was uploaded with self-closing author tags, which Crossref said isn't supported:
<citation key="keyref_R20">
  <journal_title>Youth &amp; Society</journal_title>
  <author/>
  <volume>49</volume>
  <issue>4</issue>
  <first_page>461</first_page>
  <cYear>2017</cYear>
  <article_title>
    Development and validation of the critical consciousness scale.
  </article_title>
</citation>
This was not caught by identifiers.logic as of bb2c0c4b9599b3673385e4eb95ed9a591bd24fe5.
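A check along the following lines might catch that case before deposit. This is only a sketch using the standard library: `find_empty_authors` and `local_name` are hypothetical names, not part of identifiers.logic, and the parser cannot distinguish `<author/>` from `<author></author>`, so both are flagged as empty.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_empty_authors(citation_list_xml):
    """Return the keys of citations containing an empty author element."""
    root = ET.fromstring(citation_list_xml)
    flagged = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip anything that is not a regular element
        if local_name(element.tag) != "citation":
            continue
        for child in element:
            if not isinstance(child.tag, str):
                continue
            if local_name(child.tag) != "author":
                continue
            has_text = bool((child.text or "").strip())
            if not has_text and len(child) == 0:
                flagged.append(element.get("key"))
                break  # one report per citation is enough
    return flagged
```

Running this over the citation list before sending the deposit would let us warn the user, or drop the empty element, instead of waiting for Crossref to reject it.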