Validate content of <dc:language>

Open jon-moreira opened this issue 9 years ago • 9 comments

epubcheck doesn't check dc:language value!

According to the specification:

Every metadata section must include at least one language element with a value conforming to [RFC5646].

The following example shows a Publication is in U.S. English.

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    …
    <dc:language>en-US</dc:language>
    …
</metadata>

The content.opf of my EPUB after export from Adobe InDesign:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<package xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" unique-identifier="bookid" version="2.0">
    <metadata>
        <meta name="generator" content="Adobe InDesign"/>
        <meta name="cover" content="xxx-cover.jpg"/>
        <dc:title>xxxx</dc:title>
        <dc:creator>xxx</dc:creator>
        <dc:subject></dc:subject>
        <dc:description>xxx</dc:description>
        <dc:publisher>Editorial Presença</dc:publisher>
        <dc:date>2016-02-11</dc:date>
        <dc:source></dc:source>
        <dc:relation></dc:relation>
        <dc:coverage></dc:coverage>
        <dc:rights></dc:rights>
        <dc:language>en-US-POSIX</dc:language>
        <dc:language>pt-BR</dc:language>
        <dc:identifier id="bookid">xxx</dc:identifier>
    </metadata>

<dc:language>en-US-POSIX</dc:language> doesn't have a valid value and epubcheck ignores that.
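
For anyone who wants to inspect these values outside of epubcheck, here is a minimal Python sketch (not epubcheck code) that lists every dc:language value in a package document using only the standard library. The function name `language_values` and the trimmed-down `SAMPLE_OPF` document are made up for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

# Hypothetical, trimmed-down package document modelled on the export above.
SAMPLE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         unique-identifier="bookid" version="2.0">
    <metadata>
        <dc:title>xxxx</dc:title>
        <dc:language>en-US-POSIX</dc:language>
        <dc:language>pt-BR</dc:language>
        <dc:identifier id="bookid">xxx</dc:identifier>
    </metadata>
</package>
"""


def language_values(opf_text):
    """Return the text of every dc:language element inside <metadata>."""
    # Parse from bytes so the encoding declaration is honoured by the parser.
    root = ET.fromstring(opf_text.encode("utf-8"))
    metadata = root.find(f"{{{OPF_NS}}}metadata")
    if metadata is None:
        return []
    return [
        (element.text or "").strip()
        for element in metadata.iter(f"{{{DC_NS}}}language")
    ]


if __name__ == "__main__":
    values = language_values(SAMPLE_OPF)
    print(values)
    # The spec text quoted above asks for at least one language element.
    print("has at least one language:", len(values) >= 1)
```

This only extracts the values; whether each one conforms to RFC5646 is a separate question.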

epubcheck output:

java -jar epubcheck.jar xxx.epub 
Validating using EPUB version 2.0.1 rules.
No errors or warnings detected.
epubcheck completed

jon-moreira avatar Aug 09 '16 08:08 jon-moreira

While at first glance this looks easy to implement, it gets harder when you look at the RFC5646 spec itself and not only at the EPUB example: https://tools.ietf.org/html/rfc5646#appendix-A

Possibly allowed language tags:

  • de
    • (German)
  • en-US
    • (English as used in the United States)
  • zh-Hans
    • (Chinese written using the Simplified Chinese script)
  • zh-cmn-Hans-CN
    • (Chinese, Mandarin, Simplified script, as used in China)
  • sl-rozaj
    • (Resian dialect of Slovenian)
  • de-CH-1901
    • (German as used in Switzerland using the 1901 variant [orthography])
  • hy-Latn-IT-arevela
    • (Eastern Armenian written in Latin script, as used in Italy)
  • az-Arab-x-AZE-derbend
    • (private use subtags)

To be honest, that's a validation nightmare! And I don't see a quick way to build a validation engine for that...

In fact, your example en-US-POSIX could even be a valid RFC5646 language tag, although it doesn't make sense to us now...

Removing this from the "Next" milestone for the moment...


Note to myself: IANA Language Subtag Registry: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

tofi86 avatar Dec 22 '16 09:12 tofi86

Does the simple type xsd:language address this problem?

murata2makoto avatar Jul 08 '17 01:07 murata2makoto

Looking at the examples at http://www.datypic.com/sc/xsd/t-xsd_language.html this indeed seems a good way to go! So far I had only looked at this from a Java perspective, not from the schema validation point of view...

However, looking at the specs, EPUB->OPF->DublinCore requires RFC5646, which obsoletes the RFC that the XML Schema definition is based on, right? So the DublinCore metadata may allow more valid language codes than XML Schema can validate, although I don't have an example for that.

Still, if @mattgarrish, as our spec guru, agrees, I would give this a go and change the schema datatype to xsd:language.

tofi86 avatar Jul 08 '17 12:07 tofi86

The schemas already enforce xsd:language constraints:

opf.dc.language = element dc:language { opf.id.attr? & datatype.languagecode }

datatype.languagecode = datatype.BCP47
datatype.BCP47 = xsd:language { pattern = "[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*" }

But that just enforces the lexical constraint without trying to verify the validity of the segments. The request, as I understand it, is to go further and validate the segments.
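
To illustrate what a purely lexical check accepts, here is a small Python sketch that applies the pattern from the datatype.BCP47 definition quoted above. It is not how epubcheck runs its schemas; the helper name `is_lexically_ok` is made up, and `re.fullmatch` is used on the assumption that the schema pattern has to match the whole value.

```python
import re

# Pattern copied from the datatype.BCP47 definition quoted above.
BCP47_LEXICAL = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_lexically_ok(tag):
    """True if the whole tag matches the lexical pattern."""
    return BCP47_LEXICAL.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ["en-US", "pt-BR", "en-US-POSIX", "en_US", "en-"]:
        print(f"{tag!r}: {is_lexically_ok(tag)}")
```

In this sketch en-US-POSIX passes, because POSIX is simply one more run of up to eight alphanumeric characters; only shapes such as en_US or a trailing hyphen are rejected.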

It would be great if that were done, but it seems like no small task and a perpetual moving target.

mattgarrish avatar Jul 08 '17 12:07 mattgarrish

It would be nice if meaningless tags such as en-US-POSIX were detected. But if some programming (as opposed to schema hacking) is required, I am not sure this is important enough.

murata2makoto avatar Jul 08 '17 13:07 murata2makoto

Update: @kalaspuffar started working on this in PR #807. Review of the PR is welcome.

tofi86 avatar Nov 21 '17 21:11 tofi86

Unless we check the IANA registry, I don't think there's much we can do here beyond the lexical check performed by the schema?

rdeltour avatar Nov 27 '17 23:11 rdeltour

Yes, checking if language tags are valid requires access to or a copy of the registry.
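
As a rough idea of what a registry-backed check involves, here is a simplified Python sketch. Everything in it is an assumption made for illustration: `SAMPLE_REGISTRY` is a hand-written, heavily abbreviated sample in the registry's record format (records separated by `%%`, with `Type` and `Subtag` fields), not the real file, and the helpers `load_registry`, `guess_type` and `unknown_subtags` are hypothetical names. It ignores grandfathered tags, subtag ranges, folded lines, prefixes and deprecation, and it assumes the tag already passed the lexical check.

```python
# Hand-written, heavily abbreviated sample in the registry's record format.
# The real registry has thousands of records and more fields per record.
SAMPLE_REGISTRY = """\
Type: language
Subtag: en
%%
Type: language
Subtag: pt
%%
Type: region
Subtag: US
%%
Type: region
Subtag: BR
"""


def load_registry(text):
    """Return a set of (type, lowercased subtag) pairs from registry text."""
    known = set()
    for record in text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        # Records without both fields (e.g. a file header) are skipped.
        if "Type" in fields and "Subtag" in fields:
            known.add((fields["Type"], fields["Subtag"].lower()))
    return known


def guess_type(subtag):
    """Guess the type of a non-initial subtag from its shape (simplified)."""
    if len(subtag) == 4 and subtag.isalpha():
        return "script"
    if (len(subtag) == 2 and subtag.isalpha()) or (
        len(subtag) == 3 and subtag.isdigit()
    ):
        return "region"
    if len(subtag) == 3 and subtag.isalpha():
        return "extlang"
    return "variant"


def unknown_subtags(tag, known):
    """List the subtags of `tag` that are missing from `known`.

    Checking stops at the first single-character subtag, which starts an
    extension or private-use sequence that this sketch does not inspect.
    """
    parts = tag.split("-")
    missing = []
    if ("language", parts[0].lower()) not in known:
        missing.append(parts[0])
    for part in parts[1:]:
        if len(part) == 1:
            break
        if (guess_type(part), part.lower()) not in known:
            missing.append(part)
    return missing


if __name__ == "__main__":
    known = load_registry(SAMPLE_REGISTRY)
    for tag in ["pt-BR", "en-US-POSIX"]:
        print(tag, unknown_subtags(tag, known))
```

With this sample data, pt-BR has no unknown subtags while en-US-POSIX reports POSIX. A real implementation would need a bundled or downloaded copy of the registry and a way to keep it current, which is the maintenance cost raised earlier in this thread.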

I didn't check EPUB 3.2, but the EPUB 3.0 spec text in the first comment didn't say if it requires the language tag to be well-formed or valid. The LTLI document from W3C i18n WG contains some guidance on this.

xfq avatar Aug 18 '20 08:08 xfq

We had a long discussion about well-formed vs. valid for web publications, and the resulting consensus was that there is little value in enforcing validity. Reading systems will react or not based on whether they recognize the language, so ensuring the general pattern is followed is all that is necessary. This really should be clarified in the EPUB spec.

mattgarrish avatar Aug 18 '20 11:08 mattgarrish