archive-hocr-tools
Epub: address `epubcheck` validation errors
This PR adds two commits to address two separate `epubcheck` validation errors.
The first relates to the mediatype (and HTML escaping), and the second relates to the table of contents.
With respect to the fix for `OPF-043`, `epubcheck` took issue with the `text/html` `media_type` for the spine. Once this was changed, the book text had to be HTML-escaped; otherwise any HTML appearing in the book text might be rendered as markup.
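To illustrate the escaping concern, here is a minimal sketch using only Python's standard library. It is not the code from this PR: `page_to_xhtml` and its parameters are hypothetical names, and the wrapper markup is an assumption chosen for illustration. It shows OCR text being escaped before it is placed inside an XHTML document, so that markup-like characters in the text stay literal.

```python
from html import escape
from typing import Iterable


def page_to_xhtml(title: str, paragraphs: Iterable[str]) -> str:
    """Wrap plain-text paragraphs in a minimal XHTML document.

    Hypothetical helper for illustration only. Every piece of text is
    passed through html.escape so that characters such as < and & cannot
    be interpreted as markup by the reading system.
    """
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        f"{body}\n"
        "  </body>\n"
        "</html>\n"
    )
```

Without the `escape` calls, a scanned line such as `<b>bold</b>` would be parsed as an element, and a stray `&` or `<` would make the XHTML ill-formed.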
The second commit, which deals with the table of contents, simply adds the Internet Archive scanning notice to it.
`ebooklib` anticipates there will be a TOC when using `epub.EpubNcx()` and `epub.EpubNav()`, which `hocr-to-epub` does use. If those aren't used, the files would need to be constructed manually, as those files are required.
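As a rough idea of what constructing one of those files manually could involve, the sketch below builds a small `toc.ncx` with Python's standard library. It is an illustration under assumptions, not what `ebooklib` or `hocr-to-epub` does internally: `build_ncx` and its parameters are hypothetical, the `chapter_N` ids simply mirror the output later in this description, and only the `dtb:uid` meta element is emitted.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"
# Serialize NCX elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", NCX_NS)


def _q(tag):
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{NCX_NS}}}{tag}"


def build_ncx(uid, title, entries):
    """Build a toc.ncx string (hypothetical helper, for illustration).

    entries is a list of (label, href) pairs; each becomes one navPoint,
    which is the element epubcheck reports as missing when navMap is empty.
    """
    ncx = ET.Element(_q("ncx"), {"version": "2005-1"})
    head = ET.SubElement(ncx, _q("head"))
    ET.SubElement(head, _q("meta"), {"content": uid, "name": "dtb:uid"})
    doc_title = ET.SubElement(ncx, _q("docTitle"))
    ET.SubElement(doc_title, _q("text")).text = title
    nav_map = ET.SubElement(ncx, _q("navMap"))
    for index, (label, href) in enumerate(entries):
        point = ET.SubElement(nav_map, _q("navPoint"), {"id": f"chapter_{index}"})
        nav_label = ET.SubElement(point, _q("navLabel"))
        ET.SubElement(nav_label, _q("text")).text = label
        ET.SubElement(point, _q("content"), {"src": href})
    return ET.tostring(ncx, encoding="unicode")
```

Calling `build_ncx(uid, title, [("Notice", "notice.html")])` would produce a `navMap` with a single entry; a matching `nav.xhtml` would still have to be written separately, which is why reusing `epub.EpubNcx()` and `epub.EpubNav()` is the simpler route.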
Validation prior to this PR:
```
❯ epubcheck ./test_output_no_toc_.epub
Validating using EPUB version 3.3 rules.
ERROR(RSC-005): ./test_output_no_toc_.epub/EPUB/toc.ncx(12,12): Error while parsing file: element "navMap" incomplete; missing required element "navPoint"
ERROR(RSC-005): ./test_output_no_toc_.epub/EPUB/nav.xhtml(10,12): Error while parsing file: element "ol" incomplete; missing required element "li"
Check finished with errors
Messages: 0 fatals / 2 errors / 0 warnings / 0 infos
EPUBCheck completed
```
The `toc.ncx` file prior to this PR:

```
❯ cat unzipped_no_toc/EPUB/toc.ncx
<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta content="sim_english-illustrated-magazine_1884-12_2_15" name="dtb:uid"/>
    <meta content="0" name="dtb:depth"/>
    <meta content="0" name="dtb:totalPageCount"/>
    <meta content="0" name="dtb:maxPageNumber"/>
  </head>
  <docTitle>
    <text>The English Illustrated Magazine 1884-12: Vol 2 Iss 15</text>
  </docTitle>
  <navMap/>
</ncx>
```
The `nav.xhtml` file prior to this PR:

```
❯ cat unzipped_no_toc/EPUB/nav.xhtml
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
  <head>
    <title>The English Illustrated Magazine 1884-12: Vol 2 Iss 15</title>
  </head>
  <body>
    <nav epub:type="toc" id="id" role="doc-toc">
      <h2>The English Illustrated Magazine 1884-12: Vol 2 Iss 15</h2>
      <ol/>
    </nav>
  </body>
</html>
```
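Both files above fail for the same reason: the TOC container (`navMap` in `toc.ncx`, `ol` in `nav.xhtml`) has no entries. A quick way to spot this without running `epubcheck` could look like the sketch below. It uses only the standard library, the function names are hypothetical, and it checks only this one condition rather than full EPUB validity.

```python
import xml.etree.ElementTree as ET

NS = {
    "ncx": "http://www.daisy.org/z3986/2005/ncx/",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def ncx_has_entries(ncx_bytes: bytes) -> bool:
    """Return True if navMap in a toc.ncx contains at least one navPoint."""
    root = ET.fromstring(ncx_bytes)
    return root.find("ncx:navMap/ncx:navPoint", NS) is not None


def nav_has_entries(nav_bytes: bytes) -> bool:
    """Return True if a nav element's ol in nav.xhtml contains at least one li."""
    root = ET.fromstring(nav_bytes)
    return root.find(".//xhtml:nav/xhtml:ol/xhtml:li", NS) is not None
```

The bytes could come from the unzipped files shown here, or be read directly from the archive with `zipfile.ZipFile(path).read("EPUB/toc.ncx")`.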
With the notice as the TOC, the validation passes:

```
❯ epubcheck ./test_output_with_toc.epub
Validating using EPUB version 3.3 rules.
No errors or warnings detected.
Messages: 0 fatals / 0 errors / 0 warnings / 0 infos
EPUBCheck completed
```
The `toc.ncx` file after this PR:

```
❯ cat unzipped_with_toc/EPUB/toc.ncx
<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta content="sim_english-illustrated-magazine_1884-12_2_15" name="dtb:uid"/>
    <meta content="0" name="dtb:depth"/>
    <meta content="0" name="dtb:totalPageCount"/>
    <meta content="0" name="dtb:maxPageNumber"/>
  </head>
  <docTitle>
    <text>The English Illustrated Magazine 1884-12: Vol 2 Iss 15</text>
  </docTitle>
  <navMap>
    <navPoint id="chapter_0">
      <navLabel>
        <text>Notice</text>
      </navLabel>
      <content src="notice.html"/>
    </navPoint>
  </navMap>
</ncx>
```
The `nav.xhtml` file after this PR:

```
❯ cat unzipped_with_toc/EPUB/nav.xhtml
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
  <head>
    <title>The English Illustrated Magazine 1884-12: Vol 2 Iss 15</title>
  </head>
  <body>
    <nav epub:type="toc" id="id" role="doc-toc">
      <h2>The English Illustrated Magazine 1884-12: Vol 2 Iss 15</h2>
      <ol>
        <li>
          <a href="notice.html">Notice</a>
        </li>
      </ol>
    </nav>
  </body>
</html>
```