WIP HTML -> guided navigation conversion
Work in progress. Given the following input:
<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>
<div>
<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
</div>
<section role="doc-chapter" epub:type="chapter">
<h1>Title of the chapter</h1>
</section>
<ul>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ul>
<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>
<img src="image1.avif" alt="Alternative text using the alt attribute">
<span role="img" aria-label="Rating: 4 out of 5 stars">
<span>★</span>
<span>★</span>
<span>★</span>
<span>★</span>
<span>☆</span>
</span>
<figure aria-labelledby="cat-caption">
<pre>
/\_/\
( o.o )
^
</pre>
<figcaption id="cat-caption">
ASCII Art of a cat face
</figcaption>
</figure>
</body>
</html>
the following guided nav doc is generated:
{
"guided": [
{
"children": [
{
"children": [
{
"text": {
"language": "fr",
"plain": "Paragraphe avec image: "
}
},
{
"description": "A cool image",
"imgref": "src/image.jpg",
"role": [
"image"
]
}
],
"role": [
"paragraph"
]
},
{
"children": [
{
"text": "This job requires a certain "
},
{
"text": {
"language": "fr",
"plain": "savoir faire"
}
},
{
"text": " that can only be acquired over time."
}
],
"role": [
"paragraph"
]
},
{
"children": [
{
"text": "This is a paragraph with some very-strong bold text!"
}
],
"role": [
"paragraph"
]
},
{
"children": [
{
"children": [
{
"text": "And the next pagebreak is in the middle of a sentence."
}
],
"role": [
"paragraph"
]
}
]
},
{
"children": [
{
"children": [
{
"text": "Title of the chapter"
}
],
"role": [
"heading"
]
}
],
"role": [
"chapter"
]
},
{
"children": [
{
"children": [
{
"text": "First item"
}
],
"role": [
"listItem"
]
},
{
"children": [
{
"text": "Second item"
}
],
"role": [
"listItem"
]
},
{
"children": [
{
"text": "Third item"
}
],
"role": [
"listItem"
]
}
],
"role": [
"list"
]
},
{
"children": [
{
"imgref": "with_image.jpg",
"role": [
"image"
]
}
],
"role": [
"paragraph"
]
},
{
"description": "Alternative text using the alt attribute",
"imgref": "image1.avif",
"role": [
"image"
]
},
{
"description": "Rating: 4 out of 5 stars",
"role": [
"image"
]
},
{
"description": "ASCII Art of a cat face",
"role": [
"figure"
]
}
]
}
]
}
Looking at the results, here are a few early comments:
- we shouldn't cut into multiple elements like we did with Content Iterator when we encounter another language; instead we should use SSML on text and indicate language changes that way
- SSML should also handle emphasis, which would cover at least <em> and <i> but probably <strong> and <b> as well
- we seem to use too many children everywhere, for example the <h1> element should result in a single object with a role (heading), a level (it's missing right now) and a text
- this seems to be missing support for pagebreaks, whether they're on their own or within another element (which would require SSML)
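To illustrate the comment about <h1>, here is a minimal Go sketch of collapsing a text-only heading into a single object with a role, a level and a text. The Node type and the headingLevel and headingNode helpers are hypothetical names made up for this example, not the toolkit's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node is a hypothetical, simplified guided navigation object.
type Node struct {
	Role  []string `json:"role"`
	Level int      `json:"level,omitempty"`
	Text  string   `json:"text,omitempty"`
}

// headingLevel returns the level encoded in an h1..h6 tag name.
func headingLevel(tag string) (int, bool) {
	if len(tag) == 2 && tag[0] == 'h' && tag[1] >= '1' && tag[1] <= '6' {
		return int(tag[1] - '0'), true
	}
	return 0, false
}

// headingNode collapses a heading that contains only text into a single
// object instead of wrapping the text in a child object.
func headingNode(tag, text string) (Node, bool) {
	level, ok := headingLevel(tag)
	if !ok {
		return Node{}, false
	}
	return Node{Role: []string{"heading"}, Level: level, Text: text}, true
}

func main() {
	n, _ := headingNode("h1", "Title of the chapter")
	out, err := json.MarshalIndent(n, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The same idea would apply to any element whose only content is text, such as the list items in the example.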
Updated input:
<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
<p xml:lang="fr">Paragraphe avec image #1 <img src="src/image.jpg" alt="A cool image" /> et #2 <img src="src/image.jpg" alt="A second cool image" />!</p>
<p xml:lang="fr"><img src="src/image.jpg" alt="The coolest image" /> et <img src="src/image.jpg" alt="The boring image" /></p>
<p>A paragraph with: <img src="src/image.jpg" alt="A cool image" /><em xml:lang="fr">est cool!</em></p>
<p><i>Simple paragraph</i></p>
<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>
<p>Just<br />testing<br>some<br /> breaks! And useless <span>elements</span>...</p>
<div>
<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
</div>
<section role="doc-chapter" epub:type="chapter">
<h1>Title of the chapter</h1>
</section>
<ul>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ul>
<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>
<p aria-hidden="true">More Hidden text</p>
<p aria-hidden="true">More Hidden text</p>
<img src="image1.avif" alt="Alternative text using the alt attribute">
<span role="img" aria-label="Rating: 4 out of 5 stars">
<span>★</span>
<span>★</span>
<span>★</span>
<span>★</span>
<span>☆</span>
</span>
<figure aria-labelledby="cat-caption">
<pre>
/\_/\
( o.o )
^
</pre>
<figcaption id="cat-caption">
ASCII Art of a cat face
</figcaption>
</figure>
</body>
</html>
output:
{
"guided": [
{
"children": [
{
"children": [
{
"text": {
"language": "fr",
"plain": "Paragraphe avec image:"
}
},
{
"description": "A cool image",
"imgref": "src/image.jpg",
"role": [
"image"
]
}
],
"role": [
"paragraph"
]
},
{
"children": [
{
"text": {
"language": "fr",
"plain": "Paragraphe avec image #1"
}
},
{
"description": "A cool image",
"imgref": "src/image.jpg",
"role": [
"image"
]
},
{
"text": {
"language": "fr",
"plain": "et #2"
}
},
{
"description": "A second cool image",
"imgref": "src/image.jpg",
"role": [
"image"
]
},
{
"text": {
"language": "fr",
"plain": "!"
}
}
],
"role": [
"paragraph"
]
},
{
"children": [
{
"description": "The coolest image",
"imgref": "src/image.jpg",
"role": [
"image"
]
},
{
"text": {
"language": "fr",
"plain": "et"
}
},
{
"description": "The boring image",
"imgref": "src/image.jpg",
"role": [
"image"
]
}
],
"role": [
"paragraph"
]
},
{
"children": [
{
"text": "A paragraph with:"
},
{
"description": "A cool image",
"imgref": "src/image.jpg",
"role": [
"image"
]
},
{
"text": {
"ssml": "<emphasis xml:lang=\"fr\">est cool!</emphasis>"
}
}
],
"role": [
"paragraph"
]
},
{
"role": [
"paragraph"
],
"text": {
"ssml": "<emphasis level=\"reduced\">Simple paragraph</emphasis>"
}
},
{
"role": [
"paragraph"
],
"text": {
"ssml": "<emphasis>This job requires a certain </emphasis><lang xml:lang=\"fr\">savoir faire</lang> that can only be acquired over time."
}
},
{
"role": [
"paragraph"
],
"text": {
"ssml": "<emphasis>This is a paragraph </emphasis><emphasis>with some very-</emphasis><emphasis>strong</emphasis> bold text!"
}
},
{
"role": [
"paragraph"
],
"text": {
"ssml": "Just<break/>testing<break/>some<break/> breaks! And useless elements..."
}
},
{
"children": [
{
"children": [
{
"role": [
"paragraph"
],
"text": "And the next pagebreak is in the middle of a sentence."
}
],
"role": [
"pagebreak"
]
}
]
},
{
"children": [
{
"level": 1,
"role": [
"heading"
],
"text": "Title of the chapter"
}
],
"role": [
"chapter"
]
},
{
"children": [
{
"role": [
"listItem"
],
"text": "First item"
},
{
"role": [
"listItem"
],
"text": "Second item"
},
{
"role": [
"listItem"
],
"text": "Third item"
}
],
"role": [
"list"
]
},
{
"description": "Alternative text using the alt attribute",
"imgref": "image1.avif",
"role": [
"image"
]
},
{
"description": "Rating: 4 out of 5 stars",
"role": [
"image"
]
},
{
"description": "ASCII Art of a cat face",
"role": [
"figure"
]
}
]
}
]
}
Notes:
- The following HTML --> SSML logic now takes place:
  - <em> and <b> are turned into <emphasis>.
  - <i> becomes <emphasis level="reduced">.
  - <strong> becomes <emphasis level="strong">.
  - <br> becomes <break>.
  - Any change in language in the document becomes <lang xml:lang="xx">.
  Let me know if others are needed.
- In the example for "Title of Chapter", the roles are ["section", "chapter"]. The roles in the output above are just ["chapter"]. Based on the definition of section being more generic than chapter, this seems fine to me. The reason it's only chapter is that currently, if the element has a role from ARIA, the role is not inferred from the actual HTML tag.
- @HadrienGardeur What will we do about videos? There's audio/img/text ref but no video ref
- noteref and pagebreak are WIP, I'm evaluating the best way to query and link things together in the tree, whether a homegrown search will suffice or if we need goquery
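For reference, the HTML --> SSML mapping from the first note can be sketched in Go as a lookup table. The ssmlTag type and the wrap and wrapLang helpers are hypothetical names used only for this example, and the text passed in is assumed to be XML-escaped already.

```go
package main

import "fmt"

// ssmlTag describes the SSML element that replaces an HTML element.
// The type and the table below are illustrative names only.
type ssmlTag struct {
	Name  string // SSML element name
	Level string // value of the level attribute, empty when unset
	Void  bool   // true for elements without content, such as <break/>
}

// htmlToSSML mirrors the mapping listed in the notes above.
var htmlToSSML = map[string]ssmlTag{
	"em":     {Name: "emphasis"},
	"b":      {Name: "emphasis"},
	"i":      {Name: "emphasis", Level: "reduced"},
	"strong": {Name: "emphasis", Level: "strong"},
	"br":     {Name: "break", Void: true},
}

// wrap renders text inside the SSML equivalent of an HTML tag.
// Unmapped tags (for example <span>) contribute only their text.
func wrap(htmlTag, text string) string {
	t, ok := htmlToSSML[htmlTag]
	if !ok {
		return text
	}
	if t.Void {
		return "<" + t.Name + "/>"
	}
	open := "<" + t.Name
	if t.Level != "" {
		open += fmt.Sprintf(` level="%s"`, t.Level)
	}
	return open + ">" + text + "</" + t.Name + ">"
}

// wrapLang marks a change of language with <lang xml:lang="...">.
func wrapLang(lang, text string) string {
	return fmt.Sprintf(`<lang xml:lang="%s">%s</lang>`, lang, text)
}

func main() {
	fmt.Println(wrap("i", "Simple paragraph"))
	fmt.Println("This job requires a certain " + wrapLang("fr", "savoir faire") + " that can only be acquired over time.")
	fmt.Println("Just" + wrap("br", "") + "testing")
}
```

Keeping the mapping in one table should make it easy to add further elements if others turn out to be needed.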
Looking better overall.
I still notice objects with just children in them when we don't match the HTML element to a role though: that's the case for <body> and <div> in this example.
Given the very large number of <div> or <span> in an ebook, it would be better if we could avoid this.
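One possible way to avoid these wrapper objects is to hoist the children of any node that has neither a role nor content of its own. The Go sketch below assumes a simplified, hypothetical Node type and a flatten helper invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node is a simplified, hypothetical guided navigation object.
type Node struct {
	Role     []string `json:"role,omitempty"`
	Text     string   `json:"text,omitempty"`
	Children []*Node  `json:"children,omitempty"`
}

// flatten returns the nodes that should replace n in its parent.
// A node without a role and without text (such as an unmatched <body>,
// <div> or <span>) is dropped and its flattened children take its place.
func flatten(n *Node) []*Node {
	var kids []*Node
	for _, c := range n.Children {
		kids = append(kids, flatten(c)...)
	}
	if len(n.Role) == 0 && n.Text == "" {
		return kids
	}
	out := *n // shallow copy so the input tree is left untouched
	out.Children = kids
	return []*Node{&out}
}

func main() {
	// Same nesting as <body> / <div> / <p> and <section> / <h1> above.
	body := &Node{Children: []*Node{
		{Children: []*Node{
			{Role: []string{"paragraph"}, Text: "And the next pagebreak is in the middle of a sentence."},
		}},
		{Role: []string{"chapter"}, Children: []*Node{
			{Role: []string{"heading"}, Text: "Title of the chapter"},
		}},
	}}
	out, err := json.MarshalIndent(flatten(body), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With this approach the paragraph and the chapter end up as top-level entries, while objects that do carry a role keep their children.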
The examples with an image in the middle of a sentence also make me wonder if we shouldn't have an approach similar to pagebreaks and notes, where we use a custom SSML tag instead of breaking up text into multiple objects.
This would apply to <img>, <audio> and <video>.
If we go back to this example:
<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
The output should look like this:
{
"role": ["paragraph"],
"text": {
"language": "fr",
"ssml": "Paragraphe avec image: <readium:image id=\"image1\" />",
"children": [
{
"role": ["image"],
"id": "image1",
"imgref": "src/image.jpg",
"description": "A cool image"
}
]
}
}
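A rough Go sketch of how such an SSML string could be assembled: inline parts are joined into one string and each image is replaced by a <readium:image> placeholder pointing at a generated id. The Image and Part types, the buildSSML helper and the image1, image2... id scheme are assumptions made for this example.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Image is a hypothetical inline image found inside a paragraph.
type Image struct {
	ID          string
	Ref         string
	Description string
}

// Part is either a run of text or, when Img is non-nil, an image.
type Part struct {
	Text string
	Img  *Image
}

// buildSSML joins the inline parts into a single SSML string. Each image
// becomes a <readium:image id="..." /> placeholder and is returned
// separately, with the generated id, so it can be listed under children.
func buildSSML(parts []Part) (string, []Image) {
	var sb strings.Builder
	var images []Image
	for _, p := range parts {
		if p.Img == nil {
			sb.WriteString(html.EscapeString(p.Text))
			continue
		}
		img := *p.Img // copy, the caller's value is not modified
		img.ID = fmt.Sprintf("image%d", len(images)+1)
		fmt.Fprintf(&sb, `<readium:image id="%s" />`, img.ID)
		images = append(images, img)
	}
	return sb.String(), images
}

func main() {
	ssml, images := buildSSML([]Part{
		{Text: "Paragraphe avec image: "},
		{Img: &Image{Ref: "src/image.jpg", Description: "A cool image"}},
	})
	fmt.Println(ssml)
	fmt.Printf("%+v\n", images)
}
```

The same placeholder mechanism could in principle be reused for audio and video, with a different element name.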
For further contextualization, I think that we should include textref in our top-level nodes at least.
For example, if we add body as a role:
{
"role": ["body"],
"textref": "chapter.xhtml",
"children": []
}
To further help with an implementation optimized for search and/or highlighting, we could also go beyond that and provide this information per node with fragments such as:
- ID (#identifier)
- and/or CSS selectors (#css(.content:nth-child(2)))
For example a paragraph with par1 as its identifier:
{
"role": ["paragraph"],
"textref": "chapter.xhtml#par1"
}
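As a sketch of how such references could be built, the hypothetical Go helper below prefers the element's id and falls back to a CSS selector fragment, otherwise pointing at the resource itself. The function name and the order of preference are assumptions, and no escaping of the fragment is attempted.

```go
package main

import "fmt"

// textref builds a reference to a node inside a resource.
// It prefers the element id, then a CSS selector, and falls back to the
// bare resource href when neither is available.
func textref(href, id, selector string) string {
	switch {
	case id != "":
		return href + "#" + id
	case selector != "":
		return href + "#css(" + selector + ")"
	default:
		return href
	}
}

func main() {
	fmt.Println(textref("chapter.xhtml", "par1", ""))
	fmt.Println(textref("chapter.xhtml", "", ".content:nth-child(2)"))
	fmt.Println(textref("chapter.xhtml", "", ""))
}
```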
The following HTML --> SSML logic now takes place:
- <em> and <b> are turned into <emphasis>.
- <i> becomes <emphasis level="reduced">.
- <strong> becomes <emphasis level="strong">.
- <br> becomes <break>.
- Any change in language in the document becomes <lang xml:lang="xx">.
Let me know if others are needed
@GoobyTheBOI any thoughts on this based on your own work?