WIP HTML -> guided navigation conversion

Open chocolatkey opened this issue 2 months ago • 6 comments

Work in progress. Given the following input:

<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
	<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
	<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
	<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>

	<div>
	<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
	<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
	</div>


	<section role="doc-chapter" epub:type="chapter">
		<h1>Title of the chapter</h1>
	</section>
	<ul>
		<li>First item</li>
		<li>Second item</li>
		<li>Third item</li>
	</ul>
	<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>

	<img src="image1.avif" alt="Alternative text using the alt attribute">
	<span role="img" aria-label="Rating: 4 out of 5 stars">
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>☆</span>
	</span>
	<figure aria-labelledby="cat-caption"> 
		<pre>
			/\_/\
		( o.o )
				 ^ 
		</pre>
		<figcaption id="cat-caption">
		ASCII Art of a cat face
		</figcaption>
	</figure>
</body>
</html>

The following guided navigation document is generated:

{
    "guided": [
        {
            "children": [
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image: "
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "This job requires a certain "
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "savoir faire"
                            }
                        },
                        {
                            "text": " that can only be acquired over time."
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "This is a paragraph with some very-strong bold text!"
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "And the next pagebreak is in the middle of a sentence."
                                }
                            ],
                            "role": [
                                "paragraph"
                            ]
                        }
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "Title of the chapter"
                                }
                            ],
                            "role": [
                                "heading"
                            ]
                        }
                    ],
                    "role": [
                        "chapter"
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "First item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        },
                        {
                            "children": [
                                {
                                    "text": "Second item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        },
                        {
                            "children": [
                                {
                                    "text": "Third item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        }
                    ],
                    "role": [
                        "list"
                    ]
                },
                {
                    "children": [
                        {
                            "imgref": "with_image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "description": "Alternative text using the alt attribute",
                    "imgref": "image1.avif",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "Rating: 4 out of 5 stars",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "ASCII Art of a cat face",
                    "role": [
                        "figure"
                    ]
                }
            ]
        }
    ]
}

chocolatkey avatar Sep 16 '25 12:09 chocolatkey

Looking at the results, here are a few early comments:

  • We shouldn't split text into multiple elements when we encounter another language, as we did with Content Iterator. Instead, we should use SSML on the text and indicate language changes that way.
  • SSML should also handle emphasis, which would cover at least <em> and <i>, but probably <strong> and <b> as well.
  • We seem to use too many children everywhere. For example, the <h1> element should result in a single object with a role (heading), a level (missing right now) and a text.
  • Support for pagebreaks seems to be missing, whether they stand on their own or sit within another element (which would require SSML).
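
To make the point about headings concrete, here is a minimal Go sketch of a flattened heading object. The `GuidedObject` type and the `headingLevel` and `newHeading` helpers are hypothetical names invented for this illustration; they are not the toolkit's actual types or API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GuidedObject is a hypothetical, simplified node type used only in this
// sketch; it is not the toolkit's real type.
type GuidedObject struct {
	Role     []string       `json:"role,omitempty"`
	Level    int            `json:"level,omitempty"`
	Text     string         `json:"text,omitempty"`
	Children []GuidedObject `json:"children,omitempty"`
}

// headingLevel returns 1 to 6 for "h1" through "h6" and 0 for any other tag.
func headingLevel(tag string) int {
	if len(tag) == 2 && tag[0] == 'h' && tag[1] >= '1' && tag[1] <= '6' {
		return int(tag[1] - '0')
	}
	return 0
}

// newHeading builds a single object carrying role, level and text,
// instead of wrapping the text in a child object.
func newHeading(tag, text string) GuidedObject {
	return GuidedObject{
		Role:  []string{"heading"},
		Level: headingLevel(tag),
		Text:  text,
	}
}

func main() {
	out, err := json.MarshalIndent(newHeading("h1", "Title of the chapter"), "", "    ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The level is derived from the tag name alone here; a real implementation might also need to honor `aria-level`, which this sketch ignores.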

HadrienGardeur avatar Sep 16 '25 14:09 HadrienGardeur

Updated input:

<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
	<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
	<p xml:lang="fr">Paragraphe avec image #1 <img src="src/image.jpg" alt="A cool image" /> et #2 <img src="src/image.jpg" alt="A second cool image" />!</p>
	<p xml:lang="fr"><img src="src/image.jpg" alt="The coolest image" /> et <img src="src/image.jpg" alt="The boring image" /></p>
	<p>A paragraph with: <img src="src/image.jpg" alt="A cool image" /><em xml:lang="fr">est cool!</em></p>
	<p><i>Simple paragraph</i></p>
	<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
	<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>
	<p>Just<br />testing<br>some<br /> breaks! And useless <span>elements</span>...</p>

	<div>
	<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
	<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
	</div>


	<section role="doc-chapter" epub:type="chapter">
		<h1>Title of the chapter</h1>
	</section>
	<ul>
		<li>First item</li>
		<li>Second item</li>
		<li>Third item</li>
	</ul>
	<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>
	<p aria-hidden="true">More Hidden text</p>
	<p aria-hidden="true">More Hidden text</p>

	<img src="image1.avif" alt="Alternative text using the alt attribute">
	<span role="img" aria-label="Rating: 4 out of 5 stars">
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>☆</span>
	</span>
	<figure aria-labelledby="cat-caption"> 
		<pre>
			/\_/\
		( o.o )
			^ 
		</pre>
		<figcaption id="cat-caption">
		ASCII Art of a cat face
		</figcaption>
	</figure>
</body>
</html>

Output:

{
    "guided": [
        {
            "children": [
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image:"
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image #1"
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "et #2"
                            }
                        },
                        {
                            "description": "A second cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "!"
                            }
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "description": "The coolest image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "et"
                            }
                        },
                        {
                            "description": "The boring image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "A paragraph with:"
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "ssml": "<emphasis xml:lang=\"fr\">est cool!</emphasis>"
                            }
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis level=\"reduced\">Simple paragraph</emphasis>"
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis>This job requires a certain </emphasis><lang xml:lang=\"fr\">savoir faire</lang>  that can only be acquired over time."
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis>This is a paragraph </emphasis><emphasis>with some very-</emphasis><emphasis>strong</emphasis>  bold text!"
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "Just<break/>testing<break/>some<break/> breaks! And useless elements..."
                    }
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "role": [
                                        "paragraph"
                                    ],
                                    "text": "And the next pagebreak is in the middle of a sentence."
                                }
                            ],
                            "role": [
                                "pagebreak"
                            ]
                        }
                    ]
                },
                {
                    "children": [
                        {
                            "level": 1,
                            "role": [
                                "heading"
                            ],
                            "text": "Title of the chapter"
                        }
                    ],
                    "role": [
                        "chapter"
                    ]
                },
                {
                    "children": [
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "First item"
                        },
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "Second item"
                        },
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "Third item"
                        }
                    ],
                    "role": [
                        "list"
                    ]
                },
                {
                    "description": "Alternative text using the alt attribute",
                    "imgref": "image1.avif",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "Rating: 4 out of 5 stars",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "ASCII Art of a cat face",
                    "role": [
                        "figure"
                    ]
                }
            ]
        }
    ]
}

chocolatkey avatar Oct 20 '25 12:10 chocolatkey

Notes:

  • The following HTML --> SSML logic now takes place: <em> and <b> are turned into <emphasis>. <i> becomes <emphasis level="reduced">. <strong> becomes <emphasis level="strong">. <br> becomes <break>. Any change in language in the document becomes <lang xml:lang="xx">. Let me know if others are needed
  • In the example for "Title of Chapter", the roles are ["section", "chapter"]. The roles in the output above are just ["chapter"]. Since section is defined as more generic than chapter, this seems fine to me. It's only chapter because currently, if the element has an ARIA role, inferring a role from the actual HTML tag is skipped.
  • @HadrienGardeur What will we do about videos? There's audio/img/text ref but no video ref
  • noteref and pagebreak are WIP. I'm evaluating the best way to query and link things together in the tree, and whether a homegrown search will suffice or if we need goquery.
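
For reference, the tag mapping described in the first note could be sketched roughly as follows. This is only an illustration under assumptions: `ssmlFor`, `wrap` and `withLang` are hypothetical helper names, and the actual conversion walks a parsed HTML tree rather than working on isolated strings.

```go
package main

import (
	"fmt"
	"html"
)

// ssmlFor maps an HTML tag name to the SSML opening and closing tags listed
// in the notes. ok is false for tags that have no SSML equivalent.
// The name and signature are assumptions made for this sketch.
func ssmlFor(tag string) (open, end string, ok bool) {
	switch tag {
	case "em", "b":
		return "<emphasis>", "</emphasis>", true
	case "i":
		return `<emphasis level="reduced">`, "</emphasis>", true
	case "strong":
		return `<emphasis level="strong">`, "</emphasis>", true
	case "br":
		return "<break/>", "", true
	}
	return "", "", false
}

// wrap escapes text and wraps it in the SSML equivalent of tag, if any.
// Void elements such as <br> produce only the self-closing SSML tag.
func wrap(tag, text string) string {
	open, end, ok := ssmlFor(tag)
	if !ok {
		return html.EscapeString(text)
	}
	if end == "" {
		return open
	}
	return open + html.EscapeString(text) + end
}

// withLang marks already-converted SSML as being in another language.
func withLang(lang, inner string) string {
	return `<lang xml:lang="` + html.EscapeString(lang) + `">` + inner + "</lang>"
}

func main() {
	fmt.Println(wrap("i", "Simple paragraph"))
	fmt.Println("Just" + wrap("br", "") + "testing")
	fmt.Println(withLang("fr", "savoir faire"))
}
```

Unmapped elements such as <span> fall through to plain escaped text in this sketch, which matches the "useless elements" case in the sample input.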

chocolatkey avatar Oct 20 '25 12:10 chocolatkey

Looking better overall.

I still notice objects with just children in them when we don't match the HTML element to a role though: that's the case for <body> and <div> in this example.

Given the very large number of <div> or <span> in an ebook, it would be better if we could avoid this.
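
One possible way to do this is to hoist the children of role-less wrappers into their parent after the tree is built. The Go sketch below shows the idea; the `Node` type and the `flatten` function are hypothetical and deliberately simplified (a real node also carries properties such as imgref or description, which would need to count as "content" too).

```go
package main

import "fmt"

// Node is a hypothetical, simplified guided navigation node for this sketch.
type Node struct {
	Role     []string
	Text     string
	Children []Node
}

// flatten returns the nodes that should stand in for n: n itself (with
// flattened children) when it carries a role or text, otherwise just its
// flattened children, so that wrappers like <body> or <div> disappear.
func flatten(n Node) []Node {
	var kids []Node
	for _, c := range n.Children {
		kids = append(kids, flatten(c)...)
	}
	if len(n.Role) == 0 && n.Text == "" {
		return kids
	}
	n.Children = kids
	return []Node{n}
}

func main() {
	// body > div > p, plus a list next to the div.
	body := Node{Children: []Node{
		{Children: []Node{
			{Role: []string{"paragraph"}, Text: "And the next pagebreak is in the middle of a sentence."},
		}},
		{Role: []string{"list"}, Children: []Node{
			{Role: []string{"listItem"}, Text: "First item"},
		}},
	}}
	for _, n := range flatten(body) {
		fmt.Println(n.Role, len(n.Children))
	}
}
```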

The examples with an image in the middle of a sentence also make me wonder if we shouldn't have an approach similar to pagebreaks and notes, where we use a custom SSML tag instead of breaking up text into multiple objects.

This would apply to <img>, <audio> and <video>.

If we go back to this example:

<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>

The output should look like this:

{
  "role": ["paragraph"],
  "text": {
    "language": "fr",
    "ssml": "Paragraphe avec image: <readium:image id=\"image1\" />",
    "children": [
      {
        "role": ["image"],
        "id": "image1",
        "imgref": "src/image.jpg",
        "description": "A cool image"
      }
    ]
  }
}
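
As a rough illustration of how such output could be produced, the Go sketch below turns a sequence of text and image segments into one SSML string plus a list of referenced images. The `Segment` and `Image` types, the `buildSSML` function and the `imageN` identifier scheme are assumptions made for this example; only the <readium:image> placeholder comes from the proposal above.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Image is a hypothetical child object referenced from the SSML by its ID.
type Image struct {
	ID          string
	Imgref      string
	Description string
}

// Segment is either a run of text or an inline image (Image != nil).
type Segment struct {
	Text  string
	Image *Image
}

// buildSSML concatenates the segments into a single SSML string, replacing
// each image with a <readium:image> placeholder, and returns the images
// with generated IDs (image1, image2, ...) in document order.
func buildSSML(segs []Segment) (string, []Image) {
	var sb strings.Builder
	var images []Image
	for _, s := range segs {
		if s.Image == nil {
			sb.WriteString(html.EscapeString(s.Text))
			continue
		}
		img := *s.Image
		img.ID = fmt.Sprintf("image%d", len(images)+1)
		fmt.Fprintf(&sb, `<readium:image id="%s" />`, img.ID)
		images = append(images, img)
	}
	return sb.String(), images
}

func main() {
	ssml, images := buildSSML([]Segment{
		{Text: "Paragraphe avec image: "},
		{Image: &Image{Imgref: "src/image.jpg", Description: "A cool image"}},
	})
	fmt.Println(ssml)
	fmt.Printf("%+v\n", images)
}
```

Keeping the images in a separate list means the text stays a single object even when several images interrupt a sentence.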

HadrienGardeur avatar Oct 20 '25 14:10 HadrienGardeur

For further contextualization, I think that we should include textref in our top-level nodes at least.

For example, if we add body as a role:

{
  "role": ["body"],
  "textref": "chapter.xhtml",
  "children": []
}

To further help with an implementation optimized for search and/or highlighting, we could also go beyond that and provide this information per node with fragments such as:

  • ID (#identifier)
  • and/or CSS selectors (#css(.content:nth-child(2)))

For example a paragraph with par1 as its identifier:

{
  "role": ["paragraph"],
  "textref": "chapter.xhtml#par1"
}

HadrienGardeur avatar Oct 20 '25 15:10 HadrienGardeur

> The following HTML --> SSML logic now takes place: <em> and <b> are turned into <emphasis>. <i> becomes <emphasis level="reduced">. <strong> becomes <emphasis level="strong">. <br> becomes <break>. Any change in language in the document becomes <lang xml:lang="xx">. Let me know if others are needed

@GoobyTheBOI any thoughts on this based on your own work?

HadrienGardeur avatar Oct 20 '25 15:10 HadrienGardeur