go-toolkit WIP HTML -> guided navigation conversion

Work in progress. Given the following input:

<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
	<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
	<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
	<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>

	<div>
	<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
	<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
	</div>


	<section role="doc-chapter" epub:type="chapter">
		<h1>Title of the chapter</h1>
	</section>
	<ul>
		<li>First item</li>
		<li>Second item</li>
		<li>Third item</li>
	</ul>
	<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>

	<img src="image1.avif" alt="Alternative text using the alt attribute">
	<span role="img" aria-label="Rating: 4 out of 5 stars">
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>☆</span>
	</span>
	<figure aria-labelledby="cat-caption"> 
		<pre>
			/\_/\
		( o.o )
				 ^ 
		</pre>
		<figcaption id="cat-caption">
		ASCII Art of a cat face
		</figcaption>
	</figure>
</body>
</html>

The following guided navigation document is generated:

{
    "guided": [
        {
            "children": [
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image: "
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "This job requires a certain "
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "savoir faire"
                            }
                        },
                        {
                            "text": " that can only be acquired over time."
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "This is a paragraph with some very-strong bold text!"
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "And the next pagebreak is in the middle of a sentence."
                                }
                            ],
                            "role": [
                                "paragraph"
                            ]
                        }
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "Title of the chapter"
                                }
                            ],
                            "role": [
                                "heading"
                            ]
                        }
                    ],
                    "role": [
                        "chapter"
                    ]
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "text": "First item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        },
                        {
                            "children": [
                                {
                                    "text": "Second item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        },
                        {
                            "children": [
                                {
                                    "text": "Third item"
                                }
                            ],
                            "role": [
                                "listItem"
                            ]
                        }
                    ],
                    "role": [
                        "list"
                    ]
                },
                {
                    "children": [
                        {
                            "imgref": "with_image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "description": "Alternative text using the alt attribute",
                    "imgref": "image1.avif",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "Rating: 4 out of 5 stars",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "ASCII Art of a cat face",
                    "role": [
                        "figure"
                    ]
                }
            ]
        }
    ]
}

Sep 16 '25 12:09 chocolatkey

Looking at the results, here are a few early comments:

We shouldn't split text into multiple elements when we encounter another language, as we did with the Content Iterator. Instead, we should use SSML on the text and indicate language changes that way.
SSML should also handle emphasis, which would cover at least <em> and <strong>, but probably <b> and <i> as well.
We seem to use too many children everywhere. For example, the <h1> element should result in a single object with a role (heading), a level (currently missing) and a text.
Support for pagebreaks seems to be missing, whether they're on their own or within another element (which would require SSML).
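
To make the "single object" shape for the heading concrete, here is a minimal Go sketch. The type and field names (Node, Text, encode) are hypothetical and are not taken from go-toolkit; only the JSON keys mirror the output above. It assumes that "text" is emitted as a bare string when there is neither a language nor SSML, and as an object otherwise.

```go
package main

import "encoding/json"

// Text is a hypothetical model of the "text" property: a plain string,
// optionally accompanied by a language and/or an SSML rendition.
type Text struct {
	Plain    string
	SSML     string
	Language string
}

// MarshalJSON emits a bare JSON string when only plain text is set, and an
// object otherwise, matching the two shapes seen in the generated output.
func (t Text) MarshalJSON() ([]byte, error) {
	if t.SSML == "" && t.Language == "" {
		return json.Marshal(t.Plain)
	}
	obj := struct {
		Language string `json:"language,omitempty"`
		Plain    string `json:"plain,omitempty"`
		SSML     string `json:"ssml,omitempty"`
	}{t.Language, t.Plain, t.SSML}
	return json.Marshal(obj)
}

// Node is a hypothetical guided navigation object. A heading carries its
// role, level and text directly instead of wrapping the text in children.
type Node struct {
	Role     []string `json:"role,omitempty"`
	Level    int      `json:"level,omitempty"`
	Text     *Text    `json:"text,omitempty"`
	Children []Node   `json:"children,omitempty"`
}

// encode renders a node as compact JSON.
func encode(n Node) (string, error) {
	b, err := json.Marshal(n)
	return string(b), err
}
```

With this sketch, encode(Node{Role: []string{"heading"}, Level: 1, Text: &Text{Plain: "Title of the chapter"}}) yields a single object with role, level and text, and no children array.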

Sep 16 '25 14:09 HadrienGardeur

Updated input:

<!doctype html>
<html xmlns:epub="http://www.idpf.org/2007/ops"><!-- lang="en" xml:lang="en" -->
<body>
	<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>
	<p xml:lang="fr">Paragraphe avec image #1 <img src="src/image.jpg" alt="A cool image" /> et #2 <img src="src/image.jpg" alt="A second cool image" />!</p>
	<p xml:lang="fr"><img src="src/image.jpg" alt="The coolest image" /> et <img src="src/image.jpg" alt="The boring image" /></p>
	<p>A paragraph with: <img src="src/image.jpg" alt="A cool image" /><em xml:lang="fr">est cool!</em></p>
	<p><i>Simple paragraph</i></p>
	<p>This job requires a certain <em xml:lang="fr">savoir faire</em> that can only be acquired over time.</p>
	<p>This is a paragraph <b>with some very-<em>strong</em> bold</b> text!</p>
	<p>Just<br />testing<br>some<br /> breaks! And useless <span>elements</span>...</p>

	<div>
	<span id="pg04" role="doc-pagebreak" epub:type="pagebreak" title="4"/>
	<p>And the next pagebreak is in the middle <span id="pg05" role="doc-pagebreak" epub:type="pagebreak" title="4"/> of a sentence.</p>
	</div>


	<section role="doc-chapter" epub:type="chapter">
		<h1>Title of the chapter</h1>
	</section>
	<ul>
		<li>First item</li>
		<li>Second item</li>
		<li>Third item</li>
	</ul>
	<p aria-hidden="true">Hidden <b>text!</b> <img src="with_image.jpg" />...</p>
	<p aria-hidden="true">More Hidden text</p>
	<p aria-hidden="true">More Hidden text</p>

	<img src="image1.avif" alt="Alternative text using the alt attribute">
	<span role="img" aria-label="Rating: 4 out of 5 stars">
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>★</span>
		<span>☆</span>
	</span>
	<figure aria-labelledby="cat-caption"> 
		<pre>
			/\_/\
		( o.o )
			^ 
		</pre>
		<figcaption id="cat-caption">
		ASCII Art of a cat face
		</figcaption>
	</figure>
</body>
</html>

Output:

{
    "guided": [
        {
            "children": [
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image:"
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": {
                                "language": "fr",
                                "plain": "Paragraphe avec image #1"
                            }
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "et #2"
                            }
                        },
                        {
                            "description": "A second cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "!"
                            }
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "description": "The coolest image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "language": "fr",
                                "plain": "et"
                            }
                        },
                        {
                            "description": "The boring image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "children": [
                        {
                            "text": "A paragraph with:"
                        },
                        {
                            "description": "A cool image",
                            "imgref": "src/image.jpg",
                            "role": [
                                "image"
                            ]
                        },
                        {
                            "text": {
                                "ssml": "<emphasis xml:lang=\"fr\">est cool!</emphasis>"
                            }
                        }
                    ],
                    "role": [
                        "paragraph"
                    ]
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis level=\"reduced\">Simple paragraph</emphasis>"
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis>This job requires a certain </emphasis><lang xml:lang=\"fr\">savoir faire</lang>  that can only be acquired over time."
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "<emphasis>This is a paragraph </emphasis><emphasis>with some very-</emphasis><emphasis>strong</emphasis>  bold text!"
                    }
                },
                {
                    "role": [
                        "paragraph"
                    ],
                    "text": {
                        "ssml": "Just<break/>testing<break/>some<break/> breaks! And useless elements..."
                    }
                },
                {
                    "children": [
                        {
                            "children": [
                                {
                                    "role": [
                                        "paragraph"
                                    ],
                                    "text": "And the next pagebreak is in the middle of a sentence."
                                }
                            ],
                            "role": [
                                "pagebreak"
                            ]
                        }
                    ]
                },
                {
                    "children": [
                        {
                            "level": 1,
                            "role": [
                                "heading"
                            ],
                            "text": "Title of the chapter"
                        }
                    ],
                    "role": [
                        "chapter"
                    ]
                },
                {
                    "children": [
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "First item"
                        },
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "Second item"
                        },
                        {
                            "role": [
                                "listItem"
                            ],
                            "text": "Third item"
                        }
                    ],
                    "role": [
                        "list"
                    ]
                },
                {
                    "description": "Alternative text using the alt attribute",
                    "imgref": "image1.avif",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "Rating: 4 out of 5 stars",
                    "role": [
                        "image"
                    ]
                },
                {
                    "description": "ASCII Art of a cat face",
                    "role": [
                        "figure"
                    ]
                }
            ]
        }
    ]
}

Oct 20 '25 12:10 chocolatkey

Notes:

The following HTML --> SSML logic now takes place: <em> and <b> are turned into <emphasis>. <i> becomes <emphasis level="reduced">. <strong> becomes <emphasis level="strong">. <br> becomes <break>. Any change in language in the document becomes <lang xml:lang="xx">. Let me know if others are needed.
In the example for "Title of the chapter", the roles are ["section", "chapter"], whereas the roles in the output above are just ["chapter"]. Since section is defined as more generic than chapter, this seems fine to me. It's only chapter because, currently, when an element has an ARIA role, inferring a role from the HTML tag itself is skipped.
@HadrienGardeur What will we do about videos? There's audio/img/text ref but no video ref
noteref and pagebreak are WIP. I'm evaluating the best way to query and link things together in the tree, and whether a homegrown search will suffice or if we need goquery.
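
For discussion, the inline HTML --> SSML step can be sketched in Go as below. This is not the go-toolkit implementation. The tag table is an assumption: <em>, <i> and <br> follow what the output above shows, while the entries for <b> and <strong> are guesses. The sketch also assumes a well-formed XHTML fragment (so <br/> rather than <br>), drops unknown inline elements such as <span> while keeping their text, and nests <lang> inside <emphasis> when an element has both, which is a choice of this sketch only.

```go
package main

import (
	"encoding/xml"
	"io"
	"strings"
)

// ssmlOpen maps inline HTML tags to an SSML opening tag (assumed table).
var ssmlOpen = map[string]string{
	"em":     `<emphasis>`,
	"b":      `<emphasis>`,
	"i":      `<emphasis level="reduced">`,
	"strong": `<emphasis level="strong">`,
}

// toSSML converts a well-formed XHTML inline fragment to an SSML string.
func toSSML(fragment string) (string, error) {
	// Wrap in a synthetic root so mixed text and elements parse as one document.
	dec := xml.NewDecoder(strings.NewReader("<root>" + fragment + "</root>"))
	var out strings.Builder
	var closers []string // closing markup for each currently open element
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			closing := ""
			if t.Name.Local == "br" {
				out.WriteString("<break/>")
			} else if open, ok := ssmlOpen[t.Name.Local]; ok {
				out.WriteString(open)
				closing = "</emphasis>"
			}
			// Both lang and xml:lang have the local name "lang".
			for _, a := range t.Attr {
				if a.Name.Local == "lang" {
					out.WriteString(`<lang xml:lang="`)
					xml.EscapeText(&out, []byte(a.Value))
					out.WriteString(`">`)
					closing = "</lang>" + closing
					break
				}
			}
			closers = append(closers, closing)
		case xml.EndElement:
			// Emit whatever the matching start element opened (possibly nothing).
			out.WriteString(closers[len(closers)-1])
			closers = closers[:len(closers)-1]
		case xml.CharData:
			xml.EscapeText(&out, t)
		}
	}
}
```

Because each element's closing markup is pushed on a stack when it opens, nested cases such as <b>with some very-<em>strong</em> bold</b> produce properly nested <emphasis> elements rather than a sequence of sibling ones.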

Oct 20 '25 12:10 chocolatkey

Looking better overall.

I still notice objects with just children in them when we don't match the HTML element to a role though: that's the case for <body> and <div> in this example.

Given the very large number of <div> or <span> elements in an ebook, it would be better if we could avoid this.
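
One possible way to avoid these wrapper objects is a post-processing pass that hoists the children of any object with no role and no text into its parent. A minimal Go sketch, with a hypothetical, simplified Node type (not the toolkit's own):

```go
package main

// Node is a simplified, hypothetical guided navigation object.
type Node struct {
	Role     []string
	Text     string
	Children []Node
}

// flatten returns the nodes with every role-less, text-less container
// replaced by its own (recursively flattened) children. Empty containers
// disappear entirely.
func flatten(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		n.Children = flatten(n.Children)
		if len(n.Role) == 0 && n.Text == "" {
			// Pure wrapper such as <body> or <div>: keep only its children.
			out = append(out, n.Children...)
			continue
		}
		out = append(out, n)
	}
	return out
}
```

Applied to the output above, the objects generated for <body> and <div> would vanish and their paragraphs would move up a level, while the chapter object keeps its heading as a child.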

The examples with an image in the middle of a sentence also make me wonder if we shouldn't have an approach similar to pagebreaks and notes, where we use a custom SSML tag instead of breaking up text into multiple objects.

This would apply to <img>, <audio> and <video>.

If we go back to this example:

<p xml:lang="fr">Paragraphe avec image: <img src="src/image.jpg" alt="A cool image" /></p>

The output should look like this:

{
  "role": ["paragraph"],
  "text": {
    "language": "fr",
    "ssml": "Paragraphe avec image: <readium:image id=\"image1\" />",
    "children": [
      {
        "role": ["image"],
        "id": "image1",
        "imgref": "src/image.jpg",
        "description": "A cool image"
      }
    ]
  }
}
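
A rough Go sketch of how such an object could be assembled is shown below. The <readium:image> tag and the "image1" identifier come from the example above; the Segment and Image types, the inlineImages function and the sequential imageN numbering are assumptions made for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Image is a hypothetical child object referenced from the SSML by its id.
type Image struct {
	Role        []string `json:"role"`
	ID          string   `json:"id"`
	ImgRef      string   `json:"imgref"`
	Description string   `json:"description,omitempty"`
}

// Segment is one piece of inline content: plain text, or an image when Src
// is non-empty.
type Segment struct {
	Text string
	Src  string
	Alt  string
}

// inlineImages builds a single SSML string in which each image is replaced
// by a <readium:image> placeholder, and returns the referenced images.
func inlineImages(segments []Segment) (string, []Image) {
	var ssml strings.Builder
	var images []Image
	for _, s := range segments {
		if s.Src == "" {
			xml.EscapeText(&ssml, []byte(s.Text))
			continue
		}
		// Assumed numbering scheme: image1, image2, ... in document order.
		id := fmt.Sprintf("image%d", len(images)+1)
		fmt.Fprintf(&ssml, `<readium:image id="%s" />`, id)
		images = append(images, Image{
			Role:        []string{"image"},
			ID:          id,
			ImgRef:      s.Src,
			Description: s.Alt,
		})
	}
	return ssml.String(), images
}
```

For the French paragraph above, this returns the SSML string "Paragraphe avec image: <readium:image id="image1" />" together with one image object, so the text is no longer split around the image.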

Oct 20 '25 14:10 HadrienGardeur

For further contextualization, I think that we should include textref in our top-level nodes at least.

For example, if we add body as a role:

{
  "role": ["body"],
  "textref": "chapter.xhtml",
  "children": []
}

To further help with an implementation optimized for search and/or highlighting, we could also go beyond that and provide this information per node with fragments such as:

ID (#identifier)
and/or CSS selectors (#css(.content:nth-child(2)))

For example a paragraph with par1 as its identifier:

{
  "role": ["paragraph"],
  "textref": "chapter.xhtml#par1"
}
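
As a small illustration, the two fragment forms could be produced by a helper like the one below. The function name and the rule that an identifier takes precedence over a CSS selector are assumptions of this sketch, and percent-encoding of the fragment is deliberately left out.

```go
package main

// textRef builds a textref for a node. Both id and cssSelector are optional.
// When both are given, the id wins (an assumption of this sketch).
func textRef(href, id, cssSelector string) string {
	switch {
	case id != "":
		return href + "#" + id
	case cssSelector != "":
		return href + "#css(" + cssSelector + ")"
	}
	// Top-level nodes such as body only point at the resource itself.
	return href
}
```

For example, textRef("chapter.xhtml", "par1", "") gives "chapter.xhtml#par1", and passing only the selector .content:nth-child(2) gives "chapter.xhtml#css(.content:nth-child(2))".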

Oct 20 '25 15:10 HadrienGardeur

The following HTML --> SSML logic now takes place: <em> and <b> are turned into <emphasis>. <i> becomes <emphasis level="reduced">. <strong> becomes <emphasis level="strong">. <br> becomes <break>. Any change in language in the document becomes <lang xml:lang="xx">. Let me know if others are needed.

@GoobyTheBOI any thoughts on this based on your own work?

Oct 20 '25 15:10 HadrienGardeur