Parsing of sentence start and end

Open sirati opened this issue 8 months ago • 2 comments

There seems to be an issue with how the start and end of sentences are parsed, which makes the TTS impossible to follow.

Example HTML that has the issue:

<div id="new_page_18">
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">1</span><br>
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">WHAT IS TECHNOLOGY (FROM AN ETHICAL</span><br>
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">POINT OF VIEW)?</span><br>
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">1.1 A Hut in the Black Forest</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">1.1 Around 100 years ago, in the Black Forest of Southern Germany, there</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">stood a small and simple three‐room cabin, to which an eccentric</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">&nbsp;philosopher would retire in order to escape from the modern world. From</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">1922 onward, he went there to work on philosophical texts about the nature</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">of “being,” and he felt deeply inspired by these surroundings. The</span><br>
</div>

This parses to:

stood a small and simple three‐room cabin, to which an eccentric.

philosopher would retire in order to escape from the modern world.  From.

1922 onward, he went there to work on philosophical texts about the nature.

of “being,” and he felt deeply inspired by these surroundings.  The.

[...]

rather than:

1.1 Around 100 years ago, in the Black Forest of Southern Germany, there stood a small and simple three‐room cabin, to which an eccentric philosopher would retire in order to escape from the modern world. From 1922 onward, he went there to work on philosophical texts about the nature of “being,” and he felt deeply inspired by these surroundings. The [...]

sirati avatar Apr 25 '25 18:04 sirati

You can see the challenge here. If the algorithm just concatenates all lines then you'll get:

1 WHAT IS TECHNOLOGY (FROM AN ETHICAL POINT OF VIEW)? 1.1 A Hut in the Black Forest 1.1 Around 100 years ago, in the Black Forest of Southern Germany, there stood a small...

The algorithm expects the page to be marked up properly, delineating headers and paragraphs with <h1> and <p> tags. When the page is not marked up that way, as in this case, the algorithm does not know how to handle it.
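
To illustrate why plain concatenation is not enough, here is a minimal sketch of one possible heuristic. It is not the extension's actual parser. It assumes each visual line has already been extracted from its span, and that bold text (font-weight:700 in the example) marks a heading line. The `Line` shape and the `mergeLines` name are hypothetical.

```typescript
// Hypothetical input shape: one entry per <span> line from the example HTML.
interface Line {
  text: string;
  bold: boolean; // assumption: font-weight:700 marks a heading line
}

// Merge consecutive lines of the same kind (bold or not) into one block,
// so that sentence splitting can run on whole blocks instead of visual lines.
function mergeLines(lines: Line[]): string[] {
  const blocks: string[] = [];
  let current: string[] = [];
  let currentBold: boolean | null = null;

  for (const line of lines) {
    // Treat non-breaking spaces (&nbsp;) as ordinary spaces, then trim.
    const text = line.text.replace(/\u00a0/g, " ").trim();
    if (text === "") continue;

    // A change in boldness is taken as a block boundary.
    if (currentBold !== null && line.bold !== currentBold) {
      blocks.push(current.join(" "));
      current = [];
    }
    current.push(text);
    currentBold = line.bold;
  }
  if (current.length > 0) blocks.push(current.join(" "));
  return blocks;
}

const blocks = mergeLines([
  { text: "1.1 A Hut in the Black Forest", bold: true },
  { text: "1.1 Around 100 years ago, in the Black Forest of Southern Germany, there", bold: false },
  { text: "stood a small and simple three‐room cabin, to which an eccentric", bold: false },
  { text: "\u00a0philosopher would retire in order to escape from the modern world. From", bold: false },
]);
console.log(blocks);
```

With this input the heading stays in its own block and the three body lines become a single block. The heuristic is fragile: consecutive heading lines are merged into one block as well, and a page that does not use bold for headings would defeat it entirely, which is why proper <h1> and <p> markup is the more reliable signal.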

Please also provide the URL of this website.

ken107 avatar Apr 26 '25 05:04 ken107

Manually adding <h1> and <p> tags did not help unless I also removed all <br> tags.
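
The workaround described above can be sketched as a preprocessing step on the raw HTML string. This is only an illustration and is not part of the extension. `stripLineBreaks` is a hypothetical helper, and the regular expression only targets simple <br>, <br/> and <br /> forms.

```typescript
// Replace every simple <br> variant with a space so that adjacent
// span lines are no longer separated by a forced line break.
function stripLineBreaks(html: string): string {
  return html.replace(/<br\s*\/?>/gi, " ");
}

const sample =
  "<p><span>inspired by these surroundings. The</span><br>\n" +
  "<span>next line of the paragraph</span><br></p>";

const cleaned = stripLineBreaks(sample);
console.log(cleaned);
```

Replacing each tag with a space, rather than deleting it, avoids gluing the last word of one line to the first word of the next.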

I cannot provide the source, as it is a website that displays manually typeset copyrighted content using lots of spans.

sirati avatar Apr 27 '25 12:04 sirati