Parsing of sentence start and end
There seems to be an issue with detecting where sentences start and end, which makes the TTS impossible to follow.
Example HTML that has the issue:
<div id="new_page_18">
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">1</span><br>
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">WHAT IS TECHNOLOGY (FROM AN ETHICAL</span><br>
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">POINT OF VIEW)?</span><br>
<span style="font-size:1.2149em; font-family:ff16, Arial, Arial, Helvetica, sans-serif; font-weight:700; font-style:normal; text-decoration:none;">1.1 A Hut in the Black Forest</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">1.1 Around 100 years ago, in the Black Forest of Southern Germany, there</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">stood a small and simple three‐room cabin, to which an eccentric</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;"> philosopher would retire in order to escape from the modern world. From</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">1922 onward, he went there to work on philosophical texts about the nature</span><br>
<span style="font-size:1.0000em; font-family:ff14, Times New Roman, Times, serif; font-weight:400; font-style:normal; text-decoration:none;">of “being,” and he felt deeply inspired by these surroundings. The</span><br>
</div>
This parses to:
stood a small and simple three‐room cabin, to which an eccentric.
philosopher would retire in order to escape from the modern world. From.
1922 onward, he went there to work on philosophical texts about the nature.
of “being,” and he felt deeply inspired by these surroundings. The.
[...]
rather than:
1.1 Around 100 years ago, in the Black Forest of Southern Germany, there stood a small and simple three‐room cabin, to which an eccentric philosopher would retire in order to escape from the modern world. From 1922 onward, he went there to work on philosophical texts about the nature of “being,” and he felt deeply inspired by these surroundings. The [...]
You can see the challenge here. If the algorithm just concatenates all lines then you'll get:
1 WHAT IS TECHNOLOGY (FROM AN ETHICAL POINT OF VIEW)? 1.1 A Hut in the Black Forest 1.1 Around 100 years ago, in the Black Forest of Southern Germany, there stood a small...
The algorithm expects the page to be marked up properly, delineating headers and paragraphs with <h1> and <p> tags. When they're not marked up properly, like in this case, it will not know how to handle it.
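For pages like this one, where the only structural hint is the inline style on each span, one possible heuristic is to treat consecutive spans with an identical `style` attribute as one block and join their text with spaces. The following is only a rough sketch in Python using the standard library, not how the existing algorithm works; `SpanCollector` and `group_lines_by_style` are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser
from itertools import groupby


class SpanCollector(HTMLParser):
    """Collect a (style, text) pair for every non-empty <span>."""

    def __init__(self):
        super().__init__()
        self.spans = []      # collected (style, text) pairs, in document order
        self._style = None   # style of the span currently open, or None
        self._text = []      # text chunks seen inside the open span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._style = dict(attrs).get("style") or ""
            self._text = []

    def handle_data(self, data):
        if self._style is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._style is not None:
            # Collapse runs of whitespace and trim the ends.
            text = " ".join("".join(self._text).split())
            if text:
                self.spans.append((self._style, text))
            self._style = None


def group_lines_by_style(html):
    """Join consecutive spans that share the same style into one block."""
    parser = SpanCollector()
    parser.feed(html)
    parser.close()
    return [
        " ".join(text for _, text in run)
        for _, run in groupby(parser.spans, key=lambda span: span[0])
    ]
```

On the sample above this would separate the bold heading lines from the body text, but all four heading lines share one style, so they would still be merged into a single block. It also assumes that a change of style always means a change of block, which will not hold for inline emphasis.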
Please also provide the URL of this website.
Manually adding <h1> and <p> tags did not help unless I also removed all the <br> tags.
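The <br> removal described above can be sketched as a preprocessing step that turns every line break tag into a space before the markup reaches the parser. This is a minimal sketch in Python, assuming the HTML is available as a string; `strip_line_breaks` is a hypothetical helper, not part of any existing tool.

```python
import re

# Matches <br>, <br/>, <br /> and uppercase variants.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_line_breaks(html):
    """Replace every <br> tag with a single space."""
    return BR_TAG.sub(" ", html)
```

This only helps once the headers and paragraphs are already wrapped in <h1> and <p>, since it removes the line breaks without adding any structure of its own.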
I cannot provide the source, as it is a website that displays manually typeset copyrighted content using lots of spans.