hyphenateHtml messes up certain symbols
Hello, when I use this method on some HTML text, it breaks certain characters, but when I use hyphenateText it doesn't break these characters (though it obviously breaks the HTML). Here is an example text:
<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>
Here is how it comes out after I use this method on them:
<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>
Notice how the long dashes (—) come out garbled. (The hyphenation itself works fine; I just removed it to make the problem easier to see.)
This problem occurs because the input is only a partial HTML document with no charset declaration, so loadHTML cannot tell which encoding it uses and does not treat it as UTF-8.
A possible solution would be something like $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
Or this dirty, dirty hack: $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
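
For anyone who wants to try the second workaround in isolation, here is a minimal sketch. The helper name loadUtf8Fragment is made up for illustration and is not part of the library; it only assumes the standard DOM extension. I used the prefix approach here because the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper (not part of the library): parse a UTF-8 HTML
// fragment so that libxml does not have to guess the encoding.
function loadUtf8Fragment(string $html): DOMDocument
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // The XML-declaration prefix tells the parser the input is UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$html = '<p>last October — then newly transitioned</p>';
$dom = loadUtf8Fragment($html);

// Text read back from the DOM should contain the dash intact.
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

Note that the prefix stays in the DOM as a processing-instruction node, so it may need to be removed before serializing the document back to a string.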
P.S. The contents of <script> tags should probably be excluded from hyphenation by default.