phpSyllable

hyphenateHtml messes up certain symbols

Open Igor-Yavych opened this issue 6 years ago • 0 comments

Hello, when I use this method on some HTML text, it breaks certain characters, but when I use hyphenateText it doesn't break these characters (though it obviously breaks the HTML). Here is an example text:

<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>

Here is how it comes out after I run hyphenateHtml on it:

<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>

Notice how the long dashes (—) come out garbled. (The text is hyphenated fine; I just removed the hyphenation to make the problem easier to see.) The problem is caused by the input being only partial HTML, so loadHTML cannot tell which encoding it is in. A possible solution would be something like:

    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
    // Or this dirty, dirty hack
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
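To make the encoding problem above easier to reproduce outside the library, here is a minimal sketch using plain DOMDocument. It assumes the fragment is UTF-8. The helper name `roundTripFragment` and its `$encodingHint` parameter are hypothetical and not part of phpSyllable. The meta charset hint is an extra assumption of mine, not something from the suggestion above.

```php
<?php
// Hypothetical helper (not part of phpSyllable): parse a UTF-8 HTML fragment
// with DOMDocument and serialize the <body> children back to a string.
// $encodingHint is prepended to the fragment so libxml can pick the encoding.
function roundTripFragment(string $html, string $encodingHint = ''): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($encodingHint . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize node by node so the <html>/<body> wrapper added by libxml
    // is not included in the result.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$fragment = '<p>last October — then newly transitioned</p>';

// No hint: the situation described in this issue.
$withoutHint = roundTripFragment($fragment);

// Hint taken from the suggestion in this issue.
$withXmlHint = roundTripFragment($fragment, '<?xml encoding="UTF-8">');

// Another possible hint (an assumption, not from the issue): a meta charset.
$withMetaHint = roundTripFragment(
    $fragment,
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
);
```

Comparing `$withoutHint` against the hinted results on your own setup should show whether the dashes survive; the exact behavior may depend on the PHP and libxml versions installed. Note also that passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2, so the first suggestion may not be a long-term option.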

P.S. <script> tag should probably be excluded from hyphenation by default.
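As a rough illustration of what excluding `<script>` could look like, the sketch below selects only the text nodes that are not inside excluded elements, using DOMXPath. This is not how phpSyllable does it internally; the function name `hyphenatableTextNodes` and the default exclusion list are hypothetical.

```php
<?php
// Hypothetical helper (not part of phpSyllable): return the text nodes of a
// parsed document that are not nested inside any of the excluded elements.
function hyphenatableTextNodes(DOMDocument $dom, array $excluded = ['script', 'style']): array
{
    $xpath = new DOMXPath($dom);

    // Build one "not(ancestor::tag)" condition per excluded element name.
    $conditions = [];
    foreach ($excluded as $tag) {
        $conditions[] = 'not(ancestor::' . $tag . ')';
    }

    $query = '//text()';
    if ($conditions) {
        $query .= '[' . implode(' and ', $conditions) . ']';
    }

    // Convert the DOMNodeList into a plain, zero-indexed array.
    return iterator_to_array($xpath->query($query), false);
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hello world</p><script>var hyphenation = 1;</script>');

foreach (hyphenatableTextNodes($dom) as $node) {
    // Only these nodes would be passed to the hyphenator.
    echo $node->nodeValue, PHP_EOL;
}
```

The element names are passed in as a list so that other tags (for example `pre` or `code`) could be excluded the same way.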

Igor-Yavych, May 19 '19 17:05