phpSyllable icon indicating copy to clipboard operation
phpSyllable copied to clipboard

Avoid wrapping partial HTML when hyphenated

Open gruhn opened this issue 6 years ago • 8 comments

When only partial HTML should be hyphenated (instead of full documents), DOCTYPE and html/body tags are automatically added. This can be avoided by passing the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags to loadHTML.

I need this for a project where I independently hyphenate HTML chunks. Those are assembled to a full document which in turn needs to be parsed. The additional DOCTYPE, html/body tags everywhere make the HTML hard to parse.

With this change it works fine in my use case. I don't know about others though.

gruhn avatar Jul 20 '18 14:07 gruhn

Will this also solve the problem with UTF-8 characters like öäü within the #text part of an Node?

Just asking because I did something janky with XML UTF8 Doctyping and getting the content of the resulting body Node in my usage of the code for partial HTML Documents because they were ruined because of the Dom Parser not really getting into this UTF8 encoding/decoding stuff.

Blackskyliner avatar Jul 20 '18 19:07 Blackskyliner

Unfortunately not, I additionaly used the mb_convert_encoding workaround mentioned here. This workaround can be applied without having to modify PhpSyllable itself. So I build upon the v1.4.7 commit to fix my specific issue with the latest release with minimal changes.

I noticed there are encoding related commits beyond the latest release so I simply assumed you solved this problem already for future releases.

gruhn avatar Jul 21 '18 00:07 gruhn

Yep I do the first one they mention there... Prepending the XML stuff... But I don't know I this should always be done/treated by the callee or if phpSyllable as library should just detect a missing doctype declaration.

But I guess the discussion relates to this whole issue/pull request a bit because LIBXML_HTML_NODEFDTD is all about the doctype in general (or at least the doctype definition DTD)

So I guess we should sort that out beforehand.

Suggestion: We could implement that conversion through a test if encoding was activated through setEncoding and if our self::$encoding is true we mb_convert_encoding the given text.

(Oh and I am not the maintainer of this but to get this pull request mergred you will have to rebase your patch to the latest master, fix the conflict and force push the result to your branch)

Blackskyliner avatar Jul 21 '18 08:07 Blackskyliner

Hi, I'm threes author and maintainer of this. Could you please update your change for the current master so i can merge it? I've not yet updated the version tagas there were (are still? ) quite a few changes coming in recently.

vanderlee avatar Jul 22 '18 18:07 vanderlee

Done. Also updated tests involving hyphenateHtml.

How come hyphenateHtml still wraps input with <p> elements though? I couldn't find out.

gruhn avatar Jul 22 '18 19:07 gruhn

Your test changes indicate a change in default behaviour and are a major BC!

I would like to rather see the old tests pass and have a new optional argument to hyphenateHtml which is per default set to the old behaviour -- or as distinct setting like setEncoding() which handles the optional encoding stuff.

Otherwise this change could be a major BC for ppls. who already have code in place to work around the old behaviour.

Also add new tests for the new behaviour :wink: -- Rule of thumb, if you are changing tests you break behaviour. Tests should only grow on changes to an stable api and only shrink/fluctuate on removals or major overhauls.

Blackskyliner avatar Jul 23 '18 05:07 Blackskyliner

@Blackskyliner hmm... right. What would be your go-to suggestion for implementing those changes then? Introduce the new argument? Or a new function (like hyphenateHtmlSnippet)? Or something different?

gruhn avatar Jul 23 '18 09:07 gruhn

See my 2nd paragraph. Either through optional non breaking argument to hyphenateHtml or an toggle function I would incorporate your guess into mine setHtmlSnippetMode() whereas this function would set an inner variable, just look how setEncoding() on current master does it.

I would tend to - as it seems to be the default with any other addition as far as i see it - the last option, creating a setHtmlSnippetMode() to toggle that behaviour on or off. Maybe we generalize it to setLibXmlParameters as this it what we really do here.

The last word, or better the decision on that, should be from @vanderlee how to do it for now.

In hypenateHtml you then just check for the variable if it is set or not, default is false or null to keep old behaviour the default one.

It doesn't have to be most beautiful/generalized for the time being, because when we refactor later on we should keep partial HTML hyphening in mind and design a better interface to use while refactoring.

Blackskyliner avatar Jul 23 '18 11:07 Blackskyliner

This partial HTML issue should have been addressed with PR #72 . Use the new Syllable::hyphenateHtmlText() which handles UTF-8 just fine - without adding any additional DTD and HTML.

alexander-nitsche avatar May 15 '23 06:05 alexander-nitsche