phpSyllable
phpSyllable copied to clipboard
Avoid wrapping partial HTML when hyphenated
When only partial HTML should be hyphenated (instead of full documents), DOCTYPE and html/body tags are automatically added. This can be avoided by passing the LIBXML_HTML_NOIMPLIED
and LIBXML_HTML_NODEFDTD
flags to loadHTML
.
I need this for a project where I independently hyphenate HTML chunks. Those are assembled to a full document which in turn needs to be parsed. The additional DOCTYPE, html/body tags everywhere make the HTML hard to parse.
With this change it works fine in my use case. I don't know about others though.
Will this also solve the problem with UTF-8 characters like öäü within the #text part of an Node?
Just asking because I did something janky with XML UTF8 Doctyping and getting the content of the resulting body Node in my usage of the code for partial HTML Documents because they were ruined because of the Dom Parser not really getting into this UTF8 encoding/decoding stuff.
Unfortunately not, I additionaly used the mb_convert_encoding
workaround mentioned here. This workaround can be applied without having to modify PhpSyllable itself. So I build upon the v1.4.7 commit to fix my specific issue with the latest release with minimal changes.
I noticed there are encoding related commits beyond the latest release so I simply assumed you solved this problem already for future releases.
Yep I do the first one they mention there... Prepending the XML stuff... But I don't know I this should always be done/treated by the callee or if phpSyllable as library should just detect a missing doctype declaration.
But I guess the discussion relates to this whole issue/pull request a bit because LIBXML_HTML_NODEFDTD
is all about the doctype in general (or at least the doctype definition DTD)
So I guess we should sort that out beforehand.
Suggestion: We could implement that conversion through a test if encoding was activated through setEncoding
and if our self::$encoding
is true we mb_convert_encoding
the given text.
(Oh and I am not the maintainer of this but to get this pull request mergred you will have to rebase your patch to the latest master, fix the conflict and force push the result to your branch)
Hi, I'm threes author and maintainer of this. Could you please update your change for the current master so i can merge it? I've not yet updated the version tagas there were (are still? ) quite a few changes coming in recently.
Done. Also updated tests involving hyphenateHtml
.
How come hyphenateHtml
still wraps input with <p>
elements though? I couldn't find out.
Your test changes indicate a change in default behaviour and are a major BC!
I would like to rather see the old tests pass and have a new optional argument to hyphenateHtml
which is per default set to the old behaviour -- or as distinct setting like setEncoding()
which handles the optional encoding stuff.
Otherwise this change could be a major BC for ppls. who already have code in place to work around the old behaviour.
Also add new tests for the new behaviour :wink: -- Rule of thumb, if you are changing tests you break behaviour. Tests should only grow on changes to an stable api and only shrink/fluctuate on removals or major overhauls.
@Blackskyliner hmm... right. What would be your go-to suggestion for implementing those changes then? Introduce the new argument? Or a new function (like hyphenateHtmlSnippet
)? Or something different?
See my 2nd paragraph. Either through optional non breaking argument to hyphenateHtml
or an toggle function I would incorporate your guess into mine setHtmlSnippetMode()
whereas this function would set an inner variable, just look how setEncoding()
on current master does it.
I would tend to - as it seems to be the default with any other addition as far as i see it - the last option, creating a setHtmlSnippetMode()
to toggle that behaviour on or off. Maybe we generalize it to setLibXmlParameters
as this it what we really do here.
The last word, or better the decision on that, should be from @vanderlee how to do it for now.
In hypenateHtml
you then just check for the variable if it is set or not, default is false
or null
to keep old behaviour the default one.
It doesn't have to be most beautiful/generalized for the time being, because when we refactor later on we should keep partial HTML hyphening in mind and design a better interface to use while refactoring.
This partial HTML issue should have been addressed with PR #72 . Use the new Syllable::hyphenateHtmlText()
which handles UTF-8 just fine - without adding any additional DTD and HTML.