php-dom-wrapper
Charset detection bug when decoding unformatted html...
This is a bit of an unusual edge case, but I have found that the library can misdetect the charset when the HTML content is on a single line, does not declare a charset explicitly, and contains a charset declaration somewhere else in the document, for example as a query parameter inside an href attribute.
I have put together an example, based on a document that caused us problems:
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta property="Content-Type" content="application/xhtml+xml"/> <title>News Story</title></head><body><section class="news-header"><p>By News Provider</p></section><section class="text-headlines"><h1>Lorem Ipsum</h1></section><section class="news-body"><p>Sed laoreet orci vel nunc imperdiet, non ultricies orci bibendum. Fusce mi elit, vehicula non lacinia eu, luctus sed lectus. Donec at finibus mauris, ut fringilla libero. Cras maximus lacus sit amet elementum imperdiet. Interdum et malesuada fames ac ante ipsum primis in faucibus. Proin pellentesque purus in arcu fermentum sagittis. <a href="http://example.com/ExternalLink?id=7104846651&rd=down&charset=UTF-8&affiliate_index=1234567&method=affiliate_data">Suspendisse nisi mi</a>, vulputate eu orci sed, aliquam interdum sem. In fringilla suscipit enim at scelerisque. Integer accumsan tortor aliquet, congue lorem id, sagittis velit. Pellentesque pulvinar lacus ac arcu cursus, vitae eleifend tortor pellentesque. Nunc at elementum risus, fringilla venenatis ante. Morbi maximus lacus non tincidunt tincidunt. Etiam venenatis mattis nisl, non vulputate felis accumsan eget. Duis vel varius libero.</p></section><body></html>
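I have not traced this through the library source, so the following is only a guess at the mechanism. If the charset is detected with a regular expression that starts at a meta tag and lazily scans forward to the first charset=, then on a single-line document the match can run out of the meta tag and into the link's query string, because `.` stops at newlines but not at `>`. The two patterns, the shortened markup and the `detectCharset()` helper below are all hypothetical and are not taken from php-dom-wrapper.

```php
<?php
// Hypothetical patterns for illustration only; the library's real detection
// code may look different.
// Loose: anything (except a newline) may sit between "<meta" and "charset=".
$loosePattern = '/<meta.*?charset=["\']?([\w-]+)/i';
// Stricter: the match is not allowed to leave the <meta ...> tag.
$strictPattern = '/<meta[^>]*?charset=["\']?([\w-]+)/i';

// Shortened single-line version of the example document.
$singleLine = '<head><meta property="Content-Type" content="application/xhtml+xml"/></head>'
    . '<body><a href="http://example.com/ExternalLink?id=1&charset=UTF-8&method=x">link</a></body>';

// The same markup with a line break after the head.
$multiLine = str_replace('</head>', "</head>\n", $singleLine);

// Returns the first captured charset name, or null when nothing matches.
function detectCharset(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

var_dump(detectCharset($loosePattern, $singleLine));  // string(5) "UTF-8", taken from the href
var_dump(detectCharset($loosePattern, $multiLine));   // NULL, the newline stops the scan
var_dump(detectCharset($strictPattern, $singleLine)); // NULL, the match cannot cross ">"
```

If that is what is happening, it would explain why the problem only shows up with unformatted HTML: any line break between the meta tag and the link is enough to hide it.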