table-of-contents-plus
preg_match_all returns false with PREG_BAD_UTF8_ERROR (4)
preg_match_all with the u modifier returns false when there are invalid UTF-8 characters in the pattern or subject. Matching stops immediately, so there are no matches and hence no TOC for the page. In every case seen so far, the cause has been bad characters in the subject.
Is there a way to suppress the error and continue regardless? Is there a WordPress core function that could be used to filter the_content? Why isn't it failing for other WordPress core features, considering it is the_content after all?
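For anyone wanting to reproduce this outside WordPress, here is a minimal sketch. The sample string and the heading pattern are made up for illustration and are not the plugin's actual code.

```php
<?php
// Hypothetical sample: "\xE9" is a lone Latin-1 byte, which is not valid UTF-8.
$content = "<h2>Caf\xE9 menu</h2>";

// With the u modifier the subject is validated first, so matching never starts.
$result = preg_match_all( '/<h[1-6][^>]*>(.*?)<\/h[1-6]>/iu', $content, $matches );

if ( false === $result && PREG_BAD_UTF8_ERROR === preg_last_error() ) {
    echo "Matching aborted: subject is not valid UTF-8\n";
}
```

Checking preg_last_error() right after the call is what distinguishes "no headings" from "matching never ran".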
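One possible way to "continue regardless" is to scrub the subject before matching. The sketch below assumes the mbstring extension is available. The function name toc_scrub_utf8 and the heading pattern are hypothetical, not part of the plugin. WordPress core also ships wp_check_invalid_utf8(), which may be worth evaluating for the same purpose.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): return $content with any
 * invalid UTF-8 byte sequences replaced, leaving valid input untouched.
 */
function toc_scrub_utf8( $content ) {
    if ( mb_check_encoding( $content, 'UTF-8' ) ) {
        return $content; // Common case: nothing to do.
    }
    // Converting UTF-8 to UTF-8 substitutes each invalid sequence with
    // mbstring's substitute character.
    return mb_convert_encoding( $content, 'UTF-8', 'UTF-8' );
}

// Illustrative input containing one invalid byte.
$content = toc_scrub_utf8( "<h2>Caf\xE9 menu</h2>" );
$found   = preg_match_all( '/<h[1-6][^>]*>(.*?)<\/h[1-6]>/iu', $content, $matches );
```

The trade-off is that the bad bytes are altered rather than preserved, so the heading text in the TOC may differ slightly from what the page shows.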
In extract_headings, you can test the subject during debugging with:
var_dump( mb_check_encoding( $content, 'UTF-8' ) ); // echo prints nothing for false, so use var_dump
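To narrow down where the bad bytes are, a per-line check may help. This is only a debugging sketch. The function name toc_find_bad_utf8_lines is hypothetical and the sample content is invented.

```php
<?php
/**
 * Hypothetical debugging helper: list the 1-based line numbers of $content
 * that contain invalid UTF-8.
 */
function toc_find_bad_utf8_lines( $content ) {
    $bad = array();
    // Splitting on "\n" is safe: that byte never occurs inside a
    // multi-byte UTF-8 character.
    foreach ( explode( "\n", $content ) as $index => $line ) {
        if ( ! mb_check_encoding( $line, 'UTF-8' ) ) {
            $bad[] = $index + 1;
        }
    }
    return $bad;
}

// Illustrative input: only the second line contains an invalid byte.
$sample = "<h2>Fine</h2>\n<h2>Caf\xE9</h2>\n<p>Also fine</p>";
print_r( toc_find_bad_utf8_lines( $sample ) );
```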
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
http://us3.php.net/manual/en/function.preg-match-all.php#86366
http://gotoanswer.com/?q=UTF-8+characters+in+preg_match_all+%28PHP%29
Help is needed from developers experienced with UTF-8 and PHP.
I'm not experienced enough in the matter to provide a solution to the original problem, but I know that parsing HTML with regex is generally considered a bad idea. Maybe switching to some DOM library will solve both the incorrect UTF-8 problem and general limitations of HTML regex-parsing?
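A rough sketch of what the DOM route could look like, assuming PHP's dom extension is enabled. The function name toc_extract_headings_dom and the returned array shape are hypothetical. How libxml treats invalid UTF-8 bytes is not verified here and would need testing against real content.

```php
<?php
/**
 * Hypothetical sketch: collect heading levels and text with DOMDocument
 * instead of a regex.
 */
function toc_extract_headings_dom( $content ) {
    $doc = new DOMDocument();
    // Collect libxml warnings internally; imperfect markup is expected in posts.
    $previous = libxml_use_internal_errors( true );
    // The meta tag hints to libxml that the fragment is UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div>' . $content . '</div>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $headings = array();
    $xpath    = new DOMXPath( $doc );
    // The union query returns nodes in document order.
    foreach ( $xpath->query( '//h1|//h2|//h3|//h4|//h5|//h6' ) as $node ) {
        $headings[] = array(
            'level' => (int) substr( $node->nodeName, 1 ),
            'text'  => trim( $node->textContent ),
        );
    }
    return $headings;
}
```

Building a DOM for every page is likely slower than a single regex pass, so this would need measuring before adopting it.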
// Replace accented Vietnamese characters with plain ASCII letters ($return holds the text being cleaned)
$aPattern = array (
"a" => "á|à|ạ|ả|ã|ă|ắ|ằ|ặ|ẳ|ẵ|â|ấ|ầ|ậ|ẩ|ẫ|Á|À|Ạ|Ả|Ã|Ă|Ắ|Ằ|Ặ|Ẳ|Ẵ|Â|Ấ|Ầ|Ậ|Ẩ|Ẫ",
"o" => "ó|ò|ọ|ỏ|õ|ô|ố|ồ|ộ|ổ|ỗ|ơ|ớ|ờ|ợ|ở|ỡ|Ó|Ò|Ọ|Ỏ|Õ|Ô|Ố|Ồ|Ộ|Ổ|Ỗ|Ơ|Ớ|Ờ|Ợ|Ở|Ỡ",
"e" => "é|è|ẹ|ẻ|ẽ|ê|ế|ề|ệ|ể|ễ|É|È|Ẹ|Ẻ|Ẽ|Ê|Ế|Ề|Ệ|Ể|Ễ",
"u" => "ú|ù|ụ|ủ|ũ|ư|ứ|ừ|ự|ử|ữ|Ú|Ù|Ụ|Ủ|Ũ|Ư|Ứ|Ừ|Ự|Ử|Ữ",
"i" => "í|ì|ị|ỉ|ĩ|Í|Ì|Ị|Ỉ|Ĩ",
"y" => "ý|ỳ|ỵ|ỷ|ỹ|Ý|Ỳ|Ỵ|Ỷ|Ỹ",
"d" => "đ|Đ",
);
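foreach ( $aPattern as $key => $value ) {
    // each() and ereg_replace() were removed from PHP; str_replace() is a byte-safe substitute
    $return = str_replace( explode( '|', $value ), $key, $return );
}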
@hieptd This code won't do what @zedzedzed needs.
dsent is correct. Additionally, what the snippet does is already handled by WordPress's remove_accents function, as mentioned in https://github.com/zedzedzed/table-of-contents-plus/issues/70 and rolled out in version 1509.
I'm troubleshooting why preg_match_all fails completely when there is a bad UTF-8 character in the subject. I'm also after options, opinions, thoughts and alternatives that aren't too slow (costly to compute). I know there have been big improvements in PHP 7, but relying on it is not an option until WordPress core requires it as a minimum.
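On the cost side, one option that adds no work for valid pages is to let PCRE do the validation and only fall back when it fails. A minimal sketch, with a hypothetical function name and an illustrative pattern rather than the plugin's real one:

```php
<?php
/**
 * Hypothetical sketch: match headings in UTF-8 mode first and fall back to
 * byte mode only when PCRE rejects the subject as invalid UTF-8.
 */
function toc_match_headings( $content ) {
    // Illustrative pattern: group 1 is the heading level, group 2 the inner HTML.
    $pattern = '/<h([1-6])[^>]*>(.*?)<\/h\1>/is';

    $found = preg_match_all( $pattern . 'u', $content, $matches, PREG_SET_ORDER );
    if ( false === $found && PREG_BAD_UTF8_ERROR === preg_last_error() ) {
        // Heading tags are plain ASCII, so byte mode can still locate them.
        // Captured text may still contain the invalid bytes.
        $found = preg_match_all( $pattern, $content, $matches, PREG_SET_ORDER );
    }
    return $found ? $matches : array();
}
```

The second pass only runs for pages that would otherwise get no TOC at all, so well-formed content pays nothing extra. Any code that later processes the captured text with the u modifier would still need to handle the bad bytes.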