table-of-contents-plus preg_match_all returns false with PREG_BAD_UTF8_ERROR (4)

preg_match_all with the u modifier returns false when there are invalid UTF-8 characters in the pattern or subject. The matching process stops, resulting in no matches and hence no TOC for the page. In every case seen so far, it has been caused by bad characters in the subject.
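
A minimal reproduction of the failure described above, as a sketch. The heading pattern here is a hypothetical stand-in and not necessarily the one the plugin uses; the subject contains a lone 0xE9 byte (an ISO-8859-1 "é"), which is not valid UTF-8.

```php
<?php
// Hypothetical heading pattern; the plugin's actual pattern may differ.
$pattern = '/<h([1-6])[^>]*>(.*?)<\/h\1>/isu';

// "\xE9" is "é" in ISO-8859-1 and is not a valid UTF-8 sequence on its own.
$bad_subject = "<h2>Caf\xE9 menu</h2>";

$result = preg_match_all($pattern, $bad_subject, $matches);
$error  = preg_last_error();

var_dump($result === false);               // true when matching was aborted
var_dump($error === PREG_BAD_UTF8_ERROR);  // true when the abort was due to bad UTF-8
```

Checking preg_last_error() right after the call is what distinguishes "no headings found" (a result of 0) from "matching aborted" (a result of false).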

Is there a way to suppress the error and continue regardless? Is there a WordPress core function that may be useful to filter the_content? Why isn't it failing for other WordPress core things, considering it is the_content after all?
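
One possible way to continue regardless is to clean the subject before matching. The sketch below assumes the mbstring extension is available; toc_sanitize_utf8 and the heading pattern are hypothetical names for illustration only. On the WordPress side, wp_check_invalid_utf8() is a core function that may serve a similar purpose, though it is not used here so that the example stays self-contained.

```php
<?php
// Hypothetical helper: returns $text unchanged when it is valid UTF-8,
// otherwise re-encodes it so that invalid byte sequences are replaced.
function toc_sanitize_utf8($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Converting UTF-8 to UTF-8 swaps invalid bytes for mbstring's
    // substitute character ("?" unless configured otherwise).
    return mb_convert_encoding($text, 'UTF-8', 'UTF-8');
}

// Hypothetical heading pattern; the plugin's actual pattern may differ.
$pattern = '/<h([1-6])[^>]*>(.*?)<\/h\1>/isu';
$content = "<h2>Caf\xE9 menu</h2><p>text</p><h3>Drinks</h3>";

$clean = toc_sanitize_utf8($content);
$count = preg_match_all($pattern, $clean, $matches);
```

The validity check runs first so that well-formed content, which should be the common case, is passed through without being copied or converted.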

Sep 02 '15 02:09 zedzedzed

In extract_headings, you can test the subject during debugging with: echo mb_check_encoding( $content );
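
A slightly more explicit version of that debugging check, as a sketch. toc_debug_encoding is a hypothetical helper name. Two details matter: echo prints an empty string for false, and without a second argument mb_check_encoding tests against mb_internal_encoding(), which may not be UTF-8.

```php
<?php
// Hypothetical debugging helper for use inside extract_headings().
function toc_debug_encoding($content)
{
    // Pass the encoding explicitly; otherwise the check uses
    // mb_internal_encoding(), which may not be UTF-8.
    $is_valid = mb_check_encoding($content, 'UTF-8');

    // var_export() renders the boolean as the text true or false.
    return 'valid UTF-8: ' . var_export($is_valid, true);
}

echo toc_debug_encoding("<h2>Caf\xC3\xA9</h2>"), "\n"; // well-formed two-byte "é"
echo toc_debug_encoding("<h2>Caf\xE9</h2>"), "\n";     // lone ISO-8859-1 byte
```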

Sep 02 '15 02:09 zedzedzed

http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

Sep 02 '15 03:09 zedzedzed

http://us3.php.net/manual/en/function.preg-match-all.php#86366
http://gotoanswer.com/?q=UTF-8+characters+in+preg_match_all+%28PHP%29

Sep 02 '15 04:09 zedzedzed

Help is needed from developers who are experienced in working with UTF-8 and PHP.

Sep 04 '15 10:09 zedzedzed

I'm not experienced enough in the matter to provide a solution to the original problem, but I know that parsing HTML with regex is generally considered a bad idea. Maybe switching to some DOM library will solve both the incorrect UTF-8 problem and general limitations of HTML regex-parsing?
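
For illustration, a rough sketch of what the DOM approach could look like with PHP's bundled DOMDocument. toc_headings_via_dom is a hypothetical name, and the sketch assumes the dom extension is installed. How libxml treats invalid UTF-8 bytes has not been verified here, so this only shows the extraction itself, not a confirmed fix for the encoding problem.

```php
<?php
// Hypothetical helper: collects heading levels and text with DOMDocument
// instead of a regular expression.
function toc_headings_via_dom($html)
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Prefixing an XML declaration is a common trick to make loadHTML()
    // read the fragment as UTF-8 rather than ISO-8859-1.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headings = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $node) {
        $headings[] = array(
            'level' => (int) substr($node->nodeName, 1), // "h2" becomes 2
            'text'  => trim($node->textContent),
        );
    }
    return $headings;
}

$headings = toc_headings_via_dom('<h2>Intro</h2><p>Body</p><h3>Details</h3>');
```

Building a DOM for every page is likely to cost more than a single regex pass, so it would be worth measuring before switching.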

Sep 04 '15 13:09 dsent

    // replace accented Vietnamese letters with their ASCII base letters
    $aPattern = array (
        "a" => "á|à|ạ|ả|ã|ă|ắ|ằ|ặ|ẳ|ẵ|â|ấ|ầ|ậ|ẩ|ẫ|Á|À|Ạ|Ả|Ã|Ă|Ắ|Ằ|Ặ|Ẳ|Ẵ|Â|Ấ|Ầ|Ậ|Ẩ|Ẫ",
        "o" => "ó|ò|ọ|ỏ|õ|ô|ố|ồ|ộ|ổ|ỗ|ơ|ớ|ờ|ợ|ở|ỡ|Ó|Ò|Ọ|Ỏ|Õ|Ô|Ố|Ồ|Ộ|Ổ|Ỗ|Ơ|Ớ|Ờ|Ợ|Ở|Ỡ",
        "e" => "é|è|ẹ|ẻ|ẽ|ê|ế|ề|ệ|ể|ễ|É|È|Ẹ|Ẻ|Ẽ|Ê|Ế|Ề|Ệ|Ể|Ễ",
        "u" => "ú|ù|ụ|ủ|ũ|ư|ứ|ừ|ự|ử|ữ|Ú|Ù|Ụ|Ủ|Ũ|Ư|Ứ|Ừ|Ự|Ử|Ữ",
        "i" => "í|ì|ị|ỉ|ĩ|Í|Ì|Ị|Ỉ|Ĩ",
        "y" => "ý|ỳ|ỵ|ỷ|ỹ|Ý|Ỳ|Ỵ|Ỷ|Ỹ",
        "d" => "đ|Đ",
    );
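    // each() was removed in PHP 8 and ereg_replace() in PHP 7;
    // str_replace() works on bytes and needs no regex engine
    foreach ($aPattern as $key => $value)
    {
        $return = str_replace(explode('|', $value), $key, $return);
    }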

Sep 05 '15 09:09 hieptd

@hieptd This code won't do what @zedzedzed needs.

Sep 05 '15 09:09 dsent

dsent is correct. Additionally, what that code snippet does is already covered by WordPress's remove_accents function, as mentioned in https://github.com/zedzedzed/table-of-contents-plus/issues/70 and rolled out in version 1509.

I'm troubleshooting why preg_match_all fails completely when there is a bad UTF-8 character in the subject. I'm also after options, opinions, thoughts and alternatives that aren't too slow (costly to compute). I know there have been big improvements in PHP 7, but waiting for its release is not an option until WordPress core requires it as a minimum.
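
One inexpensive alternative, sketched under assumptions: keep the u modifier for the normal case and retry without it only when PCRE reports bad UTF-8. toc_match_headings and the pattern are hypothetical and are not the plugin's actual code.

```php
<?php
// Hypothetical wrapper: try the UTF-8 aware pattern first, then fall back to
// a byte-oriented match (no "u" modifier) if the subject is not valid UTF-8.
function toc_match_headings($content)
{
    // Hypothetical heading pattern; the plugin's actual pattern may differ.
    $pattern = '/<h([1-6])[^>]*>(.*?)<\/h\1>/is';

    $count = preg_match_all($pattern . 'u', $content, $matches, PREG_SET_ORDER);
    if ($count === false && preg_last_error() === PREG_BAD_UTF8_ERROR) {
        // Heading tags are plain ASCII, so a byte-wise match still finds them;
        // the captured text keeps whatever invalid bytes the content had.
        $count = preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
    }
    return $count === false ? array() : $matches;
}

$found = toc_match_headings("<h2>Caf\xE9 menu</h2><h3>Drinks</h3>");
```

The second pass only runs on pages where the first one failed, so well-formed content pays no extra cost. The trade-off is that the captured heading text may still contain invalid bytes and would need cleaning before it is used for anchors.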

Sep 07 '15 03:09 zedzedzed
