lessphp icon indicating copy to clipboard operation
lessphp copied to clipboard

fix unicode content parsing

Open neomoto opened this issue 12 years ago • 6 comments

fixed exception in parsing symbols like em dash or cyrillic letters. Tests are included

neomoto avatar Apr 16 '13 19:04 neomoto

Your test compiles fine without applying the patch.

leafo avatar Apr 16 '13 20:04 leafo

Sorry, I forgot explain when it happens. If you have string overloading with this settings: mbstring.internal_encoding = UTF-8 mbstring.func_overload = 2

then you have exception on "parse error: failed at `content: '—';". Specifically, this row mbstring.internal_encoding = UTF-8 causes parse error. With only mbstring.func_overload = 2 parsing goes well.

neomoto avatar Apr 17 '13 06:04 neomoto

The mbstring.func_overload ini setting can't be changed at run-time; also, I've run into systems where the mbstring extension isn't enabled.

How about prefixing strlen() and substr() calls with '_', and then adding these helper functions?

// mb_orig_strlen() and mb_orig_substr() only exist when mbstring.func_overload & 2 is true and the mbstring extension is loaded
if (function_exists('mb_orig_strlen')) {
    function _strlen($s) {
        return mb_orig_strlen($s);
    }

    function _substr() {
        return call_user_func_array('mb_orig_substr', func_get_args());
    }
} else {
    function _strlen($s) {
        return strlen($s);
    }

    function _substr() {
        return call_user_func_array('substr', func_get_args());
    }
}

robocoder avatar Apr 17 '13 15:04 robocoder

The language itself doesn't use unicode so the parser doesn't need to be aware of it (no unicode keywords or anything). The buffer can be treated as a string of 8 bit chars and still have unicode characters pass through it fine. I'm guessing the problem is that mbstring.func_overload causes the string functions to work differently than the $buffer[x] character accessor which causes a mismatch and the parser to fail?

Does changing the ini setting with ini_set just for the parse work? If that doesn't work then another option is to replace all instances of $this->buffer[] with substr($this->buffer, ...). There's probably a noticeable performance hit to this change though.

leafo avatar Apr 17 '13 17:04 leafo

Does changing the ini setting with ini_set just for the parse work?

Yes, you are right.. Changing compile method on this

    public function compile($string, $name = null) {
        $locale = setlocale(LC_NUMERIC, 0);
        setlocale(LC_NUMERIC, "C");

        $internal_encoding = NULL;
        // only if mbstring installed
        if (function_exists('mb_orig_strlen')) {
            $internal_encoding = ini_get('mbstring.internal_encoding');
            if(!is_null($internal_encoding)){
                ini_set('mbstring.internal_encoding', NULL);
            }
        }
        $this->parser = $this->makeParser($name);
        $root = $this->parser->parse($string);

        $this->env = null;
        $this->scope = null;

        $this->formatter = $this->newFormatter();

        if (!empty($this->registeredVars)) {
            $this->injectVariables($this->registeredVars);
        }

        $this->sourceParser = $this->parser; // used for error messages
        $this->compileBlock($root);

        ob_start();
        $this->formatter->block($this->scope);
        $out = ob_get_clean();
        setlocale(LC_NUMERIC, $locale);
        if(!is_null($internal_encoding)){
            ini_set('mbstring.internal_encoding', $internal_encoding);
        }
        return $out;
    }

helps and parsing is going well even if mbstring.internal_encoding = UTF-8 is set in php.ini or .htaccess

neomoto avatar Apr 17 '13 18:04 neomoto

Thanks for fix.

unnamed777 avatar Dec 09 '13 15:12 unnamed777