PHP-CSS-Parser icon indicating copy to clipboard operation
PHP-CSS-Parser copied to clipboard

Fix encoding issues and improve parsing performance

Open skodak opened this issue 8 years ago • 10 comments

Hello,

I was wondering if you would be interested in a patch that tries to solve some charset related issues and improves parsing performance. I was trying to parse a 500kb CSS file (Bootstrap 3, Font Awesome and some custom stuff) and it took more than 70 seconds, with this patch I managed to get it down to 3.5-7 seconds. While working on the patch I have noticed a few other problems.

List of changes:

  • internally all strings are now stored as UTF-8
  • BOM is used for utf-8 detection and removed if found
  • setCharset() was removed completely - all charset conversions are done in constructor
  • the rendered text is always returned as utf-8 with all CSS strings \encoded as ASCII chars
  • the @charset atRule is switched to 'utf-8' if present
  • mbstring extension is not required any more
  • iconv is required - previously it was required only if there were \encoded unicode chars
  • all non-ascii unicode chars are \encoded in CSS strings
  • null character is completely ignored in CSS strings for security reasons

Ciao, Petr

skodak avatar Aug 20 '16 12:08 skodak

Thanks for this pull-request!

I like the approach of moving the whole internal processing to a well-defined charset (UTF-8).

But I really think the parser should, by default and unless the user explicitly requests something else, output the document in its original encoding. This means using iconv at the end to re-convert the output string back to its original encoding and preserving original value of the @charset rule.

Also, converting any non-ASCII chars to unicode escapes seems wasteful too me in a day and age where most files are served using UTF-8 and do contain some non-ASCII chars. In my opinion, this should be an option of OutputFormat that can be configured to: a) escape all non-ASCII chars, or, b) only escape the characters not representable in the output charset (this should be the default).

Ideally we could add a third option: escape all the chars that were escaped in the input and leave the chars unescaped that appeared verbatim in the input. But since we don’t store that information, this should prove difficult to implement. Maybe we could add a way for users to configure ranges of characters to always escape even if they would be representable in the output charset.

Also, one of the open issues of the parser is handling UTF-16 files (either with or without BOM). With your changes, this should finally be fairly simple to implement: the heuristic that searches for the @charset rule should also test, as a fallback, if it can find 0'@'0'c'0'h'0'a'0'r'0's'0'e'0't' (for BE) or '@'0'c'0'h'0'a'0'r'0's'0'e'0't'0 (for LE) and parse its value. UTF-8, 16-LE, 16-BE should all be covered by at least one test case, each with and without BOM.

sabberworm avatar Aug 21 '16 08:08 sabberworm

The problem with returning text in different encoding is that some characters may not be present in other encodings. I have converted only the CSS strings back to escaped form, those are the strings in double quotes (such as the content in font icon definitions - that was where I needed it). If I understand it correctly the identifiers are not touched which means they might be still in unicode. Also UTF-8 is now the default and recommended encoding in PHP, the same goes for CSS.

I am not sure if I ever saw anything encoded in UTF-16, I agree that detecting it and converting it to UTF-8 is possible and easy, on the other had I do not think UTF-16 has any use for web because it is not compatible with ascii - which means no developer or designer should ever try to create CSS in UTF-16. I was more worried about the GB2312 encoding which was required to be used in China, but luckily it appears not to be used much these days.

Anyway I will be very busy with our company project in the next few weeks and then I can work a bit more on this - adding the UTF-16, optional output encoding witch @charset change and some more tests. I'll need to study the CSS spec a bit more I guess.

Thanks a lot for having a detailed look at my patch!

skodak avatar Aug 22 '16 06:08 skodak

hey @FMCorz! I guess you might be interested in this patch...

skodak avatar Aug 22 '16 06:08 skodak

Thanks @skodak! Looks like we're solving similar problems, how surprising :). Cheers!

FMCorz avatar Aug 22 '16 06:08 FMCorz

Hi, I have reworked the patch, I got much deeper than I wanted originally - the problem was that the strict parsing did not work much for me when I started writing tests for the new UTF stuff - I had to fix some bugs first. When I started stepping through the code in debugger my hand started to hurt from the repeated clicking, so I rewrote some parts to stop calling peek() a million times.

I am not finished yet, I would like to add more test coverage to make sure there are no regressions.

Ciao

skodak avatar Aug 23 '16 19:08 skodak

Thanks.

sabberworm avatar Aug 29 '16 09:08 sabberworm

Although I don't know so much about this matter, I'd like to see this PR get merged (or refined so far that it becomes mergeable). I had some encoding troubles myself that I could fix using kind of a hack. Also I could really use the performance boost. In the current state I can only use this in combination with a file cache. Any update on this PR?

jkrzefski avatar Feb 06 '17 11:02 jkrzefski

This patch looks good. @skodak what exact changes did improve the performance that lot? I am using https://github.com/cweagans/composer-patches and want to add a patch to improve parsing performance. :-)

matzeeable avatar Dec 22 '21 08:12 matzeeable

Is there any update OR solution for this bug?

ThemeMetro avatar Aug 06 '24 03:08 ThemeMetro

@ThemeMetro Someone (TM) would need to take this PR and split it into smaller, focused PRs (with good test coverage). Would you be willing to do this?

oliverklee avatar Aug 06 '24 07:08 oliverklee