
Add/html character reference decoder

Open dmsnell opened this issue 3 years ago • 1 comments

What?

Adds a new HTML character reference decoder class, used by the tag processor, that properly decodes HTML character references (entities). It leaves invalid input untouched: where HTML calls for replacing a sequence with the replacement character U+FFFD (�), this decoder keeps the raw input so that it doesn't change anything it doesn't need to.

Why?

html_entity_decode() falls short in two ways:

  • it isn't aware of the ambiguous ampersand rule, under which the same text decodes differently depending on whether it comes from an HTML attribute or from data (or a few other contexts)
  • it doesn't properly decode all the entities and variations that HTML5 allows
    • it requires a terminating ;
    • it doesn't allow zero-extended prefixes in numeric character references, e.g. &#x0000065
    • it doesn't apply the C1 control character replacements, e.g. &#x80 should decode to € and not to the padding character U+0080
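The numeric cases in the list above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the PHP code from this PR. The names `C1_REPLACEMENTS`, `NUMERIC_REF`, and `decode_numeric_refs` are hypothetical, the replacement table is only a partial excerpt of the one in the HTML standard, and the handling of other control characters is not covered.

```python
import re

# Partial excerpt of the HTML standard's remapping of C1-range numeric
# references to Windows-1252 characters (illustration only, not the full table).
C1_REPLACEMENTS = {
    0x80: "\u20AC",  # EURO SIGN
    0x87: "\u2021",  # DOUBLE DAGGER
    0x95: "\u2022",  # BULLET
    0x9C: "\u0153",  # LATIN SMALL LIGATURE OE
}

# "&#" + decimal digits, or "&#x" + hex digits; the trailing ";" is optional.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?")


def decode_numeric_refs(text: str) -> str:
    """Decode numeric character references, leaving invalid ones as raw input."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            code_point = int(hex_digits, 16)  # leading zeros are harmless
        else:
            code_point = int(dec_digits, 10)

        if code_point in C1_REPLACEMENTS:
            return C1_REPLACEMENTS[code_point]

        # Where HTML would substitute U+FFFD (null, surrogates, out of range),
        # keep the raw input instead, as the PR description states.
        if code_point == 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)

        return chr(code_point)

    return NUMERIC_REF.sub(replace, text)
```

Under these assumptions, `decode_numeric_refs("&#x0000065")` yields `e` and `decode_numeric_refs("&#x80")` yields `€`, while `&#x10FFFC5` and `&#xdd70` pass through unchanged.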

How?

Scans an input string for character references and decodes them as numeric references or as named references, looking up names from the HTML5 spec.

When performing named character decoding, this decoder groups names by their first two letters, forming a naming "group." A group usually contains only a few named references. After finding the appropriate group, we iterate over the candidate names in that group to determine whether the input contains an exact match for one of them; if it does, that match determines which text to replace in the input string.
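A rough sketch of the two-letter grouping, combined with the ambiguous ampersand check for attributes, might look like the following. It is Python rather than the PR's PHP, `GROUPS` holds only a tiny hand-picked subset of the HTML5 names, and `match_named_ref`, `decode_named_refs`, and the `context` parameter are hypothetical names. Ordering each group longest name first is an assumption of this sketch, not something the PR description states.

```python
# Tiny subset of the HTML5 named character references, keyed by the first two
# letters of the name. Names without a trailing ";" are legacy forms that HTML
# allows unterminated. Each group is ordered longest name first (assumption).
GROUPS = {
    "am": [("amp;", "&"), ("amp", "&")],
    "no": [("notin;", "\u2209"), ("not;", "\u00AC"), ("not", "\u00AC")],
    "Aa": [("Aacute;", "\u00C1"), ("Aacute", "\u00C1")],
    "Ab": [("Abreve;", "\u0102")],
}


def match_named_ref(text: str, at: int, context: str):
    """Return (replacement, consumed_length) for the "&" at text[at], or None."""
    for name, replacement in GROUPS.get(text[at + 1:at + 3], ()):
        if not text.startswith(name, at + 1):
            continue
        end = at + 1 + len(name)
        if context == "attribute" and not name.endswith(";"):
            # Ambiguous ampersand: in an attribute, an unterminated name
            # followed by "=" or an ASCII alphanumeric is left undecoded.
            follower = text[end:end + 1]
            if follower == "=" or (follower.isascii() and follower.isalnum()):
                return None
        return replacement, end - at
    return None


def decode_named_refs(text: str, context: str = "data") -> str:
    """Decode named references; context is "data" or "attribute"."""
    out, i = [], 0
    while i < len(text):
        match = match_named_ref(text, i, context) if text[i] == "&" else None
        if match is None:
            out.append(text[i])  # copy the character through unchanged
            i += 1
        else:
            replacement, length = match
            out.append(replacement)
            i += length
    return "".join(out)
```

With this subset, `a&ampb` stays as-is in an attribute but becomes `a&b` in data, and `&Abreve` is left alone in both contexts because only `Abreve;` exists as a name.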

Testing

Need to add tests to confirm this behavior.

Some basic tests so far with the HTML5 spec's single-page.html show little or no noticeable impact on performance but slightly increased memory use, probably because of how this is string-copying the class attribute for comparison. There are optimizations we could explore to avoid this allocation.

| raw input | how it should be decoded in an attribute (this PR) | how it should be decoded in markup (this PR) | how PHP decodes it |
| --- | --- | --- | --- |
| test | test | test | test |
| &sirnotinthisfilm; | &sirnotinthisfilm; | &sirnotinthisfilm; | &sirnotinthisfilm; |
| &#x0000065 | e | e | &#x0000065 |
| a&#x1F170b | a&#x1F170b | a&#x1F170b | a&#x1F170b |
| a🅰b | a🅰b | a🅰b | a🅰b |
| a🅰b | a🅰b | a🅰b | a🅰b |
| a�b | a�b | a�b | a�b |
| &#x10FFFCt | 􏿼t | 􏿼t | &#x10FFFCt |
| &#x10FFFC5 | &#x10FFFC5 | &#x10FFFC5 | &#x10FFFC5 |
| &#1114101 | 􏿵 | 􏿵 | &#1114101 |
| &#1114101t | 􏿵t | 􏿵t | &#1114101t |
| &#11141015 | &#11141015 | &#11141015 | &#11141015 |
| a&#x1F170x;b | a🅰x;b | a🅰x;b | a&#x1F170x;b |
| a&#1337b | aԹb | aԹb | a&#1337b |
| aԹb | aԹb | aԹb | aԹb |
| a&ampb | a&ampb | a&b | a&ampb |
| a&b | a&b | a&b | a&b |
| a&notin a bind | a&notin a bind | a¬in a bind | a&notin a bind |
| a∉b | a∉b | a∉b | a∉b |
| a&notinb | a&notinb | a¬inb | a&notinb |
| Ă | Ă | Ă | Ă |
| &Abreve | &Abreve | &Abreve | &Abreve |
| Á | Á | Á | Á |
| &Aacute | Á | Á | &Aacute |
| Ála carte | Ála carte | Ála carte | Ála carte |
| &Aacutela carte | &Aacutela carte | Ála carte | &Aacutela carte |
| &Aacute la carte | Á la carte | Á la carte | &Aacute la carte |
| &Aacute=la carte | &Aacute=la carte | Á=la carte | &Aacute=la carte |
| &Aacute*la carte | Á*la carte | Á*la carte | &Aacute*la carte |
| €‡•œ | €‡•œ | €‡•œ | €‡•œ |
| &#t; | &#t; | &#t; | &#t; |
| �&#xdd70 | �&#xdd70 | �&#xdd70 | �&#xdd70 |
|  |  |  |  |
|  |  |  |  |
| &#x09&#x10&#x20 | &#x10 | &#x10 | &#x09&#x10&#x20 |
| &#x-65; | &#x-65; | &#x-65; | &#x-65; |

dmsnell avatar Jan 10 '23 21:01 dmsnell

Flaky tests detected in b8e94306c93f28b20dca4f5b6e6a15dbf00e2e9b. Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3994110924 📝 Reported issues:

  • #39787 in specs/editor/various/multi-block-selection.test.js

github-actions[bot] avatar Jan 10 '23 22:01 github-actions[bot]

Replaced by https://github.com/WordPress/wordpress-develop/pull/6387

dmsnell avatar May 24 '24 23:05 dmsnell