Add/html character reference decoder
What?
Adds a new HTML character reference decoder class, used by the Tag Processor, for properly decoding HTML character references (entities). It leaves junk input in the output: where HTML calls for replacing a sequence with the replacement character U+FFFD (�), this decoder keeps the raw input so that it doesn't change anything it doesn't need to.
Why?
html_entity_decode() is an insufficient function in two ways:
- it isn't aware of the ambiguous ampersand rule, which leads to different decoding depending on whether the encoded text comes from an HTML attribute or from data (or a few other contexts)
- it doesn't properly decode all the entities and variations allowed by HTML5:
  - it requires a terminating `;`
  - it doesn't allow zero-extended prefixes in numeric character references, e.g. a zero-padded reference to `e`
  - it doesn't decode the C1 control character replacements, e.g. the reference to code point U+0080 is `€`, not the padding character
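The attribute-versus-data difference in the first point can be sketched roughly as follows. This is a minimal illustration of the rule in the HTML specification, not code from this PR: `should_decode_legacy_match()` is a hypothetical helper, and it assumes the caller has already matched a legacy name that may omit its semicolon (such as `&not`).

```php
<?php
/**
 * Hypothetical helper, not part of this PR.
 *
 * @param string $text         Input being decoded.
 * @param int    $name_end     Byte offset just past a matched legacy name such as "&not".
 * @param bool   $in_attribute Whether $text came from an attribute value.
 * @return bool Whether the semicolon-less match should be decoded.
 */
function should_decode_legacy_match( string $text, int $name_end, bool $in_attribute ): bool {
	// In data, or at the end of the input, a legacy name decodes without its semicolon.
	if ( ! $in_attribute || $name_end >= strlen( $text ) ) {
		return true;
	}

	$next = $text[ $name_end ];

	// A semicolon terminates the reference, so it always decodes.
	if ( ';' === $next ) {
		return true;
	}

	// In attributes, "=" or an ASCII alphanumeric right after the name blocks decoding.
	return 1 !== preg_match( '/^[=A-Za-z0-9]\z/', $next );
}
```

With this sketch, `&not` followed by `in a bind` would decode in data but stay untouched in an attribute value, because the next character is a letter.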
How?
Scans an input string for character references and decodes them as numeric references or as named references, looking up names from the HTML5 spec.
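For the numeric half of that scan, here is a minimal sketch under stated assumptions. The names `utf8_from_code_point()` and `decode_numeric_reference_sketch()` are hypothetical and not taken from this PR, the C1 table is only a four-entry excerpt of the replacement table in the HTML specification, and returning `null` to mean "leave the raw input alone" is an assumption based on the "What?" section above.

```php
<?php
// Encodes a Unicode code point as UTF-8 bytes (hypothetical helper).
function utf8_from_code_point( int $cp ): string {
	if ( $cp < 0x80 ) {
		return chr( $cp );
	}
	if ( $cp < 0x800 ) {
		return chr( 0xC0 | ( $cp >> 6 ) ) . chr( 0x80 | ( $cp & 0x3F ) );
	}
	if ( $cp < 0x10000 ) {
		return chr( 0xE0 | ( $cp >> 12 ) ) . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $cp & 0x3F ) );
	}
	return chr( 0xF0 | ( $cp >> 18 ) ) . chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) ) . chr( 0x80 | ( $cp & 0x3F ) );
}

// Decodes one complete numeric reference such as "&#x00065;" (hypothetical helper).
// Returns null when the input should be left exactly as it is.
function decode_numeric_reference_sketch( string $reference ): ?string {
	// Hex digits land in group 1, decimal digits in group 2; the semicolon is optional.
	if ( 1 !== preg_match( '/^&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?\z/', $reference, $m ) ) {
		return null;
	}

	$is_hex = '' !== $m[1];

	// Leading zeros carry no meaning, so "00065" and "65" are the same number.
	$digits = ltrim( $is_hex ? $m[1] : $m[2], '0' );

	// More significant digits than U+10FFFF needs: out of range, and this avoids overflow.
	if ( strlen( $digits ) > ( $is_hex ? 6 : 7 ) ) {
		return null;
	}

	$cp = intval( $digits, $is_hex ? 16 : 10 );

	// Null, surrogates, and out-of-range values would become U+FFFD in HTML.
	if ( 0 === $cp || ( $cp >= 0xD800 && $cp <= 0xDFFF ) || $cp > 0x10FFFF ) {
		return null;
	}

	// Excerpt only: HTML maps most C1 controls to their Windows-1252 meanings.
	$c1_replacements = array( 0x80 => 0x20AC, 0x87 => 0x2021, 0x95 => 0x2022, 0x9C => 0x0153 );

	return utf8_from_code_point( $c1_replacements[ $cp ] ?? $cp );
}
```

The sketch only handles a single, already-isolated reference; finding the `&` in a longer string and splicing the result back in is left out.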
When performing named character decoding, this decoder groups names by their first two letters, forming a naming "group." Each group usually contains only a few named references. After finding the appropriate group, it iterates over the candidate names in that group to check whether the input contains an exact match for one of them; if it does, that match determines which text to replace in the input string.
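The grouping idea can be illustrated with a small sketch. Everything here is hypothetical rather than taken from this PR: `example_reference_groups()` holds only a tiny excerpt of the name table, `match_named_reference_sketch()` is an illustrative name, and listing longer names first inside each group is an assumption made so that `notin;` is tried before `not`.

```php
<?php
// Hypothetical excerpt of the lookup table, keyed by the first two letters of each name.
// Within a group, longer names come first so the longest match wins.
function example_reference_groups(): array {
	return array(
		'no' => array(
			'notin;' => "\u{2209}", // ∉
			'not;'   => "\u{AC}",   // ¬
			'not'    => "\u{AC}",   // legacy form without the semicolon
		),
		'am' => array(
			'amp;' => '&',
			'amp'  => '&',
		),
	);
}

/**
 * Tries to match a named reference starting at the "&" found at byte offset $at.
 *
 * @return array|null array( replacement text, matched byte length including "&" ),
 *                    or null when no known name matches.
 */
function match_named_reference_sketch( string $text, int $at, array $groups ): ?array {
	// The two letters after "&" select the group.
	$key = substr( $text, $at + 1, 2 );
	if ( 2 !== strlen( $key ) || ! isset( $groups[ $key ] ) ) {
		return null;
	}

	// Only the few candidates in this group are compared against the input.
	foreach ( $groups[ $key ] as $name => $replacement ) {
		if ( 0 === substr_compare( $text, $name, $at + 1, strlen( $name ) ) ) {
			return array( $replacement, 1 + strlen( $name ) );
		}
	}

	return null;
}
```

Because a group is found with a single array lookup, the per-reference cost is bounded by the size of one small group rather than by the size of the full name list.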
Testing
Need to add tests to confirm this behavior.
Some basic tests so far with the HTML5 spec's single-page.html show very little or no noticeable impact on performance but slightly increased memory use, probably because this string-copies the class attribute for comparison. There are optimizations we could explore to avoid this allocation.
| raw input | how it should be decoded in an attribute (this PR) | how it should be decoded in markup (this PR) | how PHP decodes it |
|---|---|---|---|
| test | test | test | test |
| &sirnotinthisfilm; | &sirnotinthisfilm; | &sirnotinthisfilm; | &sirnotinthisfilm; |
| e | e | e | e |
| a� | a� | a� | a� |
| a🅰b | a🅰b | a🅰b | a🅰b |
| a🅰b | a🅰b | a🅰b | a🅰b |
| a�b | a�b | a�b | a�b |
| 􏿼t | t | t | 􏿼t |
| � | � | � | � |
| 􏿵 | | | 􏿵 |
| 􏿵t | t | t | 􏿵t |
| � | � | � | � |
| a🅰x;b | a🅰x;b | a🅰x;b | a🅰x;b |
| aԹb | aԹb | aԹb | aԹb |
| aԹb | aԹb | aԹb | aԹb |
| a&b | a&b | a&b | a&b |
| a&b | a&b | a&b | a&b |
| a¬in a bind | a¬in a bind | a¬in a bind | a¬in a bind |
| a∉b | a∉b | a∉b | a∉b |
| a¬inb | a¬inb | a¬inb | a¬inb |
| Ă | Ă | Ă | Ă |
| &Abreve | &Abreve | &Abreve | &Abreve |
| Á | Á | Á | Á |
| Á | Á | Á | Á |
| Ála carte | Ála carte | Ála carte | Ála carte |
| Ála carte | Ála carte | Ála carte | Ála carte |
| Á la carte | Á la carte | Á la carte | Á la carte |
| Á=la carte | Á=la carte | Á=la carte | Á=la carte |
| Á*la carte | Á*la carte | Á*la carte | Á*la carte |
| €‡•œ | €‡•œ | €‡•œ | €‡•œ |
| &#t; | &#t; | &#t; | &#t; |
| �� | �� | �� | �� |
 |
 |
 |
 |
 |
 |
 |
 |
	  |
 |
 |
	  |
| &#x-65; | &#x-65; | &#x-65; | &#x-65; |
Flaky tests detected in b8e94306c93f28b20dca4f5b6e6a15dbf00e2e9b. Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.
🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3994110924
📝 Reported issues:
- #39787 in specs/editor/various/multi-block-selection.test.js
Replaced by https://github.com/WordPress/wordpress-develop/pull/6387