wordpress-develop Token Map: Introduce an efficient lookup and translation class for string mappings.

Trac ticket: Core-60698 (previously Core-60841)

Motivated by the need to properly transform HTML named character references (like &) I found the need for a new semantic class which can efficiently perform search and replacement of a set of static tokens. Existing patterns in the codebase are not sufficient for the HTML need, and I suspect there are other use-cases where this class would help.

In #6387 I have built a spec-compliant HTML5 text decoder utilizing this token map. The performance of the new decoder is approximately 20% slower than calling html_entity_decode() directly, except that it properly decodes what PHP can't. In fact, the performance bottleneck in that PR comes from converting into UTF-8 the sequence of digits in numeric character references, not in looking up named character references.

This proposal is adding a new class WP_Token_Map providing at least two methods for normal use:

contains( $token ) returns whether the passed string is in the set.
read_token( $text, $offset = 0, $skip_bytes ) indicates if the character sequence starting at the given offset in the passed string forms a token in the set, and if so, returns the replacement for that token. It also sets &$skip_bytes to the length of the token so that calling code .

It also provides utility functions for pre-computing these classes, as they are designed for relatively-static cases where the actual code is intended to be generated dynamically, but stay static over time. For example, HTML5 defines the set of named character references and indicates that the list shall not change or be expanded. HTML5 spec. Precomputing can save on the startup-up cost of building the optimized lookup tables.

WP_Token_Map::from_array( array $mappings ) generates a new token map from the given array of whose keys are tokens and whose values are the replacements.
to_array() dumps the set of mapping into an array suitable for passing back into from_array().
WP_Token_Map::from_precomputed_table( ...$table ) instantiates a token set from a precomputed table, skipping the computation for building the table and sorting the tokens.
precomputed_php_source_table() generates PHP source code which can be loaded with the previous static method for maintenance of the core static token sets.

Other potential uses

Converting smilies like :)
Converting emoji sequences like :happy:
Determining if a given verb/action is allowed in an API call.

Oct 03 '23 18:10 dmsnell

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, gziolo, jorbin.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Mar 25 '24 16:03 github-actions[bot]

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

Mar 25 '24 17:03 github-actions[bot]

Rebased from e98508f10d17b606cf3f8761c82bb0f8750b263d to 502e4f5bcbe1271d57efe81aebe4ac19884e6cb7

May 08 '24 23:05 dmsnell

Some CI jobs report that some static test helpers are private and it can’t access them.

May 16 '24 16:05 gziolo

I've been reviewing the implementation and I have a concern. As far as I can tell, case-insensitive behavior is unreliable on some string strings, specifically multibyte strings with upper/lowercase variants like ü.

Have you considered this?

Yes, absolutely.

I did add case insensitivity during the testing cycle kind of as an afterthought, but I like having it. It's intended to aid in the purpose for which the class was made, which is focusing on ASCII casing. This may sound strange, but almost every basic casing functions are primarily if not entirely ASCII-oriented, because proper case-folding is a much more complicated topic, and no, mb_strtoupper() is not enough, at all.

So I think the expectation that this works for ü is more subjective. My gut feeling is that we can clarify this in documentation and let it be. There's no way this is going to be properly detecting more complicated case variants, but then again, nothing in this domain does. Otherwise we can remove it, but I like how the option gives us the ability to implement functions like WP_HTML_Processor::is_special() without allocating a new string and upper-casing it.

May 17 '24 17:05 dmsnell

It seems fine to me if we don't support case insensitive non-ascii token matching as long as we're clear about it.

May 17 '24 18:05 sirreal

It seems fine to me if we don't support case insensitive non-ascii token matching as long as we're clear about it.

Ha! I went to add these and I already noted this in the function docblock.

'case-insensitive' to ignore ASCII case or default of 'case-sensitive'

Text issues are apparently always on my mind. Let me know if you still think this is unclear. I'm worried that if we try and add more documentation when it's already there right at the cursor when calling the function, then it won't be any more helpful, just more noise.

May 17 '24 18:05 dmsnell

I already noted this in the function docblock … Let me know if you still think this is unclear.

I think this is fine. I was searching for words like "mb" or "multibyte", but it seems like ASCII is correct 👍

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

8.2.0 Case conversion no longer depends on the locale set with setlocale(). Only ASCII characters will be converted.

That seems like exactly what we want, but I've looked around PHP source, the changelog, etc. and I haven't managed to find exactly what changed or examples of the different.

May 20 '24 10:05 sirreal

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

I've struggled to test this with the HTML API. in effect, some system locales could cause, I believe, a few characters to be case-folded differently, but this is such a broken system I'm not of the opinion that it's a big risk here. I'm willing to let this sit as a hypothetical bug until we have a clearer understanding of it.

May 20 '24 21:05 dmsnell

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

May 21 '24 23:05 dmsnell

I think we can add a new token-map group. It also might be worth putting in the html-api group since this is where the bigger effort lies

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't idea; I was looking for more examples and didn't see. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

May 21 '24 23:05 dmsnell

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't idea; I was looking for more examples and didn't see. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

@subgroup isn't an annotation that is a part of PHPUnit. I think this group makes sense, thanks!

May 22 '24 20:05 aaronjorbin

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

Upon reading https://github.com/whatwg/html/blob/main/FAQ.md#html-should-add-more-named-character-references I withdrawal this suggestion. I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me.

May 22 '24 20:05 aaronjorbin

I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me.

Thanks again @aaronjorbin!

It's already mentioned at the top of the README, in the character reference class (and as well in the autogenerator script where it comes from), the PR description, and the Trac ticket description.

Is that good enough or was there somewhere else you wanted it? Maybe the way I wrote it wasn't as direct or clear as it should be?

May 22 '24 21:05 dmsnell

Merged in [58188] https://github.com/wordpress/wordpress-develop/commit/bcd25b14ec0208657af1299b6df5642b5c0260a9

May 23 '24 20:05 dmsnell

wordpress-develop wordpress-develop copied to clipboard

Token Map: Introduce an efficient lookup and translation class for string mappings.

Other potential uses

Test using WordPress Playground

Some things to be aware of

wordpress-develop
wordpress-develop copied to clipboard