wordpress-develop

Introduce custom UTF-8 decoding pipeline.

Open dmsnell opened this issue 1 year ago • 1 comments

Status

This is an exploratory patch so far.

Tasks

  • [ ] Test behavior of new functions.
  • [ ] Benchmark performance.

Motivation

  • _mb_strlen() attempts to split a string on UTF-8 boundaries, falling back to assumed character patterns if it can't run Unicode-supported PCRE patterns.
  • wp_check_invalid_utf8() performs similar PCRE-based counting.
  • If sending HTML in any encoding other than UTF-8, it's important not to perform basic conversion with a function like mb_convert_encoding() or iconv() because these can lead to data loss for characters not representable in the target encoding. Although mb_encode_numericentity() exists with the multi-byte extension, having a streaming UTF-8 decoder would allow WordPress to handle proper conversion to numeric character references natively and universally. E.g. converting to …
  • URL detection and XML name parsing require detecting sequences of bytes that fall within specific Unicode ranges, like a Unicode-aware strspn(), and should stop as soon as any character falls outside those ranges.
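
To make the last two points concrete, here is a minimal sketch in Python rather than in WordPress's PHP. It is not code from this patch: `utf8_span()` and `is_name_like()` are hypothetical names, the accepted ranges are a simplified stand-in for real URL or XML name rules, and Python's standard incremental decoder stands in for the proposed pipeline. The final two lines contrast a lossy conversion with one that falls back to numeric character references.

```python
import codecs

def utf8_span(data: bytes, accept, start: int = 0) -> int:
    """Return how many bytes, from `start`, form valid UTF-8 characters
    whose code points all satisfy `accept` (a Unicode-aware strspn)."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    matched_end = start
    for i in range(start, len(data)):
        try:
            ch = decoder.decode(data[i:i + 1])  # feed one byte at a time
        except UnicodeDecodeError:
            break                               # invalid bytes end the span
        if not ch:
            continue                            # still inside a multi-byte character
        if not accept(ord(ch)):
            break                               # first character outside the range
        matched_end = i + 1
    return matched_end - start

def is_name_like(cp: int) -> bool:
    # Hypothetical, simplified range check; not the real XML Name production.
    return 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A or 0xC0 <= cp <= 0x2FF

print(utf8_span("naïve tag".encode("utf-8"), is_name_like))   # 6

# Lossy conversion versus numeric character references:
print("5 €".encode("latin-1", errors="replace"))              # b'5 ?'
print("5 €".encode("latin-1", errors="xmlcharrefreplace"))    # b'5 &#8364;'
```

The span stops at the space after "naïve" without decoding the rest of the input, which is the early-exit behavior the bullet above asks for.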

Description

WordPress relies on various extensions, regular expressions, and basic string operations when working with text potentially encoded as UTF-8.

In this patch an efficient UTF-8 decoding pipeline is introduced which can remove these dependencies, normalize all decoding behaviors, and open up new kinds of processing opportunities.

The decoder was taken from Björn Höhrmann. While it may be possible that other methods are more efficient, such as in the multi-byte extension, this decoder provides a streamable interface useful for more flexible kinds of processing: for example, whether or not to replace invalid byte sequences, zero-memory-overhead code point counting, and partially decoding strings.
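
The kinds of processing listed above can be sketched as follows. This is an illustration in Python, not the patch's PHP code, and it uses a plain branch-based decoder rather than Höhrmann's table-driven state machine. The names `scan_utf8()` and `_decode_multibyte()` are hypothetical, and replacing each maximal invalid prefix with a single U+FFFD is an assumption about the desired policy.

```python
from itertools import islice

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

def _decode_multibyte(data, at, end):
    """Decode one non-ASCII character; return (code_point, length),
    or (None, bytes_to_skip) when the sequence is invalid."""
    b0 = data[at]
    if 0xC2 <= b0 <= 0xDF:
        need, cp, lo, hi = 1, b0 & 0x1F, 0x80, 0xBF
    elif 0xE0 <= b0 <= 0xEF:
        need, cp = 2, b0 & 0x0F
        lo = 0xA0 if b0 == 0xE0 else 0x80   # reject overlong forms
        hi = 0x9F if b0 == 0xED else 0xBF   # reject surrogates
    elif 0xF0 <= b0 <= 0xF4:
        need, cp = 3, b0 & 0x07
        lo = 0x90 if b0 == 0xF0 else 0x80   # reject overlong forms
        hi = 0x8F if b0 == 0xF4 else 0xBF   # reject values above U+10FFFF
    else:
        return None, 1                      # stray continuation or invalid lead byte
    for i in range(1, need + 1):
        if at + i >= end or not (lo <= data[at + i] <= hi):
            return None, i                  # skip only the valid prefix seen so far
        cp = (cp << 6) | (data[at + i] & 0x3F)
        lo, hi = 0x80, 0xBF                 # later bytes use the full continuation range
    return cp, need + 1

def scan_utf8(data: bytes, replace_invalid: bool = True):
    """Lazily yield (code_point, byte_offset, byte_length) tuples."""
    at, end = 0, len(data)
    while at < end:
        if data[at] < 0x80:
            cp, length = data[at], 1
        else:
            cp, length = _decode_multibyte(data, at, end)
        if cp is None:
            if not replace_invalid:
                return                      # stop at the first invalid sequence
            cp = REPLACEMENT
        yield cp, at, length
        at += length

sample = "añ€😀".encode("utf-8") + b"\xff!"
print(sum(1 for _ in scan_utf8(sample)))                          # 6
print(sum(1 for _ in scan_utf8(sample, replace_invalid=False)))   # 4
print([hex(cp) for cp, _, _ in islice(scan_utf8(sample), 2)])     # ['0x61', '0xf1']
```

Because the scanner is a generator, counting code points never builds an intermediate string, and a caller can stop after any number of characters without decoding the remainder.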

dmsnell, Jun 24 '24 02:06

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

github-actions[bot], Jun 24 '24 02:06

Closed by c9166919cce1f78fc10de220a92970f9448e03dd [60768]

dmsnell, Sep 16 '25 12:09